Kai Londenberg created SPARK-21881:
--------------------------------------
Summary: Again: OOM killer may leave SparkContext in broken state
causing Connection Refused errors
Key: SPARK-21881
URL: https://issues.apache.org/jira/browse/SPARK-21881
Project: Spark
Issue Type: Bug
Components: PySpark
Affects Versions: 1.6.1, 2.0.0
Reporter: Kai Londenberg
Assignee: Alexander Shorin
Fix For: 2.1.0
When you run a memory-heavy Spark job, the Spark driver may consume more
memory than the host can provide.
In this case the OOM killer steps in and kills the spark-submit process.
pyspark.SparkContext cannot handle this situation and becomes completely
broken.
You cannot stop it, because stop() tries to call the stop method of the bound
Java context (jsc) and fails with Py4JError: that process no longer exists, and
neither does the connection to it.
You cannot start a new SparkContext either, because the broken one is still
registered as the active one and pyspark still treats SparkContext as a sort of
singleton.
The only thing you can do is shut down your IPython Notebook and start it over,
or dive into the SparkContext internal attributes and reset them manually to
their initial None state.
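The manual reset mentioned above could look roughly like the sketch below. It is only an illustration, not a fix from the Spark project: {{force_reset}} is a hypothetical helper, and the private attribute names ({{_active_spark_context}}, {{_gateway}}, {{_jvm}}, {{_lock}}, {{_jsc}}) are assumed from pyspark/context.py and may differ between versions.

```python
import threading


def force_reset(context_cls, sc=None):
    """Best-effort cleanup after the JVM behind a SparkContext has died.

    context_cls is expected to be pyspark.SparkContext; sc is the broken
    instance, if one is still at hand. Hypothetical helper, not a pyspark API.
    """
    if sc is not None:
        try:
            sc.stop()  # expected to raise once the gateway process is gone
        except Exception:
            pass  # the JVM is dead, so there is nothing left to stop
        sc._jsc = None
    # Class-level state that makes SparkContext behave like a singleton.
    # Private names assumed from pyspark/context.py; check your version.
    with getattr(context_cls, "_lock", threading.RLock()):
        context_cls._active_spark_context = None
        context_cls._gateway = None
        context_cls._jvm = None
```

In a notebook this would presumably be called as {{force_reset(SparkContext, sc)}} before creating a new context. Because it touches private state, it is a workaround at best.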
The OOM killer is just one of many possible causes: any crash of spark-submit
in the middle of a job leaves the SparkContext in a broken state.
Example error log from {{sc.stop()}} in the broken state:
{code}
ERROR:root:Exception while sending command.
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/site-packages/py4j/java_gateway.py", line 883, in send_command
    response = connection.send_command(command)
  File "/usr/local/lib/python2.7/site-packages/py4j/java_gateway.py", line 1040, in send_command
    "Error while receiving", e, proto.ERROR_ON_RECEIVE)
Py4JNetworkError: Error while receiving
ERROR:py4j.java_gateway:An error occurred while trying to connect to the Java server (127.0.0.1:59911)
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/site-packages/py4j/java_gateway.py", line 963, in start
    self.socket.connect((self.address, self.port))
  File "/usr/local/lib/python2.7/socket.py", line 224, in meth
    return getattr(self._sock,name)(*args)
error: [Errno 61] Connection refused
---------------------------------------------------------------------------
Py4JError Traceback (most recent call last)
<ipython-input-2-f154e069615b> in <module>()
----> 1 sc.stop()
/usr/local/share/spark/python/pyspark/context.py in stop(self)
360 """
361 if getattr(self, "_jsc", None):
--> 362 self._jsc.stop()
363 self._jsc = None
364 if getattr(self, "_accumulatorServer", None):
/usr/local/lib/python2.7/site-packages/py4j/java_gateway.pyc in __call__(self, *args)
1131 answer = self.gateway_client.send_command(command)
1132 return_value = get_return_value(
-> 1133 answer, self.gateway_client, self.target_id, self.name)
1134
1135 for temp_arg in temp_args:
/usr/local/share/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
43 def deco(*a, **kw):
44 try:
---> 45 return f(*a, **kw)
46 except py4j.protocol.Py4JJavaError as e:
47 s = e.java_exception.toString()
/usr/local/lib/python2.7/site-packages/py4j/protocol.pyc in get_return_value(answer, gateway_client, target_id, name)
325 raise Py4JError(
326 "An error occurred while calling {0}{1}{2}".
--> 327 format(target_id, ".", name))
328 else:
329 type = answer[1]
Py4JError: An error occurred while calling o47.stop
{code}
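The log above shows Py4J being refused a connection to the Java server on 127.0.0.1:59911. A caller could probe for that condition before calling into the context, for example with a plain TCP check like the sketch below. {{gateway_is_reachable}} is a hypothetical helper that uses only the standard library. How to obtain the gateway port from a live context is not covered here.

```python
import socket


def gateway_is_reachable(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        conn = socket.create_connection((host, port), timeout=timeout)
    except socket.error:
        # Covers "Connection refused" as well as timeouts.
        return False
    conn.close()
    return True
```

A successful connect only shows that something is listening on the port, not that the JVM behind it is healthy, so this is a cheap first check rather than a full liveness test.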