[
https://issues.apache.org/jira/browse/SPARK-10635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14802958#comment-14802958
]
Patrick Woody commented on SPARK-10635:
---------------------------------------
For a bit of motivation: we have a long-running SparkContext that essentially
acts as a query server with many clients in IPython Notebook.
We want to keep the driver on a different box from the Python kernels to
protect it from potentially resource-heavy Python processes (we've had OOM
killer issues in the past). This seems feasible via py4j, but we are running
into the above issues post-1.4.
> pyspark - running on a different host
> -------------------------------------
>
> Key: SPARK-10635
> URL: https://issues.apache.org/jira/browse/SPARK-10635
> Project: Spark
> Issue Type: Improvement
> Components: PySpark
> Reporter: Ben Duffield
>
> At various points we assume we only ever talk to a driver on the same host.
> e.g.
> https://github.com/apache/spark/blob/v1.4.1/core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala#L615
> We use pyspark to connect to an existing driver (i.e. we do not let pyspark
> launch the driver itself, but instead construct the SparkContext with the
> gateway and jsc arguments).
> There are a few reasons for this, but essentially it's to allow more
> flexibility when running in AWS.
> Before 1.3.1 we were able to monkeypatch around this:
> {code}
> def _load_from_socket(port, serializer):
>     # 'host' is the remote driver's address, captured from the
>     # enclosing scope rather than assumed to be localhost.
>     sock = socket.socket()
>     sock.settimeout(3)
>     try:
>         sock.connect((host, port))
>         rf = sock.makefile("rb", 65536)
>         for item in serializer.load_stream(rf):
>             yield item
>     finally:
>         sock.close()
>
> pyspark.rdd._load_from_socket = _load_from_socket
> {code}
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)