[ https://issues.apache.org/jira/browse/SPARK-10635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14875966#comment-14875966 ]
Davies Liu commented on SPARK-10635:
------------------------------------
In several places we assume that Python and the JVM run on the same host;
for example, sc.parallelize() dumps the data to local disk, and the JVM then
reads it back. Making PySpark work well when running on a different host
would take a lot of work, so officially we won't support this, but you are
free to hack around it as you like.
We can leave this JIRA open so others can comment on it.
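The local-disk handoff mentioned above is the crux of the same-host assumption. Here is a toy sketch of the pattern (the function names are hypothetical, not PySpark's actual internals): Python serializes a partition to a local file and passes only the path across, so the receiving process must share a filesystem with the Python process.

```python
import os
import pickle
import tempfile

def dump_partition(data):
    # Mimic sc.parallelize(): serialize the data to a file on the
    # *local* disk and hand over only its path.
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "wb") as f:
        pickle.dump(data, f)
    return path

def jvm_read_back(path):
    # Stand-in for the JVM side: it can resolve the path only if it
    # shares a filesystem with the Python process that wrote it.
    with open(path, "rb") as f:
        return pickle.load(f)

path = dump_partition([1, 2, 3])
assert jvm_read_back(path) == [1, 2, 3]
os.remove(path)
```

If the "JVM" ran on a different host, the path handed over in the last step would point at a file that does not exist there, which is why this pattern breaks in a split-host setup.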
> pyspark - running on a different host
> -------------------------------------
>
> Key: SPARK-10635
> URL: https://issues.apache.org/jira/browse/SPARK-10635
> Project: Spark
> Issue Type: Improvement
> Components: PySpark
> Reporter: Ben Duffield
>
> At various points we assume we only ever talk to a driver on the same host.
> e.g.
> https://github.com/apache/spark/blob/v1.4.1/core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala#L615
> We use pyspark to connect to an existing driver (i.e. we do not let pyspark
> launch the driver itself, but instead construct the SparkContext with the
> gateway and jsc arguments).
> There are a few reasons for this, but essentially it's to allow more
> flexibility when running in AWS.
> Before 1.3.1 we were able to monkeypatch around this:
> {code}
> import socket
>
> import pyspark.rdd
>
> host = "..."  # the remote driver's address, bound elsewhere
>
> def _load_from_socket(port, serializer):
>     sock = socket.socket()
>     sock.settimeout(3)
>     try:
>         # connect to the driver's host instead of the hard-coded local host
>         sock.connect((host, port))
>         rf = sock.makefile("rb", 65536)
>         for item in serializer.load_stream(rf):
>             yield item
>     finally:
>         sock.close()
>
> pyspark.rdd._load_from_socket = _load_from_socket
> {code}
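The connect-and-stream shape of the patched loader can be exercised end-to-end without Spark. The sketch below substitutes a toy newline-delimited serializer and a one-shot local server; both `LineSerializer` and `serve_once` are hypothetical stand-ins (PySpark's real serializer and the driver's serving socket differ), and the host is passed explicitly rather than assumed local.

```python
import socket
import threading

class LineSerializer:
    """Toy stand-in for a PySpark serializer: newline-delimited UTF-8."""
    def load_stream(self, stream):
        for line in stream:
            yield line.rstrip(b"\n").decode()

def load_from_socket(host, port, serializer):
    # Same shape as the monkeypatched loader, but the driver host is
    # an explicit argument instead of a hard-coded local address.
    sock = socket.socket()
    sock.settimeout(3)
    try:
        sock.connect((host, port))
        rf = sock.makefile("rb", 65536)
        for item in serializer.load_stream(rf):
            yield item
    finally:
        sock.close()

def serve_once(server_sock, payload):
    # One-shot "driver": accept a single connection, send, close.
    conn, _ = server_sock.accept()
    conn.sendall(payload)
    conn.close()

server = socket.socket()
server.bind(("127.0.0.1", 0))
server.listen(1)
port = server.getsockname()[1]
t = threading.Thread(target=serve_once, args=(server, b"a\nb\nc\n"))
t.start()
items = list(load_from_socket("127.0.0.1", port, LineSerializer()))
t.join()
server.close()
assert items == ["a", "b", "c"]
```

Because the loader is a generator, nothing connects until iteration begins; the real PySpark function behaves the same way, which is why the monkeypatch can be applied before any job runs.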
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)