Ben Duffield created SPARK-10635:
------------------------------------
Summary: pyspark - running on a different host
Key: SPARK-10635
URL: https://issues.apache.org/jira/browse/SPARK-10635
Project: Spark
Issue Type: Improvement
Reporter: Ben Duffield
At various points we assume that pyspark only ever talks to a driver on the
same host, e.g.
https://github.com/apache/spark/blob/v1.4.1/core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala#L615
We use pyspark to connect to an existing driver (i.e. we do not let pyspark
launch the driver itself, but instead construct the SparkContext with the
gateway and jsc arguments).
There are a few reasons for this, but essentially it's to allow more
flexibility when running in AWS.
Before 1.3.1 we were able to monkeypatch around this:
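    import socket
    import pyspark.rdd

    # `host` is not defined in this snippet; it must be set to the remote
    # driver's address before the patched function is called.
    def _load_from_socket(port, serializer):
        sock = socket.socket()
        sock.settimeout(3)
        try:
            # Connect to the remote driver rather than the local host.
            sock.connect((host, port))
            rf = sock.makefile("rb", 65536)
            for item in serializer.load_stream(rf):
                yield item
        finally:
            sock.close()

    # Replace pyspark's loader with the host-aware version.
    pyspark.rdd._load_from_socket = _load_from_socket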
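
For illustration only, here is a minimal self-contained sketch of the same idea with the host passed as an explicit argument instead of being fixed. It does not use pyspark: serve_once is a hypothetical stand-in for the driver side, and the 4-byte length prefix plus pickle framing is an assumption made for this example, not a statement about Spark's wire format.

```python
import pickle
import socket
import struct
import threading


def load_from_socket(host, port, timeout=3.0):
    """Yield length-prefixed, pickled items read from host:port."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with sock.makefile("rb", 65536) as rf:
            while True:
                header = rf.read(4)
                if len(header) < 4:
                    break  # peer closed the connection
                (length,) = struct.unpack("!i", header)
                yield pickle.loads(rf.read(length))


def serve_once(items):
    """Hypothetical driver stand-in: serve items to a single client.

    Binds to an ephemeral port on 127.0.0.1 and returns that port.
    """
    server = socket.socket()
    server.bind(("127.0.0.1", 0))
    server.listen(1)
    port = server.getsockname()[1]

    def run():
        conn, _ = server.accept()
        with conn:
            for item in items:
                payload = pickle.dumps(item)
                conn.sendall(struct.pack("!i", len(payload)) + payload)
        server.close()

    threading.Thread(target=run, daemon=True).start()
    return port


if __name__ == "__main__":
    port = serve_once([1, "two", {"three": 3}])
    for item in load_from_socket("127.0.0.1", port):
        print(item)
```

With a driver on another machine, the host argument would be that machine's address rather than 127.0.0.1, which is the flexibility this issue asks for.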