[jira] [Updated] (SPARK-10635) pyspark - running on a different host

Ben Duffield (JIRA) Wed, 16 Sep 2015 07:50:36 -0700

     [ 
https://issues.apache.org/jira/browse/SPARK-10635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Ben Duffield updated SPARK-10635:
---------------------------------
    Description: 
At various points we assume we only ever talk to a driver on the same host.
e.g. 
https://github.com/apache/spark/blob/v1.4.1/core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala#L615

We use pyspark to connect to an existing driver (i.e. we do not let pyspark 
launch the driver itself, but instead construct the SparkContext with the 
gateway and jsc arguments).

There are a few reasons for this, but essentially it's to allow more 
flexibility when running in AWS.

Before 1.3.1 we were able to monkeypatch around this:  
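{code}
import socket

import pyspark.rdd

# Address of the remote driver. Placeholder value: the original snippet
# used `host` without defining it.
host = "driver.example.com"

def _load_from_socket(port, serializer):
    # Connect to the driver's host instead of assuming localhost.
    sock = socket.socket()
    sock.settimeout(3)
    try:
        sock.connect((host, port))
        rf = sock.makefile("rb", 65536)
        for item in serializer.load_stream(rf):
            yield item
    finally:
        sock.close()

# Swap in the patched loader for pyspark's own.
pyspark.rdd._load_from_socket = _load_from_socket
{code}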
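
The monkeypatch above depends on a `host` name taken from the enclosing scope. One possible variation, sketched here with only the standard library so it can be run without Spark, builds the loader from a factory that takes the host explicitly. The names `make_load_from_socket`, `LineSerializer`, `serve_once` and `demo` are hypothetical and exist only for this illustration. The tiny local server stands in for the driver side, and the line-based serializer stands in for a real pyspark serializer.

```python
import socket
import threading


def make_load_from_socket(host, timeout=3):
    """Return a loader bound to `host` (hypothetical helper, not a pyspark API)."""
    def _load_from_socket(port, serializer):
        sock = socket.socket()
        sock.settimeout(timeout)
        try:
            sock.connect((host, port))
            # Closing the file object as well lets the socket close promptly.
            with sock.makefile("rb", 65536) as rf:
                for item in serializer.load_stream(rf):
                    yield item
        finally:
            sock.close()
    return _load_from_socket


class LineSerializer(object):
    """Stand-in serializer: one item per newline-terminated line."""
    def load_stream(self, stream):
        for line in stream:
            yield line.rstrip(b"\n")


def serve_once(payload):
    """Start a throwaway local server that sends `payload` to one client.

    Returns the port it listens on. This only simulates the driver side.
    """
    server = socket.socket()
    server.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    server.listen(1)
    port = server.getsockname()[1]

    def run():
        conn, _ = server.accept()
        try:
            conn.sendall(payload)
        finally:
            conn.close()
            server.close()

    threading.Thread(target=run, daemon=True).start()
    return port


def demo():
    port = serve_once(b"a\nb\nc\n")
    load = make_load_from_socket("127.0.0.1")
    return list(load(port, LineSerializer()))
```

With pyspark importable, the same factory could presumably be wired in as `pyspark.rdd._load_from_socket = make_load_from_socket(driver_host)`, where `driver_host` is whatever address reaches the driver. That assignment mirrors the monkeypatch above and is an assumption about the setup, not something this sketch verifies.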

> pyspark - running on a different host
> -------------------------------------
>
>                 Key: SPARK-10635
>                 URL: https://issues.apache.org/jira/browse/SPARK-10635
>             Project: Spark
>          Issue Type: Improvement
>            Reporter: Ben Duffield


