[ https://issues.apache.org/jira/browse/SPARK-677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14200686#comment-14200686 ]
Josh Rosen commented on SPARK-677:
----------------------------------
No, it's still an issue in 1.2.0:
{code}
def collect(self):
    """
    Return a list that contains all of the elements in this RDD.
    """
    with SCCallSiteSync(self.context) as css:
        bytesInJava = self._jrdd.collect().iterator()
    return list(self._collect_iterator_through_file(bytesInJava))

def _collect_iterator_through_file(self, iterator):
    # Transferring lots of data through Py4J can be slow because
    # socket.readline() is inefficient. Instead, we'll dump the data to a
    # file and read it back.
    tempFile = NamedTemporaryFile(delete=False, dir=self.ctx._temp_dir)
    tempFile.close()
    self.ctx._writeToFile(iterator, tempFile.name)
    # Read the data into Python and deserialize it:
    with open(tempFile.name, 'rb') as tempFile:
        for item in self._jrdd_deserializer.load_stream(tempFile):
            yield item
    os.unlink(tempFile.name)
{code}
> PySpark should not collect results through local filesystem
> -----------------------------------------------------------
>
> Key: SPARK-677
> URL: https://issues.apache.org/jira/browse/SPARK-677
> Project: Spark
> Issue Type: Improvement
> Components: PySpark
> Affects Versions: 0.7.0
> Reporter: Josh Rosen
>
> Py4J is slow when transferring large arrays, so PySpark currently dumps data
> to disk and reads it back in order to collect() RDDs. On large enough
> datasets, this data spills out of the buffer cache and is written to the
> physical disk, resulting in terrible performance.
> Instead, we should stream the data from Java to Python over a local socket or
> a FIFO.
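To make that proposal concrete, here is a minimal sketch of the socket-based approach: a background thread stands in for the JVM side, binding an ephemeral loopback port and writing length-framed serialized items, while the reader deserializes the stream lazily so results never touch the local filesystem. Plain pickle stands in for PySpark's framed serializers, and the names (serve_items, load_from_socket) are illustrative, not Spark's actual API:

{code}
import pickle
import socket
import struct
import threading


def serve_items(items):
    """Serve length-framed pickled items on an ephemeral loopback port.

    Returns the port number; a daemon thread handles a single connection.
    """
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))  # ephemeral port, loopback only
    server.listen(1)
    port = server.getsockname()[1]

    def handle():
        conn, _ = server.accept()
        with conn:
            for item in items:
                payload = pickle.dumps(item)
                # 4-byte big-endian length prefix, then the payload.
                conn.sendall(struct.pack(">I", len(payload)) + payload)
        server.close()

    threading.Thread(target=handle, daemon=True).start()
    return port


def load_from_socket(port):
    """Connect to the server and yield deserialized items as they arrive."""
    sock = socket.create_connection(("127.0.0.1", port))
    with sock, sock.makefile("rb") as stream:
        while True:
            header = stream.read(4)
            if not header:
                break  # server closed the connection; stream is exhausted
            (length,) = struct.unpack(">I", header)
            yield pickle.loads(stream.read(length))


if __name__ == "__main__":
    port = serve_items(range(10))
    print(list(load_from_socket(port)))  # [0, 1, ..., 9]
{code}

Because load_from_socket is a generator, items are deserialized incrementally as they come off the socket, which avoids both the temp-file write and the buffer-cache spill described above.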