[ https://issues.apache.org/jira/browse/SPARK-677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14200686#comment-14200686 ]
Josh Rosen commented on SPARK-677:
----------------------------------
No, it's still an issue in 1.2.0:
{code}
def collect(self):
    """
    Return a list that contains all of the elements in this RDD.
    """
    with SCCallSiteSync(self.context) as css:
        bytesInJava = self._jrdd.collect().iterator()
    return list(self._collect_iterator_through_file(bytesInJava))

def _collect_iterator_through_file(self, iterator):
    # Transferring lots of data through Py4J can be slow because
    # socket.readline() is inefficient. Instead, we'll dump the data to a
    # file and read it back.
    tempFile = NamedTemporaryFile(delete=False, dir=self.ctx._temp_dir)
    tempFile.close()
    self.ctx._writeToFile(iterator, tempFile.name)
    # Read the data into Python and deserialize it:
    with open(tempFile.name, 'rb') as tempFile:
        for item in self._jrdd_deserializer.load_stream(tempFile):
            yield item
    os.unlink(tempFile.name)
{code}
> PySpark should not collect results through local filesystem
> -----------------------------------------------------------
>
> Key: SPARK-677
> URL: https://issues.apache.org/jira/browse/SPARK-677
> Project: Spark
> Issue Type: Improvement
> Components: PySpark
> Affects Versions: 0.7.0
> Reporter: Josh Rosen
>
> Py4J is slow when transferring large arrays, so PySpark currently dumps data
> to disk and reads it back in order to collect() RDDs. On large enough
> datasets, this data spills out of the buffer cache and is written to the
> physical disk, resulting in terrible performance.
> Instead, we should stream the data from Java to Python over a local socket or
> a FIFO.
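To make that proposal concrete, here is a minimal sketch of the socket-based approach: a background thread stands in for the JVM side, binding an ephemeral loopback port and writing length-framed serialized items, while the reader deserializes the stream lazily so results never touch the local filesystem. Plain pickle stands in for PySpark's framed serializers, and the names (serve_items, load_from_socket) are illustrative, not Spark's actual API:

{code}
import pickle
import socket
import struct
import threading


def serve_items(items):
    """Serve length-framed pickled items on an ephemeral loopback port.

    Returns the port number; a daemon thread handles a single connection.
    """
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))  # ephemeral port, loopback only
    server.listen(1)
    port = server.getsockname()[1]

    def handle():
        conn, _ = server.accept()
        with conn:
            for item in items:
                payload = pickle.dumps(item)
                # 4-byte big-endian length prefix, then the payload.
                conn.sendall(struct.pack(">I", len(payload)) + payload)
        server.close()

    threading.Thread(target=handle, daemon=True).start()
    return port


def load_from_socket(port):
    """Connect to the server and yield deserialized items as they arrive."""
    sock = socket.create_connection(("127.0.0.1", port))
    with sock, sock.makefile("rb") as stream:
        while True:
            header = stream.read(4)
            if not header:
                break  # server closed the connection; stream is exhausted
            (length,) = struct.unpack(">I", header)
            yield pickle.loads(stream.read(length))


if __name__ == "__main__":
    port = serve_items(range(10))
    print(list(load_from_socket(port)))  # [0, 1, ..., 9]
{code}

Because load_from_socket is a generator, items are deserialized incrementally as they come off the socket, which avoids both the temp-file write and the buffer-cache spill described above.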