[ https://issues.apache.org/jira/browse/SPARK-677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14200514#comment-14200514 ]
Matei Zaharia commented on SPARK-677:
-------------------------------------
[~joshrosen] is this fixed now?
> PySpark should not collect results through local filesystem
> -----------------------------------------------------------
>
> Key: SPARK-677
> URL: https://issues.apache.org/jira/browse/SPARK-677
> Project: Spark
> Issue Type: Improvement
> Components: PySpark
> Affects Versions: 0.7.0
> Reporter: Josh Rosen
>
> Py4J is slow when transferring large arrays, so PySpark currently writes data
> to disk and reads it back in order to collect() RDDs. On large enough
> datasets, this data spills out of the buffer cache and is written to the
> physical disk, resulting in terrible performance.
> Instead, we should stream the data from Java to Python over a local socket or
> a FIFO.
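
For illustration only, the local-socket approach proposed above could be sketched as follows. This is not PySpark's actual implementation: a Python thread stands in for the JVM side, and the function names (serve_records, stream_records, read_exact) and the length-prefixed pickle framing are assumptions made for this sketch.

```python
import pickle
import socket
import struct
import threading


def serve_records(records):
    """Start a one-shot loopback server and return its port.

    Hypothetical stand-in for the JVM side: each record is sent as a
    4-byte big-endian length followed by its pickled bytes, and a
    length of -1 marks the end of the stream.
    """
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
    server.listen(1)
    port = server.getsockname()[1]

    def run():
        conn, _ = server.accept()
        with conn, conn.makefile("wb") as out:
            for record in records:
                payload = pickle.dumps(record)
                out.write(struct.pack("!i", len(payload)))
                out.write(payload)
            out.write(struct.pack("!i", -1))
        server.close()

    threading.Thread(target=run, daemon=True).start()
    return port


def read_exact(stream, n):
    """Read exactly n bytes or raise EOFError if the peer closed early."""
    data = stream.read(n)
    if len(data) != n:
        raise EOFError("socket closed before a full frame was read")
    return data


def stream_records(port):
    """Yield records from the local socket one at a time, without a temp file."""
    with socket.create_connection(("127.0.0.1", port)) as sock:
        with sock.makefile("rb") as stream:
            while True:
                (length,) = struct.unpack("!i", read_exact(stream, 4))
                if length < 0:
                    return
                yield pickle.loads(read_exact(stream, length))


if __name__ == "__main__":
    port = serve_records(range(5))
    print(list(stream_records(port)))
```

Because the reader is a generator, the Python side can consume records as they arrive instead of waiting for a complete file, and nothing touches the filesystem. A FIFO would follow the same framing with a named pipe in place of the socket.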