[jira] [Commented] (SPARK-677) PySpark should not collect results through local filesystem

Matei Zaharia (JIRA) Thu, 06 Nov 2014 09:36:09 -0800

    [ https://issues.apache.org/jira/browse/SPARK-677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14200514#comment-14200514 ]


Matei Zaharia commented on SPARK-677:
-------------------------------------

[~joshrosen] is this fixed now?

> PySpark should not collect results through local filesystem
> -----------------------------------------------------------
>
>                 Key: SPARK-677
>                 URL: https://issues.apache.org/jira/browse/SPARK-677
>             Project: Spark
>          Issue Type: Improvement
>          Components: PySpark
>    Affects Versions: 0.7.0
>            Reporter: Josh Rosen
>
> Py4J is slow when transferring large arrays, so PySpark currently dumps data 
> to the disk and reads it back in order to collect() RDDs.  On large enough 
> datasets, this data will spill from the buffer cache and write to the 
> physical disk, resulting in terrible performance.
> Instead, we should stream the data from Java to Python over a local socket or 
> a FIFO.
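
For illustration only, here is a minimal sketch of the socket-based approach the description proposes. It is not taken from the Spark codebase: the function names (serve_records, read_records, collect_via_socket), the length-prefixed pickle framing, and the -1 end-of-stream marker are all assumptions made for this example, and a Python thread stands in for the JVM side.

```python
import pickle
import socket
import struct
import threading


def serve_records(server_sock, records):
    """Stand-in for the JVM side: send each record as a length-prefixed frame."""
    conn, _ = server_sock.accept()
    with conn, conn.makefile("wb") as out:
        for record in records:
            payload = pickle.dumps(record)
            out.write(struct.pack("!i", len(payload)))  # 4-byte big-endian length
            out.write(payload)
        out.write(struct.pack("!i", -1))  # end-of-stream marker (assumed convention)


def read_records(port):
    """Python side: connect to the local port and yield records as they arrive."""
    with socket.create_connection(("127.0.0.1", port)) as sock, \
            sock.makefile("rb") as stream:
        while True:
            header = stream.read(4)
            if len(header) < 4:
                return  # sender closed the connection early
            (length,) = struct.unpack("!i", header)
            if length < 0:
                return  # end-of-stream marker reached
            yield pickle.loads(stream.read(length))


def collect_via_socket(records):
    """Hypothetical collect(): stream records over loopback instead of a temp file."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
    server.listen(1)
    port = server.getsockname()[1]
    sender = threading.Thread(target=serve_records, args=(server, records))
    sender.start()
    try:
        return list(read_records(port))
    finally:
        sender.join()
        server.close()
```

Because the reader consumes frames as they arrive, nothing in this sketch is written to the local filesystem; whether that actually outperforms the temp-file path on large datasets would need to be measured.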



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

