[ https://issues.apache.org/jira/browse/SPARK-2876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14323492#comment-14323492 ]

Davies Liu commented on SPARK-2876:
-----------------------------------

[~joshrosen] BatchedSerializer is still needed in some places (for example, 
parallelize()), and it is also the base class of AutoBatchedSerializer, so we 
cannot replace every use of it with AutoBatchedSerializer. I should have 
checked whether each occurrence of BatchedSerializer is still needed.
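
To illustrate the distinction, here is a toy sketch of the two serializers 
(simplified names and behavior, not the actual pyspark/serializers.py code): 
a plain BatchedSerializer pickles a fixed number of objects per batch, with 
UNLIMITED_BATCH_SIZE meaning the whole partition goes into a single batch, 
while AutoBatchedSerializer grows the batch size based on the serialized 
output size instead.

{code}
import pickle

UNLIMITED_BATCH_SIZE = -1

class BatchedSerializer(object):
    """Pickles objects in fixed-size batches; a batch size of
    UNLIMITED_BATCH_SIZE (-1) puts the entire partition in one batch."""

    def __init__(self, batch_size=UNLIMITED_BATCH_SIZE):
        self.batch_size = batch_size

    def _batched(self, iterator):
        if self.batch_size == UNLIMITED_BATCH_SIZE:
            yield list(iterator)  # whole partition materialized in memory
        else:
            batch = []
            for item in iterator:
                batch.append(item)
                if len(batch) == self.batch_size:
                    yield batch
                    batch = []
            if batch:
                yield batch

    def dump_stream(self, iterator, stream):
        for batch in self._batched(iterator):
            pickle.dump(batch, stream)

class AutoBatchedSerializer(BatchedSerializer):
    """Starts with a tiny batch and doubles the batch size while the
    pickled output stays small, so no single batch grows unbounded."""

    def __init__(self, best_size=1 << 16):
        BatchedSerializer.__init__(self, batch_size=1)
        self.best_size = best_size

    def dump_stream(self, iterator, stream):
        batch, size = [], 1
        for item in iterator:
            batch.append(item)
            if len(batch) == size:
                data = pickle.dumps(batch)
                stream.write(data)
                if len(data) < self.best_size:
                    size *= 2  # batches are still small: try bigger ones
                batch = []
        if batch:
            stream.write(pickle.dumps(batch))
{code}

With UNLIMITED_BATCH_SIZE, dump_stream materializes the whole partition via 
list(iterator), which is exactly the memory blow-up the issue describes.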

This JIRA should have been fixed by spilling the data to disk during 
aggregation; the problem is not in BatchedSerializer.
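
A rough sketch of that spilling idea (in PySpark the real logic lives around 
pyspark/shuffle.py; the class, names, and threshold below are illustrative 
only): instead of holding the entire aggregation in memory, dump partial 
results to disk once a threshold is reached and merge them back at the end.

{code}
import os
import pickle
import tempfile
from collections import defaultdict

class SpillingMerger(object):
    """Toy group-by-key merger that spills partial results to disk
    once too many values are held in memory (threshold is illustrative)."""

    def __init__(self, memory_limit=100000):
        self.memory_limit = memory_limit
        self.data = defaultdict(list)
        self.count = 0
        self.spills = []

    def merge(self, pairs):
        for key, value in pairs:
            self.data[key].append(value)
            self.count += 1
            if self.count >= self.memory_limit:
                self._spill()

    def _spill(self):
        # Write the in-memory partial aggregation to a temp file and reset.
        fd, path = tempfile.mkstemp()
        with os.fdopen(fd, "wb") as f:
            pickle.dump(dict(self.data), f)
        self.spills.append(path)
        self.data.clear()
        self.count = 0

    def items(self):
        # Merge the spilled partials back in; a real merger would partition
        # the keys and stream the merge instead of reloading all at once.
        merged = defaultdict(list)
        for path in self.spills:
            with open(path, "rb") as f:
                for key, values in pickle.load(f).items():
                    merged[key].extend(values)
            os.remove(path)
        for key, values in self.data.items():
            merged[key].extend(values)
        return merged.items()
{code}

This only shows the shape of the fix: memory use is bounded by the spill 
threshold rather than by the size of the partition.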

> RDD.partitionBy loads entire partition into memory
> --------------------------------------------------
>
>                 Key: SPARK-2876
>                 URL: https://issues.apache.org/jira/browse/SPARK-2876
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 1.0.1
>            Reporter: Nathan Howell
>
> {{RDD.partitionBy}} fails with an OOM in the PySpark daemon process when 
> given a relatively large dataset. The use of 
> {{BatchedSerializer(UNLIMITED_BATCH_SIZE)}} seems suspect; most other RDD 
> methods use {{self._jrdd_deserializer}}.
> {code}
> y = x.keyBy(...)
> z = y.partitionBy(512) # fails
> z = y.repartition(512) # succeeds
> {code}


