[
https://issues.apache.org/jira/browse/SPARK-2876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14323481#comment-14323481
]
Josh Rosen commented on SPARK-2876:
-----------------------------------
[~davies], I'm going through old PySpark issues and it looks like this one
could still be relevant: BatchedSerializer is still used in a few places and
its default batch size hasn't changed. Is there still a reason to call
BatchedSerializer directly anywhere, or can I just replace its uses with
AutoBatchedSerializer?
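For reference, a minimal sketch of the difference between the two (class names are from {{pyspark/serializers.py}}; constructor defaults may vary by Spark version):
{code}
from pyspark.serializers import (PickleSerializer, BatchedSerializer,
                                 AutoBatchedSerializer)

# BatchedSerializer with UNLIMITED_BATCH_SIZE (-1) pickles a whole
# partition as a single batch, so the entire partition has to fit in
# the daemon's memory at once.
unlimited = BatchedSerializer(PickleSerializer(), -1)

# AutoBatchedSerializer instead tunes the batch size from the observed
# serialized size of earlier batches, keeping peak memory bounded.
auto = AutoBatchedSerializer(PickleSerializer())
{code}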
> RDD.partitionBy loads entire partition into memory
> --------------------------------------------------
>
> Key: SPARK-2876
> URL: https://issues.apache.org/jira/browse/SPARK-2876
> Project: Spark
> Issue Type: Bug
> Components: PySpark
> Affects Versions: 1.0.1
> Reporter: Nathan Howell
>
> {{RDD.partitionBy}} fails with an OOM in the PySpark daemon process when
> given a relatively large dataset. The use of
> {{BatchedSerializer(UNLIMITED_BATCH_SIZE)}} looks suspect; most other RDD
> methods use {{self._jrdd_deserializer}}.
> {code}
> y = x.keyBy(...)
> z = y.partitionBy(512) # fails
> z = y.repartition(512) # succeeds
> {code}
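For context on why the auto-batched path avoids the OOM above, here is a minimal sketch of the adaptive strategy {{AutoBatchedSerializer}} uses (simplified; the real implementation lives in {{pyspark/serializers.py}}, and the framing and constants here are illustrative):
{code}
import itertools
import pickle
import struct

def dump_stream_auto(iterator, stream, best_size=1 << 16):
    # Start with one record per batch and grow geometrically while the
    # pickled batches stay small, so a large partition is written as a
    # stream of bounded chunks rather than one giant pickle.
    batch = 1
    iterator = iter(iterator)
    while True:
        chunk = list(itertools.islice(iterator, batch))
        if not chunk:
            break
        data = pickle.dumps(chunk)
        stream.write(struct.pack("!i", len(data)))  # length-prefixed frame
        stream.write(data)
        if len(data) < best_size:
            batch *= 2        # batches are small: double the record count
        elif len(data) > best_size * 10 and batch > 1:
            batch //= 2       # badly overshot the target: back off
{code}
With {{BatchedSerializer(UNLIMITED_BATCH_SIZE)}}, by contrast, the equivalent loop pickles the entire iterator in one call, which is what blows up the daemon on large partitions.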