[ https://issues.apache.org/jira/browse/SPARK-1343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14077011#comment-14077011 ]
Davies Liu commented on SPARK-1343:
-----------------------------------
Maybe it's related to partitionBy() with a small number of partitions: the data
in one partition is sent to the JVM as several huge bytearrays, which consume a
huge amount of memory before being written to disk, because the default
spark.serializer.objectStreamReset is too large.
Hopefully, PR-1568 and PR-1460 will fix these issues.
Closing this now; will re-open it if it happens again.
> PySpark OOMs without caching
> ----------------------------
>
> Key: SPARK-1343
> URL: https://issues.apache.org/jira/browse/SPARK-1343
> Project: Spark
> Issue Type: Bug
> Components: PySpark
> Affects Versions: 0.9.0
> Reporter: Matei Zaharia
>
> There have been several reports on the mailing list of PySpark 0.9 OOMing
> even on simple maps and counts, whereas 0.8 didn't. This may be due either
> to the batching added to serialization, or to invalid serialized data that
> makes the Java side allocate an overly large array. Needs investigating for
> 1.0.
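A hedged sketch of how the batching suspected in the description above could be
isolated when reproducing this, assuming the 0.9/1.0-era PySpark SparkContext
whose batchSize constructor parameter with a value of 1 selects the unbatched
serializer (one pickled object per record); the app name is hypothetical:

    from pyspark import SparkContext

    # batchSize=1 turns off batched serialization, so an OOM that still
    # occurs here would point away from the batching added in 0.9.
    sc = SparkContext("local", "spark-1343-batch-check", batchSize=1)
    print(sc.parallelize(range(100000)).map(lambda x: x * 2).count())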
--
This message was sent by Atlassian JIRA
(v6.2#6252)