[
https://issues.apache.org/jira/browse/SPARK-8890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14628496#comment-14628496
]
Reynold Xin commented on SPARK-8890:
------------------------------------
Let me explain it with an example.
Assuming the limit is 50. And let's say for the first 1000 records, we only see
50 partitions, but for record 1001, we see the 51st partition. In this case,
switch to sorting rest of the data. After sorting, data are clustered by
partition, and as a result, at any given point, we only need to keep 1
partition open.
> Reduce memory consumption for dynamic partition insert
> ------------------------------------------------------
>
> Key: SPARK-8890
> URL: https://issues.apache.org/jira/browse/SPARK-8890
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Reporter: Reynold Xin
> Priority: Critical
>
> Currently, InsertIntoHadoopFsRelation can run out of memory if the number of
> table partitions is large. The problem is that we open one output writer for
> each partition, and when data are randomized and when the number of
> partitions is large, we open a large number of output writers, leading to OOM.
> The solution here is to inject a sorting operation once the number of active
> partitions is beyond a certain point (e.g. 50?)
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]