[ https://issues.apache.org/jira/browse/SPARK-8890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14628620#comment-14628620 ]

Ilya Ganelin edited comment on SPARK-8890 at 7/15/15 7:55 PM:
--------------------------------------------------------------

[~yhuai] That makes sense, thank you. Wouldn't we still need to close/delete 
output buffers for keys that have been completely written? Would we, for 
example, write all values associated with key=1, close that output buffer, 
then move on to the next key, and so on?

Operational flow would become (rough sketch below):
1) Attempt to create a new outputWriter for each possible key.
2) When the maximum is exceeded, stop outputting rows.
3) Sort all remaining data by key (and persist this sorted set of InternalRow 
objects in memory).
4) One key at a time, create an outputWriter and write all rows associated with 
that key.
5) Close the outputWriter for that key, open a new outputWriter, and continue 
from step 4.
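
As a rough sketch of that flow (using simplified stand-in types rather than 
Spark's actual OutputWriter/InternalRow classes, and a hypothetical 
maxOpenWriters cap, so names here are illustrative only):

{code:scala}
// Sketch only: Row/Writer are simplified stand-ins, not Spark's
// InternalRow/OutputWriter; maxOpenWriters is a hypothetical threshold.
object DynamicPartitionWriteSketch {

  trait Row { def key: String }

  class Writer(val key: String) {
    def write(row: Row): Unit = println(s"write row to partition $key")
    def close(): Unit = println(s"close writer for $key")
  }

  val maxOpenWriters = 50 // threshold along the lines suggested in the issue

  def insert(rows: Iterator[Row]): Unit = {
    val openWriters = scala.collection.mutable.Map.empty[String, Writer]
    val overflow = scala.collection.mutable.ArrayBuffer.empty[Row]

    // Steps 1-2: write directly while the writer count stays under the cap;
    // once the cap is hit, buffer the remaining rows instead of opening writers.
    rows.foreach { row =>
      openWriters.get(row.key) match {
        case Some(w) => w.write(row)
        case None if openWriters.size < maxOpenWriters =>
          val w = new Writer(row.key)
          openWriters(row.key) = w
          w.write(row)
        case None => overflow += row
      }
    }
    openWriters.values.foreach(_.close())

    // Step 3: sort the buffered rows by key so each key's rows are contiguous.
    val sorted = overflow.sortBy(_.key)

    // Steps 4-5: a single writer open at a time, closed before the next key starts.
    var current: Option[Writer] = None
    sorted.foreach { row =>
      if (!current.exists(_.key == row.key)) {
        current.foreach(_.close())
        current = Some(new Writer(row.key))
      }
      current.get.write(row)
    }
    current.foreach(_.close())
  }
}
{code}

The point of steps 4-5 is that only one writer is ever open during the sorted 
pass, so memory stays bounded by the initial cap.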



> Reduce memory consumption for dynamic partition insert
> ------------------------------------------------------
>
>                 Key: SPARK-8890
>                 URL: https://issues.apache.org/jira/browse/SPARK-8890
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>            Reporter: Reynold Xin
>            Priority: Critical
>
> Currently, InsertIntoHadoopFsRelation can run out of memory if the number of 
> table partitions is large. The problem is that we open one output writer for 
> each partition; when the data is randomized and the number of partitions is 
> large, we open a large number of output writers, leading to OOM.
> The solution here is to inject a sorting operation once the number of active 
> partitions is beyond a certain point (e.g. 50?).


