[ https://issues.apache.org/jira/browse/SPARK-8890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14628620#comment-14628620 ]
Ilya Ganelin edited comment on SPARK-8890 at 7/15/15 7:55 PM: -------------------------------------------------------------- [~yhuai] That makes sense, thank you. Wouldn't we still need to close/delete output buffers for keys that have been completely written? Thus, would we, for example, write all values associated with key=1, then close that output buffer, write the next one etc. Operational flow would become: 1) Attempt to create new outputWriter for each possible key 2) When maximum is exceeded, stop outputting rows. 3) Sort all remaining data by key (and persist this sorted set of InternalRow objects in memory. 4) One key at a time, create an outputWriter and write all rows associated with that key 5) Close outputWriter for that key and open a new outputWriter, continue from step 4. was (Author: ilganeli): [~yhuai] That makes sense, thank you. Wouldn't we still need to close/delete output buffers for keys that have been completely written? Thus, would we, for example, write all values associated with key=1, then close that output buffer, write the next one etc. Operational flow would become: 1) Attempt to create new outputWriter for each possible key 2) When maximum is exceeded, stop outputting rows. 3) Sort all remaining data by key (and persist this sorted set of {InternalRow} objects in memory. 4) One key at a time, create an outputWriter and write all rows associated with that key 5) Close outputWriter for that key and open a new outputWriter, continue from step 4. > Reduce memory consumption for dynamic partition insert > ------------------------------------------------------ > > Key: SPARK-8890 > URL: https://issues.apache.org/jira/browse/SPARK-8890 > Project: Spark > Issue Type: Sub-task > Components: SQL > Reporter: Reynold Xin > Priority: Critical > > Currently, InsertIntoHadoopFsRelation can run out of memory if the number of > table partitions is large. The problem is that we open one output writer for > each partition, and when data are randomized and when the number of > partitions is large, we open a large number of output writers, leading to OOM. > The solution here is to inject a sorting operation once the number of active > partitions is beyond a certain point (e.g. 50?) -- This message was sent by Atlassian JIRA (v6.3.4#6332) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org