[ https://issues.apache.org/jira/browse/SPARK-8890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14628488#comment-14628488 ]

Ilya Ganelin commented on SPARK-8890:
-------------------------------------

[~rxin] I want to make sure I correctly understand your solution. Are you 
proposing that if the number of active partitions is beyond 50 we repartition 
the data into 50 partitions? 

I think we could approach this differently by creating a pool of OutputWriters 
(of size 50) and only creating a new OutputWriter once a previous partition has 
been fully written. This could be handled by blocking within the 
outputWriterForRow call when the new OutputWriter is created. 

Does that seem reasonable? Please let me know, thanks!
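A minimal sketch of the pool idea above, in Java for illustration. The names (OutputWriter, writerFor, BoundedWriterPool) are hypothetical stand-ins, not Spark's actual API; the point is that a semaphore caps the number of concurrently open writers, and a request for a new writer blocks until an existing one is closed:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.Semaphore;

public class BoundedWriterPool {
    // Illustrative stand-in for Spark's per-partition OutputWriter.
    static class OutputWriter {
        final String partition;
        OutputWriter(String partition) { this.partition = partition; }
        void close() { /* flush and release file handles */ }
    }

    private final Semaphore slots;                       // free writer slots
    private final Map<String, OutputWriter> open = new HashMap<>();

    public BoundedWriterPool(int maxWriters) {
        slots = new Semaphore(maxWriters);
    }

    // Analogue of outputWriterForRow: return the writer for a partition,
    // blocking on the semaphore if maxWriters writers are already open.
    public OutputWriter writerFor(String partition) {
        OutputWriter w = open.get(partition);
        if (w != null) return w;
        slots.acquireUninterruptibly();  // blocks until a slot frees up
        w = new OutputWriter(partition);
        open.put(partition, w);
        return w;
    }

    // Closing a partition's writer frees a slot for the next partition.
    public void closeWriter(String partition) {
        OutputWriter w = open.remove(partition);
        if (w != null) {
            w.close();
            slots.release();
        }
    }

    public int openCount() { return open.size(); }

    public static void main(String[] args) {
        BoundedWriterPool pool = new BoundedWriterPool(2);
        pool.writerFor("p=1");
        pool.writerFor("p=2");
        // Pool is full: a writer must be closed before a third can open.
        pool.closeWriter("p=1");
        pool.writerFor("p=3");
        System.out.println(pool.openCount()); // prints 2
    }
}
```

Note the caveat this sketch exposes: since a Spark task writes rows single-threaded, blocking inside writerFor only helps if some other mechanism (or policy, e.g. closing a least-recently-used writer) actually frees a slot.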

> Reduce memory consumption for dynamic partition insert
> ------------------------------------------------------
>
>                 Key: SPARK-8890
>                 URL: https://issues.apache.org/jira/browse/SPARK-8890
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>            Reporter: Reynold Xin
>            Priority: Critical
>
> Currently, InsertIntoHadoopFsRelation can run out of memory if the number of 
> table partitions is large. The problem is that we open one output writer for 
> each partition, and when data are randomized and when the number of 
> partitions is large, we open a large number of output writers, leading to OOM.
> The solution here is to inject a sorting operation once the number of active 
> partitions is beyond a certain point (e.g. 50?)
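A minimal sketch of why the sort-based solution in the description bounds memory (illustrative Java, not Spark's implementation): sorting rows by their partition key makes each partition's rows contiguous, so the writer for one partition can be closed before the next is opened, keeping at most one writer live regardless of the partition count.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class SortedPartitionWrite {
    // Returns the peak number of simultaneously open writers needed to
    // write the rows. Each row is [partitionKey, value]. After sorting by
    // partition key, rows for a partition are contiguous, so the previous
    // writer can always be closed before the next one is opened.
    public static int maxOpenWriters(List<String[]> rows) {
        rows.sort(Comparator.comparing(r -> r[0]));
        int open = 0, maxOpen = 0;
        String current = null;
        for (String[] row : rows) {
            if (!row[0].equals(current)) {
                if (current != null) open--;  // close previous partition's writer
                current = row[0];
                open++;                       // open writer for the new partition
                maxOpen = Math.max(maxOpen, open);
            }
            // ...write row with the current writer...
        }
        return maxOpen;
    }

    public static void main(String[] args) {
        List<String[]> rows = new ArrayList<>(Arrays.asList(
            new String[]{"p=2", "a"}, new String[]{"p=1", "b"},
            new String[]{"p=2", "c"}, new String[]{"p=3", "d"}));
        System.out.println(maxOpenWriters(rows)); // prints 1
    }
}
```

Without the sort, randomized data would interleave partitions and the peak could reach the full partition count, which is the OOM scenario the issue describes.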



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
