[ 
https://issues.apache.org/jira/browse/SPARK-9067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14628385#comment-14628385
 ] 

Liang-Chi Hsieh edited comment on SPARK-9067 at 7/15/15 5:21 PM:
-----------------------------------------------------------------

[~knizhnik] I have opened a PR for this problem. It would be great if you can 
test it. Thanks.


was (Author: viirya):
[~knizhnik]zhnik] I have opened a PR for this problem. It would be great If you 
can test it. Thanks.

> Memory overflow and open file limit exhaustion for NewParquetRDD+CoalescedRDD
> -----------------------------------------------------------------------------
>
>                 Key: SPARK-9067
>                 URL: https://issues.apache.org/jira/browse/SPARK-9067
>             Project: Spark
>          Issue Type: Improvement
>          Components: Input/Output
>    Affects Versions: 1.3.0, 1.4.0
>         Environment: Target system: Linux, 16 cores, 400Gb RAM
> Spark is started locally using the following command:
> {{
> spark-submit --master local[16] --driver-memory 64G --executor-cores 16 
> --num-executors 1  --executor-memory 64G
> }}
>            Reporter: konstantin knizhnik
>
> If a coalesce transformation with a small number of output partitions (in my 
> case 16) is applied to a large Parquet file (in my case about 150Gb with 215k 
> partitions), then it causes OutOfMemory exceptions (250Gb is not enough) and 
> open file limit exhaustion (with the limit set to 8k).
> The source of the problem is in the SqlNewHadoopRDD.compute method:
> {code}
>       val reader = format.createRecordReader(
>         split.serializableHadoopSplit.value, hadoopAttemptContext)
>       reader.initialize(split.serializableHadoopSplit.value, 
> hadoopAttemptContext)
>       // Register an on-task-completion callback to close the input stream.
>       context.addTaskCompletionListener(context => close())
> {code}
> The created Parquet file reader is intended to be closed at task completion 
> time. This reader holds many references to parquet.bytes.BytesInput objects, 
> which in turn reference large byte arrays (some of them several megabytes).
> Since in the case of CoalescedRDD a task completes only after processing a 
> large number of Parquet files, this causes file handle exhaustion and memory 
> overflow.
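The general direction of a fix for the leak described above is to release each reader as soon as its records are exhausted, instead of only via the task-completion callback. This is not the actual PR; it is a minimal, hypothetical Java sketch of that close-on-exhaustion pattern, with an invented `EagerCloseIterator` wrapper standing in for the per-file record iterator:

```java
import java.io.Closeable;
import java.io.IOException;
import java.util.Iterator;

// Hypothetical wrapper (not Spark code): closes the underlying reader the
// moment the iterator is exhausted, so that a task iterating over many
// coalesced Parquet splits holds at most one open reader at a time.
final class EagerCloseIterator<T> implements Iterator<T> {
    private final Iterator<T> delegate;
    private final Closeable resource;
    private boolean closed = false;

    EagerCloseIterator(Iterator<T> delegate, Closeable resource) {
        this.delegate = delegate;
        this.resource = resource;
    }

    @Override
    public boolean hasNext() {
        boolean more = !closed && delegate.hasNext();
        if (!more) {
            close(); // release the file handle and buffered byte arrays early
        }
        return more;
    }

    @Override
    public T next() {
        return delegate.next();
    }

    private void close() {
        if (closed) {
            return;
        }
        closed = true;
        try {
            resource.close();
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }
}
```

With this pattern, the task-completion callback remains only as a safety net for partially consumed iterators; in the common case each reader (and its BytesInput buffers) becomes collectable as soon as its split is drained.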



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
