[ 
https://issues.apache.org/jira/browse/SPARK-9067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9067:
-----------------------------------

    Assignee: Apache Spark

> Memory overflow and open file limit exhaustion for NewParquetRDD+CoalescedRDD
> -----------------------------------------------------------------------------
>
>                 Key: SPARK-9067
>                 URL: https://issues.apache.org/jira/browse/SPARK-9067
>             Project: Spark
>          Issue Type: Improvement
>          Components: Input/Output
>    Affects Versions: 1.3.0, 1.4.0
>         Environment: Target system: Linux, 16 cores, 400Gb RAM
> Spark is started locally using the following command:
> {{
> spark-submit --master local[16] --driver-memory 64G --executor-cores 16 
> --num-executors 1  --executor-memory 64G
> }}
>            Reporter: konstantin knizhnik
>            Assignee: Apache Spark
>
> If a coalesce transformation with a small number of output partitions (in 
> my case 16) is applied to a large Parquet file (mine is about 150Gb with 
> 215k partitions), it causes OutOfMemory exceptions (250Gb is not enough) 
> and open file limit exhaustion (with the open file limit set to 8k).
> The source of the problem is in the SqlNewHadoopRDD.compute method:
> {quote}
>       val reader = format.createRecordReader(
>         split.serializableHadoopSplit.value, hadoopAttemptContext)
>       reader.initialize(split.serializableHadoopSplit.value, 
> hadoopAttemptContext)
>       // Register an on-task-completion callback to close the input stream.
>       context.addTaskCompletionListener(context => close())
> {quote}
> The created Parquet file reader is intended to be closed at task 
> completion time. This reader holds many references to 
> parquet.bytes.BytesInput objects, which in turn reference large byte 
> arrays (some of them several megabytes in size).
> Since in the case of CoalescedRDD a task completes only after processing a 
> large number of Parquet files, this causes file handle exhaustion and 
> memory overflow.
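The failure mode and the obvious remedy can be sketched outside Spark. The following is a minimal, hypothetical model (the names FakeReader and EagerCloseIterator are mine, not Spark's): an iterator chained over many per-file readers that closes each reader as soon as it is exhausted, so at most one reader is open at a time, rather than deferring every close() to a task-completion callback as the snippet above does.

```scala
// Sketch only, not Spark code: FakeReader stands in for a per-file
// Parquet record reader; FakeReader.open counts currently open readers.
object FakeReader { var open = 0 }

class FakeReader(val rows: Seq[Int]) {
  FakeReader.open += 1
  def close(): Unit = { FakeReader.open -= 1 }
}

// Chains readers over many "files", eagerly closing the current reader
// once its rows are consumed, instead of registering all closes for
// task completion. At most one reader is ever open.
class EagerCloseIterator(files: Seq[Seq[Int]]) extends Iterator[Int] {
  private val remaining = files.iterator
  private var current: Option[FakeReader] = None
  private var rows: Iterator[Int] = Iterator.empty

  override def hasNext: Boolean = {
    while (!rows.hasNext) {
      current.foreach(_.close())   // eager close, the key difference
      current = None
      if (!remaining.hasNext) return false
      val r = new FakeReader(remaining.next())
      current = Some(r)
      rows = r.rows.iterator
    }
    true
  }

  override def next(): Int = {
    if (!hasNext) throw new NoSuchElementException
    rows.next()
  }
}
```

With a deferred-close scheme, a coalesced task touching 215k files would accumulate 215k open readers and their buffered byte arrays; with the eager-close pattern above, each reader (and its buffers) becomes reclaimable as soon as its file is drained.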



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
