[
https://issues.apache.org/jira/browse/SPARK-9067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14628385#comment-14628385
]
Liang-Chi Hsieh commented on SPARK-9067:
----------------------------------------
[~knizhnik] I have opened a PR for this problem. It would be great if you
can test it. Thanks.
> Memory overflow and open file limit exhaustion for NewParquetRDD+CoalescedRDD
> -----------------------------------------------------------------------------
>
> Key: SPARK-9067
> URL: https://issues.apache.org/jira/browse/SPARK-9067
> Project: Spark
> Issue Type: Improvement
> Components: Input/Output
> Affects Versions: 1.3.0, 1.4.0
> Environment: Target system: Linux, 16 cores, 400Gb RAM
> Spark is started locally using the following command:
> {{
> spark-submit --master local[16] --driver-memory 64G --executor-cores 16
> --num-executors 1 --executor-memory 64G
> }}
> Reporter: konstantin knizhnik
>
> If a coalesce transformation with a small number of output partitions (in my
> case 16) is applied to a large Parquet file (in my case about 150Gb with 215k
> partitions), then it causes OutOfMemory exceptions (250Gb is not enough) and
> open file limit exhaustion (with the limit set to 8k).
> The source of the problem is in the SqlNewHadoopRDD.compute method:
> {quote}
> val reader = format.createRecordReader(
> split.serializableHadoopSplit.value, hadoopAttemptContext)
> reader.initialize(split.serializableHadoopSplit.value,
> hadoopAttemptContext)
> // Register an on-task-completion callback to close the input stream.
> context.addTaskCompletionListener(context => close())
> {quote}
> The created Parquet file reader is intended to be closed at task completion
> time. This reader holds many references to parquet.bytes.BytesInput objects,
> which in turn hold references to large byte arrays (some of them are
> several megabytes).
> Because a CoalescedRDD task completes only after processing a large number
> of Parquet files, every reader stays open until then, which causes file
> handle exhaustion and memory overflow.
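
To illustrate the behaviour described in the quoted report, here is a minimal, self-contained Scala sketch. It is not the code from the PR and does not use Spark or Parquet classes: FakeReader, EagerClosingIterator and readAll are hypothetical names, and an "open reader" counter stands in for file handles and buffered byte arrays. It compares closing every reader only at the end of the task with closing each reader as soon as its records are exhausted.

```scala
object EagerCloseSketch {
  // Counters standing in for open file handles (assumption: one handle per reader).
  private var open = 0
  private var peakOpen = 0

  // Hypothetical stand-in for a Hadoop RecordReader over one input split.
  final class FakeReader(records: Seq[Int]) extends AutoCloseable {
    private val it = records.iterator
    private var closed = false
    open += 1
    peakOpen = math.max(peakOpen, open)

    def hasNext: Boolean = it.hasNext
    def next(): Int = it.next()

    // Idempotent, so a later end-of-task callback can call it again safely.
    override def close(): Unit = if (!closed) { closed = true; open -= 1 }
  }

  // Closes the reader as soon as it reports that no records are left.
  final class EagerClosingIterator(reader: FakeReader) extends Iterator[Int] {
    override def hasNext: Boolean = {
      val more = reader.hasNext
      if (!more) reader.close()
      more
    }
    override def next(): Int = reader.next()
  }

  // Mimics one coalesced task that consumes many splits one after another.
  // Returns (sum of all records, peak number of simultaneously open readers).
  def readAll(splits: Seq[Seq[Int]], eager: Boolean): (Int, Int) = {
    open = 0
    peakOpen = 0
    val readers = scala.collection.mutable.ArrayBuffer.empty[FakeReader]
    val total = splits.iterator.flatMap { split =>
      val r = new FakeReader(split)
      readers += r
      if (eager) new EagerClosingIterator(r)
      else new Iterator[Int] {
        def hasNext: Boolean = r.hasNext
        def next(): Int = r.next()
      }
    }.sum
    // Stands in for the task-completion listeners firing at the end of the task.
    readers.foreach(_.close())
    (total, peakOpen)
  }
}
```

With 100 splits, readAll(splits, eager = false) reports a peak of 100 open readers, while eager = true reports a peak of 1, because each reader is released before the next split is opened. Keeping close() idempotent means a completion callback can remain as a safety net for tasks that fail or stop reading early.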