[
https://issues.apache.org/jira/browse/SPARK-9067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14629358#comment-14629358
]
konstantin knizhnik commented on SPARK-9067:
--------------------------------------------
Sorry, but this patch doesn't help.
It looks like close is not closing everything.
For example, there is a reference *parquet.io.RecordReader<T> recordReader*
in the class *parquet.hadoop.InternalParquetRecordReader*,
and according to the hprof dump taken at the OutOfMemory exception it holds references to
an array of *parquet.column.impl.ColumnReaderImpl*; after a few indirections we
reach *parquet.column.values.bitpacking.ByteBitPackingValuesReader*, whose field
+encoded+ references a 9Mb array.
And the InternalParquetRecordReader.close method doesn't close recordReader:
{quote}
public void close() throws IOException {
  if (reader != null) {
    reader.close();
  }
}
{quote}
Unfortunately, I am not sure that this is the only place where close is not
releasing all resources. Moreover, I am not sure that even if close is
done, it clears the references to all used buffers.
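
For illustration, here is a minimal sketch of what a complete close could look like. The class and field names below are hypothetical stand-ins, not the actual Parquet implementation; the point is that close should also close recordReader and null out the reference so the large decoded buffers become collectable:

```java
// Hypothetical sketch (not the real Parquet classes) of the fix direction:
// close() should also close recordReader and drop the reference chain that
// keeps the large +encoded+ byte arrays reachable.
class RecordReaderCloseSketch {
    static class ColumnBuffers {
        byte[] encoded = new byte[9 * 1024 * 1024]; // stands in for the 9Mb array
        boolean closed = false;
        void close() { closed = true; encoded = null; } // release the buffer
    }

    static class FileReaderStub {
        boolean closed = false;
        void close() { closed = true; }
    }

    FileReaderStub reader = new FileReaderStub();
    ColumnBuffers recordReader = new ColumnBuffers();

    public void close() {
        if (reader != null) {
            reader.close();       // what the current close() already does
            reader = null;
        }
        if (recordReader != null) {
            recordReader.close(); // the missing step
            recordReader = null;  // clear the reference to the buffers
        }
    }
}
```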
> Memory overflow and open file limit exhaustion for NewParquetRDD+CoalescedRDD
> -----------------------------------------------------------------------------
>
> Key: SPARK-9067
> URL: https://issues.apache.org/jira/browse/SPARK-9067
> Project: Spark
> Issue Type: Improvement
> Components: Input/Output
> Affects Versions: 1.3.0, 1.4.0
> Environment: Target system: Linux, 16 cores, 400Gb RAM
> Spark is started locally using the following command:
> {{
> spark-submit --master local[16] --driver-memory 64G --executor-cores 16
> --num-executors 1 --executor-memory 64G
> }}
> Reporter: konstantin knizhnik
>
> If a coalesce transformation with a small number of output partitions (in my case
> 16) is applied to a large Parquet file (mine is about 150Gb with 215k
> partitions), then it causes OutOfMemory exceptions (250Gb is not enough) and
> open file limit exhaustion (with the limit set to 8k).
> The source of the problem is in the SqlNewHadoopRDD.compute method:
> {quote}
> val reader = format.createRecordReader(
>   split.serializableHadoopSplit.value, hadoopAttemptContext)
> reader.initialize(split.serializableHadoopSplit.value, hadoopAttemptContext)
> // Register an on-task-completion callback to close the input stream.
> context.addTaskCompletionListener(context => close())
> {quote}
> The created Parquet file reader is intended to be closed at task completion time.
> This reader holds many references to parquet.bytes.BytesInput objects,
> which in turn contain references to large byte arrays (some of them
> several megabytes).
> Since in the case of CoalescedRDD a task completes only after processing a
> large number of Parquet files, this causes file handle exhaustion and memory
> overflow.
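
The interaction described in the report can be sketched as follows. The classes below are hypothetical stand-ins (not Spark's or Parquet's API); the sketch shows the fix direction of closing each per-file reader eagerly, as soon as its rows are exhausted, so a coalesced task holds at most one reader open at a time instead of one per underlying partition file:

```java
import java.util.List;

// Hypothetical illustration: close each per-file reader as soon as it is
// exhausted, instead of deferring every close to the task-completion callback
// (where a coalesced task may accumulate thousands of open readers).
class EagerCloseSketch {
    static class FileReader {
        boolean closed = false;
        int rowsLeft = 3;
        boolean nextRow() { return rowsLeft-- > 0; }
        void close() { closed = true; }
    }

    // Process many "files" in one task; at most one reader is open at a time.
    static int processTask(List<FileReader> readers) {
        int rows = 0;
        for (FileReader r : readers) {
            try {
                while (r.nextRow()) rows++;
            } finally {
                r.close(); // eager close: release handle and buffers now
            }
        }
        return rows;
    }
}
```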
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]