[
https://issues.apache.org/jira/browse/SPARK-9067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14629358#comment-14629358
]
konstantin knizhnik commented on SPARK-9067:
--------------------------------------------
Sorry, but this patch doesn't help.
It looks like close is not closing everything.
For example, there is a reference *parquet.io.RecordReader<T> recordReader*
in the class *parquet.hadoop.InternalParquetRecordReader*,
and according to the hprof dump taken at the OutOfMemory exception it holds references to
an array of *parquet.column.impl.ColumnReaderImpl*; after a few indirections we
reach *parquet.column.values.bitpacking.ByteBitPackingValuesReader*, whose field
+encoded+ references a 9Mb array.
And the InternalParquetRecordReader.close method doesn't close recordReader:
{quote}
public void close() throws IOException {
  if (reader != null) {
    reader.close();
  }
}
{quote}
Unfortunately, I am not sure that this is the only place where close is not
releasing all resources. Moreover, I am not sure that even if close is
done, it clears the references to all used buffers.
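
For illustration, here is a minimal sketch of what a complete close could look like. The class and field names below are hypothetical stand-ins, not the actual Parquet implementation; the point is that close should also close recordReader and null out the reference so the large decoded buffers become collectable:

```java
// Hypothetical sketch (not the real Parquet classes) of the fix direction:
// close() should also close recordReader and drop the reference chain that
// keeps the large +encoded+ byte arrays reachable.
class RecordReaderCloseSketch {
    static class ColumnBuffers {
        byte[] encoded = new byte[9 * 1024 * 1024]; // stands in for the 9Mb array
        boolean closed = false;
        void close() { closed = true; encoded = null; } // release the buffer
    }

    static class FileReaderStub {
        boolean closed = false;
        void close() { closed = true; }
    }

    FileReaderStub reader = new FileReaderStub();
    ColumnBuffers recordReader = new ColumnBuffers();

    public void close() {
        if (reader != null) {
            reader.close();       // what the current close() already does
            reader = null;
        }
        if (recordReader != null) {
            recordReader.close(); // the missing step
            recordReader = null;  // clear the reference to the buffers
        }
    }
}
```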
> Memory overflow and open file limit exhaustion for NewParquetRDD+CoalescedRDD
> -----------------------------------------------------------------------------
>
> Key: SPARK-9067
> URL: https://issues.apache.org/jira/browse/SPARK-9067
> Project: Spark
> Issue Type: Improvement
> Components: Input/Output
> Affects Versions: 1.3.0, 1.4.0
> Environment: Target system: Linux, 16 cores, 400Gb RAM
> Spark is started locally using the following command:
> {{
> spark-submit --master local[16] --driver-memory 64G --executor-cores 16
> --num-executors 1 --executor-memory 64G
> }}
> Reporter: konstantin knizhnik
>
> If a coalesce transformation with a small number of output partitions (in my case
> 16) is applied to a large Parquet file (mine is about 150Gb with 215k
> partitions), then it causes OutOfMemory exceptions (250Gb is not enough) and
> open file limit exhaustion (with the limit set to 8k).
> The source of the problem is in the SqlNewHadoopRDD.compute method:
> {quote}
> val reader = format.createRecordReader(
>   split.serializableHadoopSplit.value, hadoopAttemptContext)
> reader.initialize(split.serializableHadoopSplit.value, hadoopAttemptContext)
> // Register an on-task-completion callback to close the input stream.
> context.addTaskCompletionListener(context => close())
> {quote}
> The created Parquet file reader is intended to be closed at task completion time.
> This reader holds many references to parquet.bytes.BytesInput objects,
> which in turn contain references to large byte arrays (some of them
> several megabytes).
> Since in the case of CoalescedRDD a task completes only after processing a
> large number of Parquet files, this causes file handle exhaustion and memory
> overflow.
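
The interaction described in the report can be sketched as follows. The classes below are hypothetical stand-ins (not Spark's or Parquet's API); the sketch shows the fix direction of closing each per-file reader eagerly, as soon as its rows are exhausted, so a coalesced task holds at most one reader open at a time instead of one per underlying partition file:

```java
import java.util.List;

// Hypothetical illustration: close each per-file reader as soon as it is
// exhausted, instead of deferring every close to the task-completion callback
// (where a coalesced task may accumulate thousands of open readers).
class EagerCloseSketch {
    static class FileReader {
        boolean closed = false;
        int rowsLeft = 3;
        boolean nextRow() { return rowsLeft-- > 0; }
        void close() { closed = true; }
    }

    // Process many "files" in one task; at most one reader is open at a time.
    static int processTask(List<FileReader> readers) {
        int rows = 0;
        for (FileReader r : readers) {
            try {
                while (r.nextRow()) rows++;
            } finally {
                r.close(); // eager close: release handle and buffers now
            }
        }
        return rows;
    }
}
```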
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]