konstantin knizhnik created SPARK-9067:
------------------------------------------

             Summary: Memory overflow and open file limit exhaustion for 
NewParquetRDD+CoalescedRDD
                 Key: SPARK-9067
                 URL: https://issues.apache.org/jira/browse/SPARK-9067
             Project: Spark
          Issue Type: Improvement
          Components: Input/Output
    Affects Versions: 1.4.0, 1.3.0
         Environment: Target system: Linux, 16 cores, 400Gb RAM
Spark is started locally using the following command:
{{
spark-submit --master local[16] --driver-memory 64G --executor-cores 16 
--num-executors 1  --executor-memory 64G
}}
            Reporter: konstantin knizhnik


If a coalesce transformation with a small number of output partitions (16 in my 
case) is applied to a large Parquet file (in my case about 150Gb with 215k 
partitions), it causes OutOfMemory exceptions (250Gb is not enough) and open 
file limit exhaustion (with the limit set to 8k).

The source of the problem is in the SqlNewHadoopRDD.compute method:
{quote}
      val reader = format.createRecordReader(
        split.serializableHadoopSplit.value, hadoopAttemptContext)
      reader.initialize(split.serializableHadoopSplit.value, hadoopAttemptContext)

      // Register an on-task-completion callback to close the input stream.
      context.addTaskCompletionListener(context => close())
{quote}

The Parquet file reader created here is closed only at task completion time. 
The reader holds many references to parquet.bytes.BytesInput objects, which in 
turn reference large byte arrays (some of them several megabytes in size).
Because a CoalescedRDD task completes only after it has processed a large 
number of Parquet files, this causes file handle exhaustion and memory 
overflow.
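
To make the effect concrete, here is a minimal sketch in plain Scala (no Spark classes) that models one coalesced task. Everything in it is hypothetical and for illustration only: FakeReader stands in for the Parquet record reader, the listener buffer stands in for context.addTaskCompletionListener, and the input partitions are assumed to be spread evenly over the output partitions, which the real CoalescedRDD does not guarantee. It compares closing readers at task completion with closing each reader as soon as its split has been consumed.

```scala
import scala.collection.mutable.ArrayBuffer

// Simplified model of the reader lifetime problem; not Spark code.
object ReaderLifetimeSketch {

  // Hypothetical stand-in for a record reader that owns a file handle and buffers.
  final class FakeReader(onClose: () => Unit) {
    private var closed = false
    // Idempotent close, so an eager close followed by the listener is harmless.
    def close(): Unit = if (!closed) { closed = true; onClose() }
  }

  /** Peak number of simultaneously open readers within a single task. */
  def peakOpenReaders(splitsInTask: Int, closeEagerly: Boolean): Int = {
    var open = 0
    var peak = 0
    // Stand-in for the task completion listeners registered on the task context.
    val completionListeners = ArrayBuffer.empty[() => Unit]

    for (_ <- 0 until splitsInTask) {
      val reader = new FakeReader(() => { open -= 1 })
      open += 1
      peak = math.max(peak, open)
      // Mirrors: context.addTaskCompletionListener(context => close())
      completionListeners += (() => reader.close())
      // ... the records of this split would be consumed here ...
      if (closeEagerly) reader.close()
    }

    // Task completion: all registered listeners run now.
    completionListeners.foreach(listener => listener())
    assert(open == 0)
    peak
  }

  def main(args: Array[String]): Unit = {
    val inputPartitions = 215000 // "215k partitions" from the report
    val outputPartitions = 16    // coalesce target from the report
    // Assumption: input partitions are divided evenly between output partitions.
    val splitsPerTask = math.ceil(inputPartitions.toDouble / outputPartitions).toInt

    println(s"splits per coalesced task: $splitsPerTask")
    println("peak open readers, closed at task completion: " +
      peakOpenReaders(splitsPerTask, closeEagerly = false))
    println("peak open readers, closed after each split: " +
      peakOpenReaders(splitsPerTask, closeEagerly = true))
  }
}
```

Under the even-split assumption each of the 16 tasks handles about 13.4k splits. In this model, closing only at task completion leaves one reader open per processed split, which is already above an 8k file limit for a single task, while closing after each split keeps the peak at one reader per task.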




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

