Chaozhong Yang created PARQUET-222:
--------------------------------------

             Summary: parquet writer runs into OOM during writing when calling DataFrame.saveAsParquetFile
                 Key: PARQUET-222
                 URL: https://issues.apache.org/jira/browse/PARQUET-222
             Project: Parquet
          Issue Type: Bug
          Components: parquet-mr
    Affects Versions: 1.6.0
            Reporter: Chaozhong Yang


In Spark SQL, there is a function `saveAsParquetFile` on DataFrame (SchemaRDD in older releases). That function calls into parquet-mr, and the write sometimes fails with an OutOfMemoryError thrown from parquet-mr.
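For reference, a minimal sketch of the kind of call that triggers the failure (the case class, data, and output path here are illustrative assumptions, not taken from the failing job; a Spark 1.3-style spark-shell API with `sc` and `sqlContext` is assumed):

    // Illustrative reproduction sketch, not the actual failing job.
    case class Record(id: Int, value: String)

    val rdd = sc.parallelize(1 to 1000000).map(i => Record(i, "v" + i))
    val df  = sqlContext.createDataFrame(rdd)  // a SchemaRDD on older releases

    // saveAsParquetFile hands each partition to parquet-mr's
    // ParquetOutputFormat, whose record-writer setup performs the
    // allocations seen in the trace below.
    df.saveAsParquetFile("/tmp/parquet-222-repro")

When the record writer is being initialized, the task dies with the following stack trace: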

[WARN] [task-result-getter-3] 03-19 11:17:58,274 [TaskSetManager] - Lost task 0.2 in stage 137.0 (TID 309, hb1.avoscloud.com): java.lang.OutOfMemoryError: Java heap space
        at parquet.column.values.dictionary.IntList.initSlab(IntList.java:87)
        at parquet.column.values.dictionary.IntList.<init>(IntList.java:83)
        at parquet.column.values.dictionary.DictionaryValuesWriter.<init>(DictionaryValuesWriter.java:85)
        at parquet.column.values.dictionary.DictionaryValuesWriter$PlainIntegerDictionaryValuesWriter.<init>(DictionaryValuesWriter.java:549)
        at parquet.column.ParquetProperties.getValuesWriter(ParquetProperties.java:88)
        at parquet.column.impl.ColumnWriterImpl.<init>(ColumnWriterImpl.java:74)
        at parquet.column.impl.ColumnWriteStoreImpl.newMemColumn(ColumnWriteStoreImpl.java:68)
        at parquet.column.impl.ColumnWriteStoreImpl.getColumnWriter(ColumnWriteStoreImpl.java:56)
        at parquet.io.MessageColumnIO$MessageColumnIORecordConsumer.<init>(MessageColumnIO.java:178)
        at parquet.io.MessageColumnIO.getRecordWriter(MessageColumnIO.java:369)
        at parquet.hadoop.InternalParquetRecordWriter.initStore(InternalParquetRecordWriter.java:108)
        at parquet.hadoop.InternalParquetRecordWriter.<init>(InternalParquetRecordWriter.java:94)
        at parquet.hadoop.ParquetRecordWriter.<init>(ParquetRecordWriter.java:64)
        at parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:282)
        at parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:252)
        at org.apache.spark.sql.parquet.InsertIntoParquetTable.org$apache$spark$sql$parquet$InsertIntoParquetTable$$writeShard$1(ParquetTableOperations.scala:304)
        at org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:325)
        at org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:325)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
        at org.apache.spark.scheduler.Task.run(Task.scala:56)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:662)
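Note that the allocation fails while the column writers are being constructed: ColumnWriterImpl asks ParquetProperties for a DictionaryValuesWriter, whose IntList slabs are what run the heap out of memory. One workaround worth testing (just a sketch, assuming dictionary encoding is not required for this data; not a confirmed fix) is to disable dictionary encoding through the standard parquet-mr property before writing:

    // Workaround sketch: disable dictionary encoding so that no
    // DictionaryValuesWriter (and none of its IntList slabs) is
    // allocated. Whether this avoids the OOM here is untested.
    sc.hadoopConfiguration.set("parquet.enable.dictionary", "false")
    df.saveAsParquetFile("/tmp/parquet-222-no-dict")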

By the way, there is a similar issue, https://issues.apache.org/jira/browse/PARQUET-99, but its reporter has already closed it and marked it as resolved.


