[
https://issues.apache.org/jira/browse/PARQUET-222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14580339#comment-14580339
]
Cheng Lian edited comment on PARQUET-222 at 6/10/15 4:39 PM:
-------------------------------------------------------------
Hey [~phatak.dev], finally got some time to try 1.3.1 and reproduced this OOM.
While trying this case with 1.4, it got stuck in the query planner, so I was
adjusting {{--driver-memory}}. In the case of 1.3.1, by tuning
{{--executor-memory}}, I can see two kinds of exceptions. The first one is
exactly the same as what you saw. In my test code, I create 26k INT columns,
so Parquet tries to initialize 26k column writers, each of which allocates a
default slab (an {{int[]}}) with 64k elements. This takes at least {{26k * 64k
* 4b = 6.34gb}} of memory.
After increasing executor memory to 10g, I saw a similar exception thrown from
{{RunLengthBitPackingHybridEncoder}}. I guess Parquet is trying to allocate an
RLE encoder for each column here to perform compression (not 100% sure about
this for now). Similarly, each encoder initializes a default slab (a
{{byte[]}}) with at least 64k elements, and that's another {{26k * 64k * 1b =
1.6gb}} of memory.
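For context, the run-length idea behind that encoder is simple. A toy sketch
(purely illustrative; the real {{RunLengthBitPackingHybridEncoder}} also
bit-packs values and writes into the per-column {{byte[]}} slabs discussed
above):

```python
def rle_encode(values):
    """Collapse runs of repeated values into (value, run_length) pairs."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1          # extend the current run
        else:
            runs.append([v, 1])       # start a new run
    return [(v, n) for v, n in runs]

print(rle_encode([7, 7, 7, 0, 0, 7]))  # [(7, 3), (0, 2), (7, 1)]
```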
I only have a laptop for now, so I'm not sure how much memory it takes to
write such a wide table. But essentially Parquet needs to pre-allocate some
memory for each column to buffer and compress data, and 26k columns altogether
simply eat too much memory here. That's why your table still causes an OOM
even though it has only a single row.
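The slab arithmetic above can be sketched in a few lines of Python. This is a
rough back-of-the-envelope estimate only; the 26k column count and 64k slab
size are the figures from my test above, not Parquet constants:

```python
# Estimate Parquet's minimum per-column write buffers for a very wide table,
# assuming one default 64k-element slab per column writer / per RLE encoder.
NUM_COLUMNS = 26_000
SLAB_ELEMENTS = 64 * 1024

int_slab_bytes = NUM_COLUMNS * SLAB_ELEMENTS * 4  # int[] slabs (4 bytes/elem)
rle_slab_bytes = NUM_COLUMNS * SLAB_ELEMENTS * 1  # byte[] slabs (1 byte/elem)

GiB = 1024 ** 3
print(f"column writer slabs: {int_slab_bytes / GiB:.2f} GiB")  # ~6.35 GiB
print(f"RLE encoder slabs:   {rle_slab_bytes / GiB:.2f} GiB")  # ~1.59 GiB
```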
> parquet writer runs into OOM during writing when calling
> DataFrame.saveAsParquetFile in Spark SQL
> -------------------------------------------------------------------------------------------------
>
> Key: PARQUET-222
> URL: https://issues.apache.org/jira/browse/PARQUET-222
> Project: Parquet
> Issue Type: Bug
> Components: parquet-mr
> Affects Versions: 1.6.0
> Reporter: Chaozhong Yang
> Original Estimate: 336h
> Remaining Estimate: 336h
>
> In Spark SQL, there is a function {{saveAsParquetFile}} in {{DataFrame}} or
> {{SchemaRDD}}. That function calls methods in parquet-mr, and sometimes it
> will fail due to an OOM error thrown by parquet-mr. We can see the exception
> stack trace as follows:
> {noformat}
> [WARN] [task-result-getter-3] 03-19 11:17:58,274 [TaskSetManager] - Lost task 0.2 in stage 137.0 (TID 309, hb1.avoscloud.com): java.lang.OutOfMemoryError: Java heap space
> at parquet.column.values.dictionary.IntList.initSlab(IntList.java:87)
> at parquet.column.values.dictionary.IntList.<init>(IntList.java:83)
> at parquet.column.values.dictionary.DictionaryValuesWriter.<init>(DictionaryValuesWriter.java:85)
> at parquet.column.values.dictionary.DictionaryValuesWriter$PlainIntegerDictionaryValuesWriter.<init>(DictionaryValuesWriter.java:549)
> at parquet.column.ParquetProperties.getValuesWriter(ParquetProperties.java:88)
> at parquet.column.impl.ColumnWriterImpl.<init>(ColumnWriterImpl.java:74)
> at parquet.column.impl.ColumnWriteStoreImpl.newMemColumn(ColumnWriteStoreImpl.java:68)
> at parquet.column.impl.ColumnWriteStoreImpl.getColumnWriter(ColumnWriteStoreImpl.java:56)
> at parquet.io.MessageColumnIO$MessageColumnIORecordConsumer.<init>(MessageColumnIO.java:178)
> at parquet.io.MessageColumnIO.getRecordWriter(MessageColumnIO.java:369)
> at parquet.hadoop.InternalParquetRecordWriter.initStore(InternalParquetRecordWriter.java:108)
> at parquet.hadoop.InternalParquetRecordWriter.<init>(InternalParquetRecordWriter.java:94)
> at parquet.hadoop.ParquetRecordWriter.<init>(ParquetRecordWriter.java:64)
> at parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:282)
> at parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:252)
> at org.apache.spark.sql.parquet.InsertIntoParquetTable.org$apache$spark$sql$parquet$InsertIntoParquetTable$$writeShard$1(ParquetTableOperations.scala:304)
> at org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:325)
> at org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:325)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
> at org.apache.spark.scheduler.Task.run(Task.scala:56)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
> at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
> at java.lang.Thread.run(Thread.java:662)
> {noformat}
> By the way, there is a similar issue,
> https://issues.apache.org/jira/browse/PARQUET-99, but the reporter has
> closed it and marked it as resolved.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)