[
https://issues.apache.org/jira/browse/PARQUET-222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14577257#comment-14577257
]
Cheng Lian edited comment on PARQUET-222 at 6/8/15 2:33 PM:
------------------------------------------------------------
Hey [~phatak.dev], thanks for the information. I tried to reproduce this issue
with the following Spark shell snippet:
{code}
import sqlContext._
import sqlContext.implicits._
import org.apache.spark.sql.types._
import org.apache.spark.sql.Row
val n = 26000
val schema = StructType((1 to n).map(i => StructField(s"f$i", IntegerType,
nullable = false)))
val bigRow = Row((1 to n): _*)
val df = createDataFrame(sc.parallelize(bigRow :: Nil), schema)
df.coalesce(1).write.mode("overwrite").format("parquet").save("file:///tmp/foo")
{code}
I was using Spark 1.4.0-SNAPSHOT. The command line used to start the shell was:
{noformat}
./bin/spark-shell --driver-memory 4g
{noformat}
I didn't get an OOM, but it hangs seemingly forever. After profiling it with
YJP, it turns out that this super-wide table is stressing the query planner by
causing Spark SQL to allocate a large number of small objects. I haven't tried
1.3.1 yet; will do when I get time.
I found that you once posted this issue to the Spark user mailing list. Would
you mind providing a full stack trace of the OOM error? Maybe it's more of a
Spark SQL issue than a Parquet issue.
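The stack trace quoted below bottoms out in the per-column dictionary writer allocation ({{IntList.initSlab}}). A quick back-of-envelope in plain Scala, assuming a hypothetical 64 KB initial slab per column writer (an illustrative figure, not the actual parquet-mr constant), shows why tens of thousands of columns can exhaust a default-sized heap:
{code}
// Back-of-envelope: heap preallocated by per-column dictionary writers.
// slabBytes is a hypothetical illustrative figure, not the actual
// parquet-mr constant used by IntList.initSlab.
val numColumns = 26000
val slabBytes  = 64 * 1024
val totalBytes = numColumns.toLong * slabBytes
println(s"~${totalBytes / (1024 * 1024)} MB just for initial slabs")  // ~1625 MB
{code}
Even under this assumption, the preallocation alone approaches the executor's default heap, before any actual row data is buffered.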
> parquet writer runs into OOM during writing when calling DataFrame.saveAsParquetFile in Spark SQL
> -------------------------------------------------------------------------------------------------
>
> Key: PARQUET-222
> URL: https://issues.apache.org/jira/browse/PARQUET-222
> Project: Parquet
> Issue Type: Bug
> Components: parquet-mr
> Affects Versions: 1.6.0
> Reporter: Chaozhong Yang
> Original Estimate: 336h
> Remaining Estimate: 336h
>
> In Spark SQL, there is a function {{saveAsParquetFile}} on {{DataFrame}}
> (formerly {{SchemaRDD}}). That function calls into parquet-mr, and sometimes
> it fails due to an OOM error thrown by parquet-mr. We can see the exception
> stack trace as follows:
> {noformat}
> [WARN] [task-result-getter-3] 03-19 11:17:58,274 [TaskSetManager] - Lost task 0.2 in stage 137.0 (TID 309, hb1.avoscloud.com): java.lang.OutOfMemoryError: Java heap space
> at parquet.column.values.dictionary.IntList.initSlab(IntList.java:87)
> at parquet.column.values.dictionary.IntList.<init>(IntList.java:83)
> at parquet.column.values.dictionary.DictionaryValuesWriter.<init>(DictionaryValuesWriter.java:85)
> at parquet.column.values.dictionary.DictionaryValuesWriter$PlainIntegerDictionaryValuesWriter.<init>(DictionaryValuesWriter.java:549)
> at parquet.column.ParquetProperties.getValuesWriter(ParquetProperties.java:88)
> at parquet.column.impl.ColumnWriterImpl.<init>(ColumnWriterImpl.java:74)
> at parquet.column.impl.ColumnWriteStoreImpl.newMemColumn(ColumnWriteStoreImpl.java:68)
> at parquet.column.impl.ColumnWriteStoreImpl.getColumnWriter(ColumnWriteStoreImpl.java:56)
> at parquet.io.MessageColumnIO$MessageColumnIORecordConsumer.<init>(MessageColumnIO.java:178)
> at parquet.io.MessageColumnIO.getRecordWriter(MessageColumnIO.java:369)
> at parquet.hadoop.InternalParquetRecordWriter.initStore(InternalParquetRecordWriter.java:108)
> at parquet.hadoop.InternalParquetRecordWriter.<init>(InternalParquetRecordWriter.java:94)
> at parquet.hadoop.ParquetRecordWriter.<init>(ParquetRecordWriter.java:64)
> at parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:282)
> at parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:252)
> at org.apache.spark.sql.parquet.InsertIntoParquetTable.org$apache$spark$sql$parquet$InsertIntoParquetTable$$writeShard$1(ParquetTableOperations.scala:304)
> at org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:325)
> at org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:325)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
> at org.apache.spark.scheduler.Task.run(Task.scala:56)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
> at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
> at java.lang.Thread.run(Thread.java:662)
> {noformat}
> By the way, there is another similar issue:
> https://issues.apache.org/jira/browse/PARQUET-99. But the reporter has closed
> it and marked it as resolved.
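Since the allocation in the trace above happens inside {{DictionaryValuesWriter}}, one possible mitigation to try (a sketch, not verified against this report) is disabling dictionary encoding through the Hadoop configuration before writing; {{parquet.enable.dictionary}} is the parquet-mr configuration key:
{code}
// Possible mitigation sketch (untested against this report): disable
// dictionary encoding so parquet-mr skips the per-column
// DictionaryValuesWriter buffers. Run inside a Spark 1.x shell session,
// where `sc` and `df` are the usual SparkContext and DataFrame.
sc.hadoopConfiguration.setBoolean("parquet.enable.dictionary", false)
df.saveAsParquetFile("file:///tmp/wide-table.parquet")
{code}
This trades away dictionary compression for the wide table, so output files may grow; whether that is acceptable depends on the data.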
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)