[
https://issues.apache.org/jira/browse/PARQUET-222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14577257#comment-14577257
]
Cheng Lian edited comment on PARQUET-222 at 6/8/15 2:33 PM:
------------------------------------------------------------
Hey [~phatak.dev], thanks for the information. I tried to reproduce this issue
with the following Spark shell snippet:
{code}
import sqlContext._
import sqlContext.implicits._
import org.apache.spark.sql.types._
import org.apache.spark.sql.Row
val n = 26000
val schema = StructType((1 to n).map(i => StructField(s"f$i", IntegerType,
nullable = false)))
val bigRow = Row((1 to n): _*)
val df = createDataFrame(sc.parallelize(bigRow :: Nil), schema)
df.coalesce(1).write.mode("overwrite").format("parquet").save("file:///tmp/foo")
{code}
I was using Spark 1.4.0-SNAPSHOT. The command line used to start the shell was:
{noformat}
./bin/spark-shell --driver-memory 4g
{noformat}
I didn't get an OOM, but it hangs seemingly forever. After profiling it with
YJP, it turns out that this super-wide table is stressing the query planner by
causing Spark SQL to allocate a large number of small objects. I haven't tried
1.3.1 yet; will do when I get time.
I found that you once posted this issue to the Spark user mailing list. Would
you mind providing a full stack trace of the OOM error? Maybe it's more of a
Spark SQL issue than a Parquet issue.
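The stack trace quoted below bottoms out in the per-column dictionary writer allocation ({{IntList.initSlab}}). A quick back-of-envelope in plain Scala, assuming a hypothetical 64 KB initial slab per column writer (an illustrative figure, not the actual parquet-mr constant), shows why tens of thousands of columns can exhaust a default-sized heap:
{code}
// Back-of-envelope: heap preallocated by per-column dictionary writers.
// slabBytes is a hypothetical illustrative figure, not the actual
// parquet-mr constant used by IntList.initSlab.
val numColumns = 26000
val slabBytes  = 64 * 1024
val totalBytes = numColumns.toLong * slabBytes
println(s"~${totalBytes / (1024 * 1024)} MB just for initial slabs")  // ~1625 MB
{code}
Even under this assumption, the preallocation alone approaches the executor's default heap, before any actual row data is buffered.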
> parquet writer runs into OOM during writing when calling DataFrame.saveAsParquetFile in Spark SQL
> -------------------------------------------------------------------------------------------------
>
> Key: PARQUET-222
> URL: https://issues.apache.org/jira/browse/PARQUET-222
> Project: Parquet
> Issue Type: Bug
> Components: parquet-mr
> Affects Versions: 1.6.0
> Reporter: Chaozhong Yang
> Original Estimate: 336h
> Remaining Estimate: 336h
>
> In Spark SQL, there is a function {{saveAsParquetFile}} on {{DataFrame}}
> (formerly {{SchemaRDD}}). That function calls into parquet-mr, and sometimes
> it fails due to an OOM error thrown by parquet-mr. We can see the exception
> stack trace as follows:
> {noformat}
> [WARN] [task-result-getter-3] 03-19 11:17:58,274 [TaskSetManager] - Lost task 0.2 in stage 137.0 (TID 309, hb1.avoscloud.com): java.lang.OutOfMemoryError: Java heap space
> at parquet.column.values.dictionary.IntList.initSlab(IntList.java:87)
> at parquet.column.values.dictionary.IntList.<init>(IntList.java:83)
> at parquet.column.values.dictionary.DictionaryValuesWriter.<init>(DictionaryValuesWriter.java:85)
> at parquet.column.values.dictionary.DictionaryValuesWriter$PlainIntegerDictionaryValuesWriter.<init>(DictionaryValuesWriter.java:549)
> at parquet.column.ParquetProperties.getValuesWriter(ParquetProperties.java:88)
> at parquet.column.impl.ColumnWriterImpl.<init>(ColumnWriterImpl.java:74)
> at parquet.column.impl.ColumnWriteStoreImpl.newMemColumn(ColumnWriteStoreImpl.java:68)
> at parquet.column.impl.ColumnWriteStoreImpl.getColumnWriter(ColumnWriteStoreImpl.java:56)
> at parquet.io.MessageColumnIO$MessageColumnIORecordConsumer.<init>(MessageColumnIO.java:178)
> at parquet.io.MessageColumnIO.getRecordWriter(MessageColumnIO.java:369)
> at parquet.hadoop.InternalParquetRecordWriter.initStore(InternalParquetRecordWriter.java:108)
> at parquet.hadoop.InternalParquetRecordWriter.<init>(InternalParquetRecordWriter.java:94)
> at parquet.hadoop.ParquetRecordWriter.<init>(ParquetRecordWriter.java:64)
> at parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:282)
> at parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:252)
> at org.apache.spark.sql.parquet.InsertIntoParquetTable.org$apache$spark$sql$parquet$InsertIntoParquetTable$$writeShard$1(ParquetTableOperations.scala:304)
> at org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:325)
> at org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:325)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
> at org.apache.spark.scheduler.Task.run(Task.scala:56)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
> at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
> at java.lang.Thread.run(Thread.java:662)
> {noformat}
> By the way, there is another similar issue:
> https://issues.apache.org/jira/browse/PARQUET-99. But the reporter has closed
> it and marked it as resolved.
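Since the allocation in the trace above happens inside {{DictionaryValuesWriter}}, one possible mitigation to try (a sketch, not verified against this report) is disabling dictionary encoding through the Hadoop configuration before writing; {{parquet.enable.dictionary}} is the parquet-mr configuration key:
{code}
// Possible mitigation sketch (untested against this report): disable
// dictionary encoding so parquet-mr skips the per-column
// DictionaryValuesWriter buffers. Run inside a Spark 1.x shell session,
// where `sc` and `df` are the usual SparkContext and DataFrame.
sc.hadoopConfiguration.setBoolean("parquet.enable.dictionary", false)
df.saveAsParquetFile("file:///tmp/wide-table.parquet")
{code}
This trades away dictionary compression for the wide table, so output files may grow; whether that is acceptable depends on the data.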
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)