[ 
https://issues.apache.org/jira/browse/SPARK-6120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-6120:
-------------------------------------
    Description: 
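When the Python DecisionTree example in the programming guide is run with the 
default shell settings, model.save runs out of Java heap space (the trace below 
is from the Scala shell, which fails the same way):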

{code}
scala> model.save(sc, "myModelPath")
[Stage 12:>                                                        (0 + 8) / 8]
15/03/02 14:19:16 ERROR Executor: Exception in task 1.0 in stage 12.0 (TID 22)
java.lang.OutOfMemoryError: Java heap space
        at parquet.bytes.CapacityByteArrayOutputStream.initSlabs(CapacityByteArrayOutputStream.java:65)
        at parquet.bytes.CapacityByteArrayOutputStream.<init>(CapacityByteArrayOutputStream.java:57)
        at parquet.column.values.plain.PlainValuesWriter.<init>(PlainValuesWriter.java:45)
        at parquet.column.values.dictionary.DictionaryValuesWriter.<init>(DictionaryValuesWriter.java:102)
        at parquet.column.values.dictionary.DictionaryValuesWriter$PlainDoubleDictionaryValuesWriter.<init>(DictionaryValuesWriter.java:471)
        at parquet.column.ParquetProperties.getValuesWriter(ParquetProperties.java:111)
        at parquet.column.impl.ColumnWriterImpl.<init>(ColumnWriterImpl.java:74)
        at parquet.column.impl.ColumnWriteStoreImpl.newMemColumn(ColumnWriteStoreImpl.java:68)
        at parquet.column.impl.ColumnWriteStoreImpl.getColumnWriter(ColumnWriteStoreImpl.java:56)
        at parquet.io.MessageColumnIO$MessageColumnIORecordConsumer.<init>(MessageColumnIO.java:178)
        at parquet.io.MessageColumnIO.getRecordWriter(MessageColumnIO.java:369)
        at parquet.hadoop.InternalParquetRecordWriter.initStore(InternalParquetRecordWriter.java:108)
        at parquet.hadoop.InternalParquetRecordWriter.<init>(InternalParquetRecordWriter.java:94)
        at parquet.hadoop.ParquetRecordWriter.<init>(ParquetRecordWriter.java:64)
        at parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:282)
        at parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:252)
        at org.apache.spark.sql.parquet.ParquetRelation2.org$apache$spark$sql$parquet$ParquetRelation2$$writeShard$1(newParquet.scala:620)
        at org.apache.spark.sql.parquet.ParquetRelation2$$anonfun$insert$2.apply(newParquet.scala:641)
        at org.apache.spark.sql.parquet.ParquetRelation2$$anonfun$insert$2.apply(newParquet.scala:641)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
        at org.apache.spark.scheduler.Task.run(Task.scala:64)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:197)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
{code}

Saving in JSON format instead of Parquet works.  The failure seems to be 
caused by Parquet requiring a lot of metadata to describe the schema; in the 
trace above, the OOM occurs while Parquet allocates the buffers for its column 
writers.

I'm labeling this a bug since it should succeed with the default spark-shell 
settings.  Potential fixes are:
* increasing spark-shell default heap space settings (This is probably too hard 
to agree on currently.)
* not using Parquet for storage (This would be good for small examples but 
probably worse for large models, where Parquet would be more efficient than 
other formats.)
* compressing the schema (The various values in the DecisionTree model could be 
flattened into a single Seq of Double.  This may be the best option for now.)
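
To make the "compressing the schema" option concrete, here is a minimal sketch 
of flattening tree nodes into one sequence of doubles and reading them back.  
It is plain Python rather than Spark code, and the Node fields, FIELDS_PER_NODE, 
flatten and unflatten are all hypothetical names chosen for illustration; they 
are not MLlib's actual save format.

```python
from dataclasses import dataclass
from typing import List


# Hypothetical node record: field names are illustrative only and do not
# reflect the schema MLlib actually writes.
@dataclass
class Node:
    node_id: int
    prediction: float
    impurity: float
    is_leaf: bool
    feature: int       # -1 for leaf nodes
    threshold: float   # 0.0 for leaf nodes
    left_id: int       # -1 for leaf nodes
    right_id: int      # -1 for leaf nodes


FIELDS_PER_NODE = 8  # number of doubles stored per node


def flatten(nodes: List[Node]) -> List[float]:
    """Encode every node as FIELDS_PER_NODE consecutive doubles."""
    flat: List[float] = []
    for n in nodes:
        flat.extend([
            float(n.node_id), n.prediction, n.impurity,
            1.0 if n.is_leaf else 0.0,
            float(n.feature), n.threshold,
            float(n.left_id), float(n.right_id),
        ])
    return flat


def unflatten(flat: List[float]) -> List[Node]:
    """Decode the flat sequence back into Node records."""
    if len(flat) % FIELDS_PER_NODE != 0:
        raise ValueError("length is not a multiple of FIELDS_PER_NODE")
    nodes: List[Node] = []
    for i in range(0, len(flat), FIELDS_PER_NODE):
        f = flat[i:i + FIELDS_PER_NODE]
        nodes.append(Node(int(f[0]), f[1], f[2], f[3] == 1.0,
                          int(f[4]), f[5], int(f[6]), int(f[7])))
    return nodes
```

With this layout the stored schema would have a single column of doubles 
instead of one column per field, at the cost of a schema that no longer 
describes itself.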

Notes:
* This happens in both pyspark and Scala shells.
* Increasing driver memory to 1g (from the default of 512m) makes this succeed.
* Running other examples such as NaiveBayes with the default of 512m works.
* This is a bit strange given that the saved model itself is small (86K on 
disk for me).




> DecisionTree.save uses too much Java heap space for default spark shell 
> settings
> --------------------------------------------------------------------------------
>
>                 Key: SPARK-6120
>                 URL: https://issues.apache.org/jira/browse/SPARK-6120
>             Project: Spark
>          Issue Type: Bug
>          Components: MLlib
>    Affects Versions: 1.3.0
>            Reporter: Joseph K. Bradley
>



