Devesh Parekh created SPARK-5809:
------------------------------------
Summary: OutOfMemoryError in logDebug in RandomForest.scala
Key: SPARK-5809
URL: https://issues.apache.org/jira/browse/SPARK-5809
Project: Spark
Issue Type: Bug
Components: MLlib
Affects Versions: 1.2.0
Reporter: Devesh Parekh
When training a gradient-boosted trees model (GBM) on sparse vectors produced
by HashingTF, I get the following OutOfMemoryError while RandomForest is
building a debug string to log:
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
    at java.util.Arrays.copyOf(Arrays.java:3326)
    at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:137)
    at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:121)
    at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:421)
    at java.lang.StringBuilder.append(StringBuilder.java:136)
    at scala.collection.mutable.StringBuilder.append(StringBuilder.scala:197)
    at scala.collection.TraversableOnce$$anonfun$addString$1.apply(TraversableOnce.scala:327)
    at scala.collection.Iterator$class.foreach(Iterator.scala:727)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
    at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
    at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
    at scala.collection.TraversableOnce$class.addString(TraversableOnce.scala:320)
    at scala.collection.AbstractTraversable.addString(Traversable.scala:105)
    at scala.collection.TraversableOnce$class.mkString(TraversableOnce.scala:286)
    at scala.collection.AbstractTraversable.mkString(Traversable.scala:105)
    at scala.collection.TraversableOnce$class.mkString(TraversableOnce.scala:288)
    at scala.collection.AbstractTraversable.mkString(Traversable.scala:105)
    at org.apache.spark.mllib.tree.RandomForest$$anonfun$run$9.apply(RandomForest.scala:152)
    at org.apache.spark.mllib.tree.RandomForest$$anonfun$run$9.apply(RandomForest.scala:152)
    at org.apache.spark.Logging$class.logDebug(Logging.scala:63)
    at org.apache.spark.mllib.tree.RandomForest.logDebug(RandomForest.scala:67)
    at org.apache.spark.mllib.tree.RandomForest.run(RandomForest.scala:150)
    at org.apache.spark.mllib.tree.DecisionTree.run(DecisionTree.scala:64)
    at org.apache.spark.mllib.tree.GradientBoostedTrees$.org$apache$spark$mllib$tree$GradientBoostedTrees$$boost(GradientBoostedTrees.scala:150)
    at org.apache.spark.mllib.tree.GradientBoostedTrees.run(GradientBoostedTrees.scala:63)
    at org.apache.spark.mllib.tree.GradientBoostedTrees$.train(GradientBoostedTrees.scala:96)
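
The trace shows mkString being called from a closure that runs inside
logDebug, which suggests the message is only built when debug logging is
enabled for RandomForest. The sketch below illustrates that pattern and one
possible way to bound the size of such a string. It is not the code in
RandomForest.scala: the object BoundedDebugLogging, the debugEnabled flag and
the summarize helper are hypothetical names used only for illustration.

```scala
// Hypothetical stand-in for a logging trait with a by-name message parameter.
object BoundedDebugLogging {
  // Assumption: in a real application this follows the logging configuration.
  var debugEnabled: Boolean = false

  // The message is passed by name, so it is only built when debug is enabled.
  def logDebug(msg: => String): Unit =
    if (debugEnabled) println(msg)

  // Render at most `limit` elements instead of the whole collection.
  def summarize[A](xs: Iterable[A], limit: Int = 10): String = {
    val shown = xs.iterator.take(limit).mkString(", ")
    val total = xs.size
    if (total > limit) s"[$shown, ... (${total - limit} more)]" else s"[$shown]"
  }

  def main(args: Array[String]): Unit = {
    val values = 1 to 1000000
    // Debug is off here, so this large string is never constructed.
    logDebug("values = " + values.mkString(", "))
    // With debug on, the bounded summary keeps the message small.
    debugEnabled = true
    logDebug("values = " + summarize(values, limit = 5))
  }
}
```

With a wide feature space, as HashingTF can produce, an unbounded mkString
over per-feature data grows with the number of features, whereas a summary of
this kind stays at a fixed size.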
A workaround until this is fixed is to modify log4j.properties in the conf
directory to filter out debug logs in RandomForest. For example:
log4j.logger.org.apache.spark.mllib.tree.RandomForest=WARN
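
If editing log4j.properties is not convenient, the same level can probably be
set from driver code before training starts. This is a minimal sketch that
assumes the log4j 1.x API is on the classpath; the object name
QuietRandomForestLogging is hypothetical.

```scala
import org.apache.log4j.{Level, Logger}

// Hypothetical helper: raises the RandomForest logger to WARN at runtime.
object QuietRandomForestLogging {
  // Same logger name as in the log4j.properties line above.
  val LoggerName = "org.apache.spark.mllib.tree.RandomForest"

  def apply(): Unit =
    Logger.getLogger(LoggerName).setLevel(Level.WARN)
}
```

Calling QuietRandomForestLogging() once in the driver, before
GradientBoostedTrees.train, should have the same effect as the properties
line for that JVM.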