Michael Bieniosek created SPARK-6698:
----------------------------------------
             Summary: RandomForest.scala (et al) hardcodes usage of StorageLevel.MEMORY_AND_DISK
                 Key: SPARK-6698
                 URL: https://issues.apache.org/jira/browse/SPARK-6698
             Project: Spark
          Issue Type: Bug
          Components: MLlib
    Affects Versions: 1.3.0
            Reporter: Michael Bieniosek

In RandomForest.scala the feature input is persisted with StorageLevel.MEMORY_AND_DISK during the bagging phase, even if the bagging rate is set to 100%. This forces the RDD to be stored deserialized, which causes major JVM GC headaches if the RDD is sizable. Something similar happens in NodeIdCache.scala, though I believe the RDD is smaller in that case.

A simple fix would be to persist with the same StorageLevel as the input RDD.
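A minimal sketch of that fix, assuming a helper along these lines (the name persistLikeInput and the fallback behavior are illustrative, not an actual patch):

{code}
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

// Illustrative only: persist a derived RDD (e.g. the bagged input) with the
// same storage level as the RDD it was built from, instead of hardcoding
// StorageLevel.MEMORY_AND_DISK.
def persistLikeInput[T](derived: RDD[T], input: RDD[_]): RDD[T] = {
  // If the caller never persisted the input, fall back to today's behavior.
  val level =
    if (input.getStorageLevel == StorageLevel.NONE) StorageLevel.MEMORY_AND_DISK
    else input.getStorageLevel
  derived.persist(level)
}
{code}

With something like this, a caller who persisted the feature input with a serialized level such as StorageLevel.MEMORY_AND_DISK_SER would keep that choice through the bagging phase, avoiding the deserialized copy and the GC pressure it brings.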