[ https://issues.apache.org/jira/browse/SPARK-13434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15156987#comment-15156987 ]
Sean Owen commented on SPARK-13434:
-----------------------------------

I'm missing what you're proposing -- what is the opportunity to reduce memory usage? The contents of the heap aren't all necessarily live. Is this after a GC?

> Reduce Spark RandomForest memory footprint
> ------------------------------------------
>
>                 Key: SPARK-13434
>                 URL: https://issues.apache.org/jira/browse/SPARK-13434
>             Project: Spark
>          Issue Type: Improvement
>          Components: MLlib
>    Affects Versions: 1.6.0
>         Environment: Linux
>            Reporter: Ewan Higgs
>              Labels: decisiontree, mllib, randomforest
>         Attachments: heap-usage.log, rf-heap-usage.png
>
> The RandomForest implementation can easily run out of memory on moderate
> datasets. This was raised in a user's benchmarking game on GitHub
> (https://github.com/szilard/benchm-ml/issues/19). I looked to see if there
> was a tracking issue, but I couldn't find one.
> Using Spark 1.6, a user of mine is running into problems running
> RandomForest training on largish datasets on machines with 64G memory and the
> following in {{spark-defaults.conf}}:
> {code}
> spark.executor.cores 2
> spark.executor.instances 199
> spark.executor.memory 10240M
> {code}
> I reproduced the excessive memory use from the benchmark example (using an
> input CSV of 1.3G and 686 columns) in spark-shell with {{spark-shell
> --driver-memory 30G --executor-memory 30G}} and took a heap profile on a
> single machine by running {{jmap -histo:live <spark-pid>}}.
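[Editor's note: the repeated-sampling procedure described here can be sketched as a small shell loop. This is a hedged sketch, not part of the original report; the function name `sample_heap` and its arguments are illustrative.]

```shell
# sample_heap PID OUTFILE [INTERVAL] [COUNT]
# Append COUNT live-object histograms of the JVM with the given PID to
# OUTFILE, INTERVAL seconds apart (defaults: every 5 s, 12 samples).
sample_heap() {
  pid="$1"; outfile="$2"; interval="${3:-5}"; count="${4:-12}"
  i=0
  while [ "$i" -lt "$count" ]; do
    # -histo:live triggers a full GC first, so only live objects are counted
    jmap -histo:live "$pid" >> "$outfile"
    sleep "$interval"
    i=$((i + 1))
  done
}

# e.g. one histogram every 5 seconds, 12 samples:
#   sample_heap <spark-pid> heap-usage.log 5 12
```

Because `-histo:live` forces a full GC before counting, the resulting histogram should answer the "is this after a GC?" question: only reachable objects appear.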
> I took a sample every 5 seconds and at the peak it looks like this:
> {code}
>  num     #instances         #bytes  class name
> ----------------------------------------------
>    1:       5428073     8458773496  [D
>    2:      12293653     4124641992  [I
>    3:      32508964     1820501984  org.apache.spark.mllib.tree.model.Node
>    4:      53068426     1698189632  org.apache.spark.mllib.tree.model.Predict
>    5:      72853787     1165660592  scala.Some
>    6:      16263408      910750848  org.apache.spark.mllib.tree.model.InformationGainStats
>    7:         72969      390492744  [B
>    8:       3327008      133080320  org.apache.spark.mllib.tree.impl.DTStatsAggregator
>    9:       3754500      120144000  scala.collection.immutable.HashMap$HashMap1
>   10:       3318349      106187168  org.apache.spark.mllib.tree.model.Split
>   11:       3534946       84838704  org.apache.spark.mllib.tree.RandomForest$NodeIndexInfo
>   12:       3764745       60235920  java.lang.Integer
>   13:       3327008       53232128  org.apache.spark.mllib.tree.impurity.EntropyAggregator
>   14:        380804       45361144  [C
>   15:        268887       34877128  <constMethodKlass>
>   16:        268887       34431568  <methodKlass>
>   17:        908377       34042760  [Lscala.collection.immutable.HashMap;
>   18:       1100000       26400000  org.apache.spark.mllib.regression.LabeledPoint
>   19:       1100000       26400000  org.apache.spark.mllib.linalg.SparseVector
>   20:         20206       25979864  <constantPoolKlass>
>   21:       1000000       24000000  org.apache.spark.mllib.tree.impl.TreePoint
>   22:       1000000       24000000  org.apache.spark.mllib.tree.impl.BaggedPoint
>   23:        908332       21799968  scala.collection.immutable.HashMap$HashTrieMap
>   24:         20206       20158864  <instanceKlassKlass>
>   25:         17023       14380352  <constantPoolCacheKlass>
>   26:            16       13308288  [Lorg.apache.spark.mllib.tree.impl.DTStatsAggregator;
>   27:        445797       10699128  scala.Tuple2
> {code}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
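[Editor's note: an aside on the histogram above. The tree-related classes (Node, Predict, their scala.Some wrappers, InformationGainStats, and the aggregator/impl classes) hold most of the live heap. A hedged sketch for totalling them from a dump such as the attached heap-usage.log; the function name `sum_tree_bytes` is illustrative.]

```shell
# sum_tree_bytes FILE: total the #bytes column for the Spark tree classes
# (and scala.Some wrappers) in a jmap -histo dump. Fields in each row are:
# rank, instance count, bytes, class name; bytes is field 3.
sum_tree_bytes() {
  awk '/org\.apache\.spark\.mllib\.tree|scala\.Some/ { total += $3 }
       END { printf "%.1f GB in tree-related objects\n", total / 1e9 }' "$1"
}

# e.g.:
#   sum_tree_bytes heap-usage.log
```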