[ https://issues.apache.org/jira/browse/SPARK-13434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15157105#comment-15157105 ]
Ewan Higgs commented on SPARK-13434:
------------------------------------

Hi Sean,

This is using jmap with the {{-histo:live}} argument which, I thought, reports live objects only. If it is dumping non-live objects too, and you have a way to check only live objects, let me know and I'll be happy to re-run the job.

{quote}
I'm missing what you're proposing – what is the opportunity to reduce memory usage?
{quote}

I'm trying to track the outstanding work from the Github issue. [~josephkb] suggested there:

{quote}
For 3, I should have been more specific. Tungsten makes improvements on DataFrames, so it should improve the performance of simple ML Pipeline operations like feature transformation and prediction. However, to get the same benefits for model training, we'll need to rewrite the algorithms to use DataFrames and not RDDs. Future work...
{quote}

So one proposal is to reimplement RF in terms of DataFrames.

Aside from that, I do see that many of the objects are small and suffer from JVM overhead. For example, Predict is a pair of doubles yet it consumes 32 bytes; in a native runtime it could be 8 bytes (a pair of floats). Node consumes 52 bytes, but it looks like it should be possible to fit it in 41 bytes (int + (float, float) + float + bool + ptr + ptr + ptr).

Another issue is concurrency. If multiple threads within an executor are each building trees, then they are all consuming memory at the same time. This is a common issue in R when using papply. Reducing the concurrency can help reduce memory pressure.
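To make the overhead arithmetic above concrete, here is a rough sizing sketch in Java. The numbers are my own estimates, assuming a 64-bit JVM with compressed oops (12-byte object header, instance size padded to a multiple of 8); exact figures vary by JVM. The {{FlatNodes}} class is a hypothetical struct-of-arrays layout, not Spark's actual implementation:

```java
// Back-of-envelope JVM sizing and a packed tree layout.
// Assumptions (estimates, not measurements): 64-bit JVM with compressed
// oops, 12-byte object header, instance size padded to a multiple of 8.
public class FootprintSketch {
    static int pad8(int n) { return ((n + 7) / 8) * 8; }

    // mllib's Predict is essentially two doubles:
    // 12-byte header + 2 * 8 bytes of fields -> padded to 32 bytes.
    public static final int PREDICT_BYTES = pad8(12 + 2 * 8);

    // Packed Node estimate from the comment:
    // int + (float, float) + float + bool + 3 pointers = 41 bytes.
    public static final int PACKED_NODE_BYTES = 4 + (4 + 4) + 4 + 1 + 3 * 8;

    // Hypothetical struct-of-arrays tree (not Spark's actual layout):
    // a node is an index into parallel primitive arrays, child links are
    // ints (-1 = none), so there is no per-node header or Option boxing.
    public static final class FlatNodes {
        final int[] feature; final float[] threshold; final float[] predict;
        final boolean[] isLeaf; final int[] left; final int[] right;
        private int size = 0;

        FlatNodes(int capacity) {
            feature = new int[capacity]; threshold = new float[capacity];
            predict = new float[capacity]; isLeaf = new boolean[capacity];
            left = new int[capacity]; right = new int[capacity];
            java.util.Arrays.fill(left, -1);
            java.util.Arrays.fill(right, -1);
        }

        int add(int f, float t, float p, boolean leaf) {
            int id = size++;
            feature[id] = f; threshold[id] = t;
            predict[id] = p; isLeaf[id] = leaf;
            return id;
        }
        int size() { return size; }
    }

    public static void main(String[] args) {
        FlatNodes tree = new FlatNodes(4);
        int root = tree.add(3, 0.5f, 0.0f, false);           // internal node
        tree.left[root] = tree.add(-1, 0.0f, 1.0f, true);    // leaf child
        System.out.println("Predict ~" + PREDICT_BYTES + "B, packed Node ~"
                + PACKED_NODE_BYTES + "B, flat nodes: " + tree.size());
    }
}
```

In the flat layout each node costs roughly 4 + 4 + 4 + 1 + 4 + 4 = 21 bytes of primitive array storage with a single header per array rather than per node, at the price of a fixed capacity and a less object-oriented API.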
> Reduce Spark RandomForest memory footprint
> ------------------------------------------
>
>                 Key: SPARK-13434
>                 URL: https://issues.apache.org/jira/browse/SPARK-13434
>             Project: Spark
>          Issue Type: Improvement
>          Components: MLlib
>    Affects Versions: 1.6.0
>         Environment: Linux
>            Reporter: Ewan Higgs
>              Labels: decisiontree, mllib, randomforest
>         Attachments: heap-usage.log, rf-heap-usage.png
>
> The RandomForest implementation can easily run out of memory on moderate
> datasets. This was raised in a user's benchmarking game on github
> (https://github.com/szilard/benchm-ml/issues/19). I looked to see if there
> was a tracking issue, but I couldn't find one.
> Using Spark 1.6, a user of mine is running into problems running the
> RandomForest training on largish datasets on machines with 64G memory and the
> following in {{spark-defaults.conf}}:
> {code}
> spark.executor.cores 2
> spark.executor.instances 199
> spark.executor.memory 10240M
> {code}
> I reproduced the excessive memory use from the benchmark example (using an
> input CSV of 1.3G and 686 columns) in spark shell with {{spark-shell
> --driver-memory 30G --executor-memory 30G}} and have a heap profile from a
> single machine by running {{jmap -histo:live <spark-pid>}}.
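Following up the concurrency point in my comment above, against the quoted {{spark-defaults.conf}}: one untested mitigation (an assumption on my part, not a verified fix) is to lower per-executor task parallelism so that fewer trees are under construction in each JVM at once, e.g.:

{code}
spark.executor.cores 1
spark.executor.instances 199
spark.executor.memory 10240M
{code}

This trades throughput for per-task heap headroom; it does not reduce the per-object overhead itself.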
> I took a sample every 5 seconds and at the peak it looks like this:
> {code}
>  num     #instances         #bytes  class name
> ----------------------------------------------
>    1:       5428073     8458773496  [D
>    2:      12293653     4124641992  [I
>    3:      32508964     1820501984  org.apache.spark.mllib.tree.model.Node
>    4:      53068426     1698189632  org.apache.spark.mllib.tree.model.Predict
>    5:      72853787     1165660592  scala.Some
>    6:      16263408      910750848  org.apache.spark.mllib.tree.model.InformationGainStats
>    7:         72969      390492744  [B
>    8:       3327008      133080320  org.apache.spark.mllib.tree.impl.DTStatsAggregator
>    9:       3754500      120144000  scala.collection.immutable.HashMap$HashMap1
>   10:       3318349      106187168  org.apache.spark.mllib.tree.model.Split
>   11:       3534946       84838704  org.apache.spark.mllib.tree.RandomForest$NodeIndexInfo
>   12:       3764745       60235920  java.lang.Integer
>   13:       3327008       53232128  org.apache.spark.mllib.tree.impurity.EntropyAggregator
>   14:        380804       45361144  [C
>   15:        268887       34877128  <constMethodKlass>
>   16:        268887       34431568  <methodKlass>
>   17:        908377       34042760  [Lscala.collection.immutable.HashMap;
>   18:       1100000       26400000  org.apache.spark.mllib.regression.LabeledPoint
>   19:       1100000       26400000  org.apache.spark.mllib.linalg.SparseVector
>   20:         20206       25979864  <constantPoolKlass>
>   21:       1000000       24000000  org.apache.spark.mllib.tree.impl.TreePoint
>   22:       1000000       24000000  org.apache.spark.mllib.tree.impl.BaggedPoint
>   23:        908332       21799968  scala.collection.immutable.HashMap$HashTrieMap
>   24:         20206       20158864  <instanceKlassKlass>
>   25:         17023       14380352  <constantPoolCacheKlass>
>   26:            16       13308288  [Lorg.apache.spark.mllib.tree.impl.DTStatsAggregator;
>   27:        445797       10699128  scala.Tuple2
> {code}

--
This message was sent by Atlassian JIRA (v6.3.4#6332)