[ https://issues.apache.org/jira/browse/SPARK-13434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15157105#comment-15157105 ]

Ewan Higgs commented on SPARK-13434:
------------------------------------

Hi Sean,
This is using jmap with the {{-histo:live}} argument which, as I understand it, 
restricts the histogram to live objects. If it is counting non-live objects too 
and you have a way to check only live objects, let me know and I'll be happy to 
re-run the job.

{quote}
I'm missing what you're proposing – what is the opportunity to reduce memory 
usage?
{quote}

I'm trying to track the outstanding work from the GitHub issue. [~josephkb] 
suggested there:

{quote}
For 3, I should have been more specific. Tungsten makes improvements on 
DataFrames, so it should improve the performance of simple ML Pipeline 
operations like feature transformation and prediction. However, to get the same 
benefits for model training, we'll need to rewrite the algorithms to use 
DataFrames and not RDDs. Future work...
{quote}

So one proposal is to reimplement RF in terms of DataFrames.
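
For reference, a DataFrame-facing API already exists in {{spark.ml}}; the rough 
sketch below (assuming a DataFrame {{df}} with a string "label" column and 
numeric feature columns; the column names and parameter values are illustrative) 
shows the surface such a rewrite would target. As of 1.6 it still converts back 
to RDDs of LabeledPoint under the hood, so the proposal is really about moving 
the training internals themselves onto DataFrames:

{code}
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}

// Index the (possibly string) label so the classifier sees label metadata.
val labelIndexer = new StringIndexer()
  .setInputCol("label")
  .setOutputCol("indexedLabel")

// Assemble all non-label columns into a single feature vector.
val assembler = new VectorAssembler()
  .setInputCols(df.columns.filter(_ != "label"))
  .setOutputCol("features")

val rf = new RandomForestClassifier()
  .setLabelCol("indexedLabel")
  .setFeaturesCol("features")
  .setNumTrees(100)   // illustrative values only
  .setMaxDepth(10)

// Note: training here still goes through the RDD-based mllib code path.
val model = new Pipeline().setStages(Array(labelIndexer, assembler, rf)).fit(df)
{code}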

Aside from that, I do see that many of the objects are small and suffer from 
JVM overhead. For example, {{Predict}} is a pair of doubles yet consumes 32 
bytes per instance; in a native runtime it could be 8 bytes (a pair of floats). 
{{Node}} consumes 56 bytes per instance, but it looks like it should be possible 
to fit it in 41 bytes (int + (float, float) + float + bool + ptr + ptr + ptr).
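
To make that concrete, a flattened struct-of-arrays layout along the lines of 
the sketch below (illustrative only, not the current mllib classes; the field 
names are made up) would pay only the primitive widths per node instead of one 
object header plus padding for each {{Node}}, {{Predict}} and {{Option}} wrapper:

{code}
// Illustrative sketch: one set of parallel primitive arrays per forest, with
// child links stored as int indices (-1 = absent) instead of Option[Node] refs.
final class CompactForest(numNodes: Int) {
  val prediction = new Array[Float](numNodes)   // Predict.predict
  val prob       = new Array[Float](numNodes)   // Predict.prob
  val impurity   = new Array[Float](numNodes)
  val isLeaf     = new Array[Boolean](numNodes)
  val featureIdx = new Array[Int](numNodes)     // split feature, -1 for leaves
  val threshold  = new Array[Float](numNodes)   // continuous split threshold
  val leftChild  = new Array[Int](numNodes)     // node index, -1 if absent
  val rightChild = new Array[Int](numNodes)

  // Walk from `root` to a leaf for a dense feature row and return its prediction.
  def predictRow(features: Array[Double], root: Int): Float = {
    var i = root
    while (!isLeaf(i)) {
      i = if (features(featureIdx(i)) <= threshold(i)) leftChild(i) else rightChild(i)
    }
    prediction(i)
  }
}
{code}

Per node that is roughly 29 bytes of payload, the array object headers are 
amortised across the whole forest, and it would also eliminate the {{scala.Some}} 
wrappers that sit near the top of the histogram below.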

Another issue is concurrency. If multiple tasks are running within an executor, 
each building trees, then they are all consuming memory at the same time. This 
is a common issue in R when using papply. Reducing the concurrency can help 
relieve memory pressure.
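
As a deployment-side workaround (not a fix to the algorithm), one knob is to 
trade cores per executor for more executors, so fewer trees and 
{{DTStatsAggregator}}s are live in any single JVM at once. A sketch, with 
illustrative numbers based on the reporter's original 2 x 199 setup:

{code}
import org.apache.spark.SparkConf

// Halve the concurrent tasks per JVM and double the executor count so the
// total parallelism stays roughly the same as the original configuration.
val conf = new SparkConf()
  .set("spark.executor.cores", "1")         // was 2
  .set("spark.executor.instances", "398")   // was 199
  .set("spark.executor.memory", "10240M")
{code}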

> Reduce Spark RandomForest memory footprint
> ------------------------------------------
>
>                 Key: SPARK-13434
>                 URL: https://issues.apache.org/jira/browse/SPARK-13434
>             Project: Spark
>          Issue Type: Improvement
>          Components: MLlib
>    Affects Versions: 1.6.0
>         Environment: Linux
>            Reporter: Ewan Higgs
>              Labels: decisiontree, mllib, randomforest
>         Attachments: heap-usage.log, rf-heap-usage.png
>
>
> The RandomForest implementation can easily run out of memory on moderate 
> datasets. This was raised in a user's benchmarking game on GitHub 
> (https://github.com/szilard/benchm-ml/issues/19). I looked to see if there 
> was a tracking issue, but I couldn't find one.
> Using Spark 1.6, a user of mine is running into problems training a 
> RandomForest on largish datasets on machines with 64G of memory and the 
> following in {{spark-defaults.conf}}:
> {code}
> spark.executor.cores 2
> spark.executor.instances 199
> spark.executor.memory 10240M
> {code}
> I reproduced the excessive memory use from the benchmark example (using an 
> input CSV of 1.3G and 686 columns) in the spark shell with {{spark-shell 
> --driver-memory 30G --executor-memory 30G}} and captured a heap profile on a 
> single machine by running {{jmap -histo:live <spark-pid>}}. I took a sample 
> every 5 seconds, and at the peak it looks like this:
> {code}
>  num     #instances         #bytes  class name
> ----------------------------------------------
>    1:       5428073     8458773496  [D
>    2:      12293653     4124641992  [I
>    3:      32508964     1820501984  org.apache.spark.mllib.tree.model.Node
>    4:      53068426     1698189632  org.apache.spark.mllib.tree.model.Predict
>    5:      72853787     1165660592  scala.Some
>    6:      16263408      910750848  org.apache.spark.mllib.tree.model.InformationGainStats
>    7:         72969      390492744  [B
>    8:       3327008      133080320  org.apache.spark.mllib.tree.impl.DTStatsAggregator
>    9:       3754500      120144000  scala.collection.immutable.HashMap$HashMap1
>   10:       3318349      106187168  org.apache.spark.mllib.tree.model.Split
>   11:       3534946       84838704  org.apache.spark.mllib.tree.RandomForest$NodeIndexInfo
>   12:       3764745       60235920  java.lang.Integer
>   13:       3327008       53232128  org.apache.spark.mllib.tree.impurity.EntropyAggregator
>   14:        380804       45361144  [C
>   15:        268887       34877128  <constMethodKlass>
>   16:        268887       34431568  <methodKlass>
>   17:        908377       34042760  [Lscala.collection.immutable.HashMap;
>   18:       1100000       26400000  org.apache.spark.mllib.regression.LabeledPoint
>   19:       1100000       26400000  org.apache.spark.mllib.linalg.SparseVector
>   20:         20206       25979864  <constantPoolKlass>
>   21:       1000000       24000000  org.apache.spark.mllib.tree.impl.TreePoint
>   22:       1000000       24000000  org.apache.spark.mllib.tree.impl.BaggedPoint
>   23:        908332       21799968  scala.collection.immutable.HashMap$HashTrieMap
>   24:         20206       20158864  <instanceKlassKlass>
>   25:         17023       14380352  <constantPoolCacheKlass>
>   26:            16       13308288  [Lorg.apache.spark.mllib.tree.impl.DTStatsAggregator;
>   27:        445797       10699128  scala.Tuple2
> {code}


