[jira] [Comment Edited] (SPARK-29856) Conditional unnecessary persist on RDDs in ML algorithms

Enzo Bonnal (Jira) Tue, 12 Nov 2019 01:40:48 -0800


    [ 
https://issues.apache.org/jira/browse/SPARK-29856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16972224#comment-16972224
 ]


Enzo Bonnal edited comment on SPARK-29856 at 11/12/19 9:39 AM:
---------------------------------------------------------------

Just a note: if I am not mistaken, _findBestSplits_ may leverage the caching if 
_nodeIdCache.nonEmpty_. Have you taken this into account?



> Conditional unnecessary persist on RDDs in ML algorithms
> --------------------------------------------------------
>
>                 Key: SPARK-29856
>                 URL: https://issues.apache.org/jira/browse/SPARK-29856
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML, MLlib
>    Affects Versions: 3.0.0
>            Reporter: Dong Wang
>            Priority: Major
>
> When I run example.ml.GradientBoostedTreeRegressorExample, I find that the RDD 
> _baggedInput_ in _ml.tree.impl.RandomForest.run()_ is persisted, but it is 
> only used once, so this persist operation is unnecessary.
> {code:scala}
>     val baggedInput = BaggedPoint
>       .convertToBaggedRDD(treeInput, strategy.subsamplingRate, numTrees, withReplacement,
>         (tp: TreePoint) => tp.weight, seed = seed)
>       .persist(StorageLevel.MEMORY_AND_DISK)
>     ...
>     while (nodeStack.nonEmpty) {
>       ...
>       timer.start("findBestSplits")
>       RandomForest.findBestSplits(baggedInput, metadata, topNodesForGroup, nodesForGroup,
>         treeToNodeToIndexInfo, splits, nodeStack, timer, nodeIdCache)
>       timer.stop("findBestSplits")
>     }
>     baggedInput.unpersist()
> {code}
> However, the action on _baggedInput_ is inside a while loop. 
> In GradientBoostedTreeRegressorExample, this loop executes only once, so only 
> one action uses _baggedInput_.
> In most ML applications, the loop executes many times, which means 
> _baggedInput_ is used by many actions, and in that case the persist is 
> necessary.
> This is why the persist operation is only "conditionally" unnecessary.
> The same situation exists in many other ML algorithms, e.g., the RDD 
> _instances_ in ml.clustering.KMeans.fit() and the RDD _indices_ in 
> mllib.clustering.BisectingKMeans.run().
> This issue was reported by our tool CacheCheck, which dynamically detects 
> misuses of the persist()/unpersist() API.
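
The reasoning above (persist pays off only when more than one action reuses the data) can be illustrated with a minimal sketch. It is a plain-Scala simulation with no Spark dependency, and the names LazyDataset, action, recomputations and PersistDemo are hypothetical, introduced only for illustration. It is not the RandomForest code and not the Spark API; it only counts how often a lazily evaluated dataset is recomputed with and without caching.

```scala
// Plain-Scala simulation of a lazily evaluated dataset (assumption: a much
// simplified stand-in for an RDD; all names here are hypothetical).
final class LazyDataset[A](compute: () => Vector[A]) {
  private var cached: Option[Vector[A]] = None
  private var persistRequested = false
  var recomputations = 0 // how many times the data was (re)computed

  // Mark the dataset for caching; nothing is computed at this point.
  def persist(): this.type = { persistRequested = true; this }

  // Drop any cached data and stop caching.
  def unpersist(): this.type = { persistRequested = false; cached = None; this }

  // An "action": materializes the data, reusing the cache when present.
  def action[B](f: Vector[A] => B): B = {
    val data = cached.getOrElse {
      recomputations += 1
      val fresh = compute()
      if (persistRequested) cached = Some(fresh)
      fresh
    }
    f(data)
  }
}

object PersistDemo {
  // Runs `iterations` actions (the loop body) and returns the number of
  // times the dataset had to be computed.
  def run(iterations: Int, usePersist: Boolean): Int = {
    val ds = new LazyDataset(() => Vector.tabulate(1000)(i => i * 2))
    if (usePersist) ds.persist()
    (1 to iterations).foreach(_ => ds.action(_.sum))
    ds.unpersist()
    ds.recomputations
  }

  def main(args: Array[String]): Unit = {
    // One iteration: caching saves nothing, the data is computed once either way.
    println(s"1 action,  no persist: ${run(1, usePersist = false)}")
    println(s"1 action,  persist:    ${run(1, usePersist = true)}")
    // Several iterations: without caching every action recomputes the data.
    println(s"5 actions, no persist: ${run(5, usePersist = false)}")
    println(s"5 actions, persist:    ${run(5, usePersist = true)}")
  }
}
```

In this simulation a single-iteration loop computes the data once whether or not persist() is called, so the cache only adds storage cost; with five iterations the uncached variant computes the data five times and the cached one only once. Whether the same trade-off holds in a real job also depends on how expensive the lineage is and on the chosen storage level.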



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
