[ https://issues.apache.org/jira/browse/SPARK-29856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16972224#comment-16972224 ]
Enzo Bonnal edited comment on SPARK-29856 at 11/12/19 9:39 AM:
---------------------------------------------------------------

Just a note: if I am not wrong, _findBestSplits_ may leverage the caching if _nodeIdCache.nonEmpty_. Have you taken this into account?


was (Author: enzobnl):
Just a note: if I am not wrong_, findBestSplits_ may leverage the caching if _nodeIdCache.nonEmpty_. Have you taken this into account?

> Conditional unnecessary persist on RDDs in ML algorithms
> --------------------------------------------------------
>
>                 Key: SPARK-29856
>                 URL: https://issues.apache.org/jira/browse/SPARK-29856
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML, MLlib
>    Affects Versions: 3.0.0
>            Reporter: Dong Wang
>            Priority: Major
>
> When I run example.ml.GradientBoostedTreeRegressorExample, I find that RDD
> _baggedInput_ in _ml.tree.impl.RandomForest.run()_ is persisted, but it is
> only used once. So this persist operation is unnecessary.
> {code:scala}
> val baggedInput = BaggedPoint
>   .convertToBaggedRDD(treeInput, strategy.subsamplingRate, numTrees, withReplacement,
>     (tp: TreePoint) => tp.weight, seed = seed)
>   .persist(StorageLevel.MEMORY_AND_DISK)
> ...
> while (nodeStack.nonEmpty) {
>   ...
>   timer.start("findBestSplits")
>   RandomForest.findBestSplits(baggedInput, metadata, topNodesForGroup, nodesForGroup,
>     treeToNodeToIndexInfo, splits, nodeStack, timer, nodeIdCache)
>   timer.stop("findBestSplits")
> }
> baggedInput.unpersist()
> {code}
> However, the action on _baggedInput_ is inside a while loop.
> In GradientBoostedTreeRegressorExample, this loop executes only once, so only
> one action uses _baggedInput_.
> In most ML applications, the loop executes many times, which means
> _baggedInput_ is used in many actions. In those cases the persist is
> necessary.
> That is why the persist operation is "conditionally" unnecessary.
> The same situation exists in many other ML algorithms, e.g., RDD
> _instances_ in ml.clustering.KMeans.fit() and RDD _indices_ in
> mllib.clustering.BisectingKMeans.run().
> This issue was reported by our tool CacheCheck, which dynamically detects
> misuses of the persist()/unpersist() API.
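
To make the "conditional" point in the description concrete, here is a minimal sketch in plain Scala. It is a toy model, not Spark code: LazyDataset, ConditionalPersistDemo, and run are hypothetical names invented for this illustration, and the class only imitates the idea that a lazily computed dataset is re-evaluated by every action unless it has been persisted.

```scala
// Toy model (NOT Spark): a lazily evaluated dataset that counts how often
// its lineage is recomputed. All names here are hypothetical.
final class LazyDataset[A](compute: () => Seq[A]) {
  private var cached: Option[Seq[A]] = None
  private var persistRequested = false
  var computations = 0 // number of times the lineage was evaluated

  // Mark the dataset so that the first action stores its result.
  def persist(): this.type = { persistRequested = true; this }

  // Drop any stored result and stop caching.
  def unpersist(): Unit = { persistRequested = false; cached = None }

  // An "action": materializes the data, reusing the cache when available.
  def collect(): Seq[A] = cached.getOrElse {
    computations += 1
    val data = compute()
    if (persistRequested) cached = Some(data)
    data
  }
}

object ConditionalPersistDemo {
  // Runs `actions` actions and returns how many times the data was computed.
  def run(actions: Int, persist: Boolean): Int = {
    val ds = new LazyDataset(() => (1 to 5).map(_ * 2))
    if (persist) ds.persist()
    (1 to actions).foreach(_ => ds.collect())
    ds.unpersist()
    ds.computations
  }

  def main(args: Array[String]): Unit = {
    // One action: persisting saves nothing (1 computation either way).
    println(run(actions = 1, persist = true))  // prints 1
    println(run(actions = 1, persist = false)) // prints 1
    // Three actions: persisting avoids two recomputations.
    println(run(actions = 3, persist = true))  // prints 1
    println(run(actions = 3, persist = false)) // prints 3
  }
}
```

In this model the persist only pays off when more than one action runs, which mirrors the loop in RandomForest.run(): a single iteration gains nothing from the cache, while many iterations do. The sketch ignores the cost of writing the cache and any additional reuse such as the nodeIdCache path mentioned in the comment above.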