zhengruifeng created SPARK-23841:
------------------------------------

             Summary: NodeIdCache should unpersist the last cached nodeIdsForInstances
                 Key: SPARK-23841
                 URL: https://issues.apache.org/jira/browse/SPARK-23841
             Project: Spark
          Issue Type: Improvement
          Components: ML
    Affects Versions: 2.4.0
            Reporter: zhengruifeng


{{NodeIdCache}} forgets to unpersist the last cached intermediate RDD ({{nodeIdsForInstances}}):

 
{code:scala}
scala> import org.apache.spark.ml.classification._
import org.apache.spark.ml.classification._

scala> val df = spark.read.format("libsvm").load("/Users/zrf/Dev/OpenSource/spark/data/mllib/sample_libsvm_data.txt")
2018-04-02 11:48:25 WARN  LibSVMFileFormat:66 - 'numFeatures' option not specified, determining the number of features by going though the input. If you know the number in advance, please specify it via 'numFeatures' option to avoid the extra scan.
2018-04-02 11:48:31 WARN  ObjectStore:568 - Failed to get database global_temp, returning NoSuchObjectException
df: org.apache.spark.sql.DataFrame = [label: double, features: vector]

scala> val rf = new RandomForestClassifier().setCacheNodeIds(true)
rf: org.apache.spark.ml.classification.RandomForestClassifier = rfc_aab2b672546b

scala> val rfm = rf.fit(df)
rfm: org.apache.spark.ml.classification.RandomForestClassificationModel = RandomForestClassificationModel (uid=rfc_aab2b672546b) with 20 trees

scala> sc.getPersistentRDDs
res0: scala.collection.Map[Int,org.apache.spark.rdd.RDD[_]] = Map(56 -> MapPartitionsRDD[56] at map at NodeIdCache.scala:102)
{code}
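A minimal sketch of the missing cleanup step (the field name mirrors the summary; the class below is a simplified stand-in, not the actual private {{org.apache.spark.ml.tree.impl.NodeIdCache}} API):
{code:scala}
import org.apache.spark.rdd.RDD

// Simplified stand-in for NodeIdCache; only the cleanup path relevant to
// this issue is sketched here.
class NodeIdCacheSketch(private var nodeIdsForInstances: RDD[Array[Int]]) {

  // Called when training finishes; should release every intermediate RDD
  // that was cached, including the most recently updated one.
  def cleanup(): Unit = {
    if (nodeIdsForInstances != null) {
      // The step this issue asks for: unpersist the last cached
      // nodeIdsForInstances instead of leaving it in the block manager.
      nodeIdsForInstances.unpersist(blocking = false)
      nodeIdsForInstances = null
    }
  }
}
{code}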


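Until such a cleanup is added, a user-side workaround (a sketch, assuming nothing else in the application needs to stay cached) is to unpersist whatever is left behind after fitting:
{code:scala}
// After rf.fit(df) returns, release any RDDs that NodeIdCache left cached.
// Note: this unpersists *all* persistent RDDs in the SparkContext, so it is
// only safe if no other cached data should survive.
sc.getPersistentRDDs.values.foreach(_.unpersist(blocking = false))

// Verify that the context no longer tracks any persistent RDDs.
assert(sc.getPersistentRDDs.isEmpty)
{code}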
