Github user WeichenXu123 commented on a diff in the pull request:
https://github.com/apache/spark/pull/19186#discussion_r138577518
--- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala ---
@@ -483,24 +488,24 @@ class LogisticRegression @Since("1.2.0") (
     this
   }
 
-  override protected[spark] def train(dataset: Dataset[_]): LogisticRegressionModel = {
-    val handlePersistence = dataset.storageLevel == StorageLevel.NONE
-    train(dataset, handlePersistence)
-  }
-
-  protected[spark] def train(
-      dataset: Dataset[_],
-      handlePersistence: Boolean): LogisticRegressionModel = {
+  protected[spark] def train(dataset: Dataset[_]): LogisticRegressionModel = {
     val w = if (!isDefined(weightCol) || $(weightCol).isEmpty) lit(1.0) else col($(weightCol))
     val instances: RDD[Instance] =
       dataset.select(col($(labelCol)), w, col($(featuresCol))).rdd.map {
         case Row(label: Double, weight: Double, features: Vector) =>
           Instance(label, weight, features)
       }
 
-    if (handlePersistence) instances.persist(StorageLevel.MEMORY_AND_DISK)
+    if (dataset.storageLevel == StorageLevel.NONE) {
+      if ($(handlePersistence)) {
+        instances.persist(StorageLevel.MEMORY_AND_DISK)
+      } else {
+        logWarning("The input dataset is uncached, which may hurt performance if its upstreams " +
+          "are also uncached.")
+      }
+    }
--- End diff ---
@smurching @zhengruifeng
For estimators that inherit `Predictor` (such as LiR/LoR), the check `if (dataset.storageLevel == StorageLevel.NONE)` does not help at all.
Because `Predictor.fit` generates a casted Dataset from the input Dataset, the storage level of the Dataset passed into `train` is cleared (i.e., `dataset.storageLevel` will always be `StorageLevel.NONE`, even when the input dataset is cached).
I remember this issue has been discussed in JIRA and older PRs, but this PR does not resolve it.
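For illustration, here is a minimal, self-contained sketch (not code from this PR; the object, session, and column names are only for the example) of why the check misses: any transformation, such as the label cast applied in `Predictor.fit`, produces a new Dataset with a different logical plan, so its `storageLevel` reports `StorageLevel.NONE` even though the original input is cached.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.DoubleType

object CastedStorageLevelDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("casted-storage-level-demo")
      .getOrCreate()
    import spark.implicits._

    // A cached input dataset, as a user would pass to Estimator.fit().
    val input = Seq((1.0, 0.5), (0.0, 1.5)).toDF("label", "feature").cache()
    println(input.storageLevel)   // reports MEMORY_AND_DISK: the user's dataset is cached

    // A cast like the one Predictor.fit applies to the label column yields a
    // *new* Dataset with a different logical plan; the cache lookup misses,
    // so the storage level seen inside train() is StorageLevel.NONE.
    val casted = input.withColumn("label", col("label").cast(DoubleType))
    println(casted.storageLevel)  // reports NONE even though `input` is cached

    spark.stop()
  }
}
```

So inside `train`, the `dataset.storageLevel == StorageLevel.NONE` branch is taken regardless of whether the user cached the original input.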
---