Github user WeichenXu123 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19186#discussion_r138577518

    --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala ---
    @@ -483,24 +488,24 @@ class LogisticRegression @Since("1.2.0") (
         this
       }

    -  override protected[spark] def train(dataset: Dataset[_]): LogisticRegressionModel = {
    -    val handlePersistence = dataset.storageLevel == StorageLevel.NONE
    -    train(dataset, handlePersistence)
    -  }
    -
    -  protected[spark] def train(
    -      dataset: Dataset[_],
    -      handlePersistence: Boolean): LogisticRegressionModel = {
    +  protected[spark] def train(dataset: Dataset[_]): LogisticRegressionModel = {
         val w = if (!isDefined(weightCol) || $(weightCol).isEmpty) lit(1.0) else col($(weightCol))
         val instances: RDD[Instance] =
           dataset.select(col($(labelCol)), w, col($(featuresCol))).rdd.map {
             case Row(label: Double, weight: Double, features: Vector) =>
               Instance(label, weight, features)
           }

    -    if (handlePersistence) instances.persist(StorageLevel.MEMORY_AND_DISK)
    +    if (dataset.storageLevel == StorageLevel.NONE) {
    +      if ($(handlePersistence)) {
    +        instances.persist(StorageLevel.MEMORY_AND_DISK)
    +      } else {
    +        logWarning("The input dataset is uncached, which may hurt performance if its upstreams " +
    +          "are also uncached.")
    +      }
    +    }
    --- End diff --

    @smurching @zhengruifeng For estimators that inherit `Predictor` (such as LiR/LoR), the check `if (dataset.storageLevel == StorageLevel.NONE)` helps nothing. `Predictor.fit` generates a cast Dataset from the input Dataset, which clears the original Dataset's storage level (i.e., `dataset.storageLevel` will always be `StorageLevel.NONE` here, even when the input dataset is cached). I remember this issue has been discussed in JIRA and in older PRs, but this PR does not resolve it.
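    The effect described above can be reproduced outside of ML code. This is a minimal sketch (not code from this PR), assuming a local SparkSession: any transformation such as a `select` with a cast produces a new Dataset whose logical plan no longer matches the cached one, so the derived Dataset reports `StorageLevel.NONE` even though the user's input is persisted.

    ```scala
    import org.apache.spark.sql.SparkSession
    import org.apache.spark.storage.StorageLevel

    object StorageLevelSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .master("local[2]")
          .appName("storage-level-sketch")
          .getOrCreate()
        import spark.implicits._

        // The user caches the input Dataset.
        val input = Seq((0.0, 1.0), (1.0, 2.0)).toDF("label", "feature")
        input.persist(StorageLevel.MEMORY_AND_DISK)

        // Predictor.fit internally derives a new Dataset from the input
        // (e.g. casting the label column). The derived Dataset's storage
        // level is independent of the input's.
        val casted = input.select($"label".cast("double"), $"feature")

        println(input.storageLevel)  // MEMORY_AND_DISK: the user's cache is intact
        println(casted.storageLevel) // NONE: the check in train() always sees an "uncached" input

        spark.stop()
      }
    }
    ```

    This is why checking `dataset.storageLevel` inside `train` cannot tell whether the user cached the original input; the check would have to happen before `Predictor.fit` transforms the Dataset.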