Github user WeichenXu123 commented on a diff in the pull request:
https://github.com/apache/spark/pull/19186#discussion_r138577518
--- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala ---
@@ -483,24 +488,24 @@ class LogisticRegression @Since("1.2.0") (
     this
   }
 
-  override protected[spark] def train(dataset: Dataset[_]): LogisticRegressionModel = {
-    val handlePersistence = dataset.storageLevel == StorageLevel.NONE
-    train(dataset, handlePersistence)
-  }
-
-  protected[spark] def train(
-      dataset: Dataset[_],
-      handlePersistence: Boolean): LogisticRegressionModel = {
+  protected[spark] def train(dataset: Dataset[_]): LogisticRegressionModel = {
     val w = if (!isDefined(weightCol) || $(weightCol).isEmpty) lit(1.0) else col($(weightCol))
     val instances: RDD[Instance] =
       dataset.select(col($(labelCol)), w, col($(featuresCol))).rdd.map {
         case Row(label: Double, weight: Double, features: Vector) =>
           Instance(label, weight, features)
       }
 
-    if (handlePersistence) instances.persist(StorageLevel.MEMORY_AND_DISK)
+    if (dataset.storageLevel == StorageLevel.NONE) {
+      if ($(handlePersistence)) {
+        instances.persist(StorageLevel.MEMORY_AND_DISK)
+      } else {
+        logWarning("The input dataset is uncached, which may hurt performance if its upstreams " +
+          "are also uncached.")
+      }
+    }
--- End diff ---
@smurching @zhengruifeng
For estimators that inherit `Predictor` (such as LiR/LoR), the check `if (dataset.storageLevel == StorageLevel.NONE)` does not help at all.
Because `Predictor.fit` generates a casted Dataset from the input Dataset, the storage level of the Dataset passed into `train` is cleared (i.e., `dataset.storageLevel` will always be `StorageLevel.NONE`, even when the input dataset is cached).
I remember this issue has been discussed in JIRA and older PRs, but this PR does not resolve it.
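For illustration, here is a minimal, self-contained sketch (not code from this PR; the object, session, and column names are only for the example) of why the check misses: any transformation, such as the label cast applied in `Predictor.fit`, produces a new Dataset with a different logical plan, so its `storageLevel` reports `StorageLevel.NONE` even though the original input is cached.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.DoubleType

object CastedStorageLevelDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("casted-storage-level-demo")
      .getOrCreate()
    import spark.implicits._

    // A cached input dataset, as a user would pass to Estimator.fit().
    val input = Seq((1.0, 0.5), (0.0, 1.5)).toDF("label", "feature").cache()
    println(input.storageLevel)   // reports MEMORY_AND_DISK: the user's dataset is cached

    // A cast like the one Predictor.fit applies to the label column yields a
    // *new* Dataset with a different logical plan; the cache lookup misses,
    // so the storage level seen inside train() is StorageLevel.NONE.
    val casted = input.withColumn("label", col("label").cast(DoubleType))
    println(casted.storageLevel)  // reports NONE even though `input` is cached

    spark.stop()
  }
}
```

So inside `train`, the `dataset.storageLevel == StorageLevel.NONE` branch is taken regardless of whether the user cached the original input.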
---