Github user zhengruifeng commented on the issue: https://github.com/apache/spark/pull/19186 @WeichenXu123 Thanks a lot for pointing it out! I also forgot about this. @smurching Thanks for your solution, however, I think there maybe exist another drawback in it: The alg usually use only several columns in the dataset, like `featuresCol`/`labelCol`. So we should not cache the whole dataset, we should just cache the used columns. I think the issue of handling persistence maybe somewhat complexed, and it is up to the author of algs to decide whether, when and where to persist. In current impl of `LinearRegression`, if the `Normal` solver is chosen, then the input will not be cached. if the `LBFGS` solver is chosen, the input will always be cached. However, for `LBFGS` solver, the input may not need to be cached if the param `maxIter` is set 0 or 1. I prefer this solution: add methods `preprocess(dataset: Dataset[_]): DataFrame` and `postprocess(dataset: DataFrame)` in `Predictor`. The two methods will be called before and after `train()`. In `preprocess`, we do casting and persist in it, while in `postprocess`, we unpersist the intermediate dataframe if needed. Algs can override those for specific purpose.
--- --------------------------------------------------------------------- To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org