Github user zhengruifeng commented on the issue:
https://github.com/apache/spark/pull/19186
@WeichenXu123 Thanks a lot for pointing it out! I also forgot about this.
@smurching Thanks for your solution, however, I think there maybe exist
another drawback in it: The alg usually use only several columns in the
dataset, like `featuresCol`/`labelCol`. So we should not cache the whole
dataset, we should just cache the used columns.
I think the issue of handling persistence maybe somewhat complexed, and it
is up to the author of algs to decide whether, when and where to persist.
In current impl of `LinearRegression`, if the `Normal` solver is chosen,
then the input will not be cached. if the `LBFGS` solver is chosen, the input
will always be cached. However, for `LBFGS` solver, the input may not need to
be cached if the param `maxIter` is set 0 or 1.
I prefer this solution: add methods `preprocess(dataset: Dataset[_]):
DataFrame` and `postprocess(dataset: DataFrame)` in `Predictor`. The two
methods will be called before and after `train()`. In `preprocess`, we do
casting and persist in it, while in `postprocess`, we unpersist the
intermediate dataframe if needed. Algs can override those for specific purpose.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]