Github user zhengruifeng commented on the issue:

    https://github.com/apache/spark/pull/19186
  
    @WeichenXu123 Thanks a lot for pointing it out! I also forgot about this.
    @smurching Thanks for your solution, however, I think there maybe exist 
another drawback in it: The alg usually use only several columns in the 
dataset, like `featuresCol`/`labelCol`. So we should not cache the whole 
dataset, we should just cache the used columns.
    
    I think the issue of handling persistence maybe somewhat complexed, and it 
is up to the author of algs to decide whether, when and where to persist. 
    In current impl of `LinearRegression`, if the `Normal` solver is chosen, 
then the input will not be cached. if the `LBFGS` solver is chosen, the input 
will always be cached. However, for `LBFGS` solver, the input may not need to 
be cached if the param `maxIter` is set 0 or 1.
    
    I prefer this solution: add methods `preprocess(dataset: Dataset[_]): 
DataFrame` and `postprocess(dataset: DataFrame)` in `Predictor`. The two 
methods will be called before and after `train()`. In `preprocess`, we do 
casting and persist in it, while in `postprocess`, we unpersist the 
intermediate dataframe if needed. Algs can override those for specific purpose.



---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Reply via email to