gtolomei commented on issue #14456:
URL: https://github.com/apache/spark/pull/14456#issuecomment-618318444


   Hi everyone,
   
   I know this thread is closed and that the bug in how `avgMetrics` was 
previously computed has been fixed. Still, I'm experiencing an odd, related 
issue, which I'll try to explain below.
   
   Basically, I have set up a `CrossValidator` in combination with a linear 
regression pipeline and a grid of hyperparameters to select from. More 
specifically, I run 5-fold cross-validation on 9 different settings resulting 
from the combinations of two hyperparameters (each taking on 3 values), and I 
keep track of _all_ 45 resulting models by setting the `collectSubModels` 
flag to `True`:
   
   ```
    ...
    
    from pyspark.ml import Pipeline
    from pyspark.ml.evaluation import RegressionEvaluator
    from pyspark.ml.regression import LinearRegression
    from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
    
   lr = LinearRegression(featuresCol="features", labelCol="label")
   
   pipeline = Pipeline(stages=indexers + [encoder] + [assembler] + [lr])
   
   param_grid = ParamGridBuilder()\
           .addGrid(lr.regParam, [0.0, 0.05, 0.1]) \
           .addGrid(lr.elasticNetParam, [0.0, 0.5, 1.0])\
           .build()
   
   cross_val = CrossValidator(estimator=pipeline, 
                              estimatorParamMaps=param_grid,
                              evaluator=RegressionEvaluator(metricName="rmse"),
                              numFolds=5,
                              collectSubModels=True
                              )
   
   # Run cross-validation, and choose the best set of parameters
   cv_model = cross_val.fit(train)
   
   return cv_model
   ```
   Everything seems to run smoothly, except that when I print out the 
performance (i.e., the RMSE) of each model (9 models per fold) and try to 
"manually" compute the average across the folds, the resulting 9 average 
values **do not** match the values stored in the internal `avgMetrics` 
property of the `CrossValidator` at all.
   Just to give an example, the following are the 5 RMSE values I obtained 
with the first combination of the two hyperparameters (i.e., both set to 0):
   
   ```
   *************** Fold #1 ***************
   --- Model #1 out of 9 ---
        Parameters: lambda=[0.000]; alpha=[0.000] 
        RMSE: 149354.656
   
   *************** Fold #2 ***************
   --- Model #1 out of 9 ---
        Parameters: lambda=[0.000]; alpha=[0.000] 
        RMSE: 146038.521
   
   *************** Fold #3 ***************
   --- Model #1 out of 9 ---
        Parameters: lambda=[0.000]; alpha=[0.000] 
        RMSE: 148739.919
   
   *************** Fold #4 ***************
   --- Model #1 out of 9 ---
        Parameters: lambda=[0.000]; alpha=[0.000] 
        RMSE: 146816.473
   
   *************** Fold #5 ***************
   --- Model #1 out of 9 ---
        Parameters: lambda=[0.000]; alpha=[0.000] 
        RMSE: 149868.621
   ```
   
   As you can see, all the RMSE values are below 150,000.
   My expectation was that taking the average of those five values would give 
me the first element of the `avgMetrics` list (which supposedly contains, for 
each hyperparameter combination, the cross-validation average of the metric 
computed across the folds).
   Instead, this is what I get when I run `cv_model.avgMetrics`:
   
   ```
   [150091.7372030353, 150091.7372030353, 150091.7372030353, 150091.7345116686, 
150093.66131828527, 150090.52769066638, 150091.7338301999, 150090.52716106002, 
150091.59829053417]
   ```
   
   There are 9 elements, as expected, but none of them looks correct! In 
fact, all of them are above 150,000, even though none of my 45 models (not 
just the 5 listed above) reaches those figures.
   
   It looks like the way in which `avgMetrics` is populated is wrong.
   
   I have also inspected the [current implementation][1] of the `_fit` 
method of `CrossValidator` and - although I haven't spent too much time on 
this - everything apparently looks fine:
   
   ```
   for i in range(nFolds):
       validateLB = i * h
       validateUB = (i + 1) * h
       condition = (df[randCol] >= validateLB) & (df[randCol] < validateUB)
       validation = df.filter(condition).cache()
       train = df.filter(~condition).cache()
   
    tasks = _parallelFitTasks(est, train, eva, validation, epm, collectSubModelsParam)
       for j, metric, subModel in pool.imap_unordered(lambda f: f(), tasks):
           metrics[j] += (metric / nFolds)
           if collectSubModelsParam:
               subModels[i][j] = subModel
   ```
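   As far as I can tell, the update `metrics[j] += (metric / nFolds)` is just 
an incremental arithmetic mean. Here is a minimal stand-alone sketch of the 
same accumulation (the variable names are mine, and the per-fold values are 
the ones from my printout above):
   
   ```
   nFolds = 5
   # per-fold metrics for a single parameter combination (from my printout)
   fold_metrics = [149354.656, 146038.521, 148739.919, 146816.473, 149868.621]
   
   metric_acc = 0.0
   for metric in fold_metrics:
       metric_acc += metric / nFolds  # same update as metrics[j] += (metric / nFolds)
   
   # the accumulated value equals the ordinary mean of the fold metrics
   print(metric_acc)  # ~148163.64
   ```
   
   So, if the loop really receives the same per-fold values I printed, the 
averaging itself cannot be the source of the discrepancy; the metric values 
fed into it would have to differ from the ones each submodel's summary 
reports.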
   
   Interestingly enough, the same thing happens when I run k-fold 
cross-validation on `LogisticRegression` using `areaUnderROC` as the 
evaluation metric on each fold (for a different task).
   
   Has anyone else experienced the same issue?
   
   Many thanks; any help will be much appreciated!
   G.
   
   **NOTE:** I have blindly assumed the problem (if any) lies in the 
`avgMetrics` property; however, it might be that those averages are actually 
correct and that the individual metrics I printed above (obtained by calling 
`.summary.rootMeanSquaredError` on each submodel) are the ones computed 
wrongly. Either way, there is a clear inconsistency between the two.
   
     [1]: https://spark.apache.org/docs/latest/api/python/_modules/pyspark/ml/tuning.html#CrossValidator

