gtolomei commented on issue #14456:
URL: https://github.com/apache/spark/pull/14456#issuecomment-618318444
Hi everyone,
I know this thread is closed and that the bug in how `avgMetrics` was previously
computed has been fixed. Still, I'm experiencing an odd issue related to it,
which I'll try to explain below.
Basically, I have set up a `CrossValidator` in combination with a linear
regression pipeline and a grid of hyperparameters to select from. More
specifically, I run 5-fold cross-validation on 9 different settings resulting
from the combinations of two hyperparameters (each taking on 3 values), and
I keep track of _all_ 45 resulting models by setting the `collectSubModels`
flag to `True`:
```
from pyspark.ml import Pipeline
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.regression import LinearRegression
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

...

lr = LinearRegression(featuresCol="features", labelCol="label")
pipeline = Pipeline(stages=indexers + [encoder] + [assembler] + [lr])

param_grid = ParamGridBuilder() \
    .addGrid(lr.regParam, [0.0, 0.05, 0.1]) \
    .addGrid(lr.elasticNetParam, [0.0, 0.5, 1.0]) \
    .build()

cross_val = CrossValidator(estimator=pipeline,
                           estimatorParamMaps=param_grid,
                           evaluator=RegressionEvaluator(metricName="rmse"),
                           numFolds=5,
                           collectSubModels=True)

# Run cross-validation, and choose the best set of parameters
cv_model = cross_val.fit(train)
return cv_model
```
Everything seems to run smoothly, except that when I print out the performance
(i.e., the RMSE) of each model (9 models per fold) and "manually" average each
model's values across the folds, the resulting 9 averages **do not** match
the values exposed by the `avgMetrics` property of the fitted
`CrossValidatorModel`.
Just to give you an example, the following are the 5 RMSE values I obtained
using the first combination of the two hyperparameters (i.e., both set to 0):
```
*************** Fold #1 ***************
--- Model #1 out of 9 ---
Parameters: lambda=[0.000]; alpha=[0.000]
RMSE: 149354.656
*************** Fold #2 ***************
--- Model #1 out of 9 ---
Parameters: lambda=[0.000]; alpha=[0.000]
RMSE: 146038.521
*************** Fold #3 ***************
--- Model #1 out of 9 ---
Parameters: lambda=[0.000]; alpha=[0.000]
RMSE: 148739.919
*************** Fold #4 ***************
--- Model #1 out of 9 ---
Parameters: lambda=[0.000]; alpha=[0.000]
RMSE: 146816.473
*************** Fold #5 ***************
--- Model #1 out of 9 ---
Parameters: lambda=[0.000]; alpha=[0.000]
RMSE: 149868.621
```
As you can see, all five RMSE values are below 150,000.
My expectation was that averaging those values would give me the first element
of the `avgMetrics` list (which supposedly contains, for each hyperparameter
combination, the evaluation metric averaged across the folds).
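Just to make the mismatch concrete, here is the manual average of those five
values (a quick check in plain Python, using the numbers printed above):

```python
# The five per-fold RMSE values printed above for the first
# hyperparameter combination (lambda = 0.0, alpha = 0.0).
fold_rmse = [149354.656, 146038.521, 148739.919, 146816.473, 149868.621]

# Plain arithmetic mean across the 5 folds.
manual_avg = sum(fold_rmse) / len(fold_rmse)
print(round(manual_avg, 3))  # 148163.638
```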
Instead, this is what `cv_model.avgMetrics` actually returns:
```
[150091.7372030353, 150091.7372030353, 150091.7372030353, 150091.7345116686,
150093.66131828527, 150090.52769066638, 150091.7338301999, 150090.52716106002,
150091.59829053417]
```
There are 9 elements, as expected, but none of them looks correct! In fact,
all of them are above 150,000, even though none of my 45 models (not just the
5 listed above) ever reaches those figures.
It looks like the way `avgMetrics` is populated is wrong.
I have also inspected the [current implementation][1] of the `_fit`
method of `CrossValidator` and - although I haven't spent too much
time on this - everything there looks fine:
```
for i in range(nFolds):
    validateLB = i * h
    validateUB = (i + 1) * h
    condition = (df[randCol] >= validateLB) & (df[randCol] < validateUB)
    validation = df.filter(condition).cache()
    train = df.filter(~condition).cache()

    tasks = _parallelFitTasks(est, train, eva, validation, epm,
                              collectSubModelsParam)
    for j, metric, subModel in pool.imap_unordered(lambda f: f(), tasks):
        metrics[j] += (metric / nFolds)
        if collectSubModelsParam:
            subModels[i][j] = subModel
```
Interestingly enough, the same thing happens when I run k-fold cross-validation
with `LogisticRegression`, using `areaUnderROC` as the evaluation metric on
each fold (for a different task).
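Incidentally, the accumulation in that `_fit` loop - `metrics[j] += (metric / nFolds)` -
is just an incremental arithmetic mean, so the averaging step itself does not
seem to be the culprit. A quick sanity check of that claim in plain Python,
using the five RMSE values from above as stand-ins for the per-fold metrics:

```python
import math

# Mimic the accumulation from CrossValidator._fit:
#     metrics[j] += (metric / nFolds)
nFolds = 5
fold_metrics = [149354.656, 146038.521, 148739.919, 146816.473, 149868.621]

acc = 0.0
for metric in fold_metrics:
    acc += metric / nFolds

# The incremental accumulation equals the plain mean of the fold metrics.
print(math.isclose(acc, sum(fold_metrics) / len(fold_metrics)))  # True
```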
Has anyone else experienced the same issue?
Many thanks, any help will be much appreciated!
G.
**NOTE:** I have blindly assumed that the problem (if any) lies in the
`avgMetrics` property; however, it might be that those averages are actually
correct, and that the individual metrics I printed above by calling
`.summary.rootMeanSquaredError` on each submodel are the ones computed wrongly.
Either way, there is a clear inconsistency between the two.
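One way to disambiguate the two hypotheses (sketched below; `test` is a
stand-in for any held-out DataFrame with the same schema as `train`, not part
of my snippet above) would be to re-score a single submodel with the same
evaluator and compare that figure with the summary one. As far as I
understand, `.summary` metrics are computed on the data the submodel was
*fitted* on, while `avgMetrics` is built from metrics computed on each fold's
validation split, so the two need not coincide:

```
from pyspark.ml.evaluation import RegressionEvaluator

evaluator = RegressionEvaluator(metricName="rmse")

# First submodel (hyperparameter combination #1) of the first fold;
# each submodel is a PipelineModel, so the LinearRegressionModel is
# its last stage.
sub = cv_model.subModels[0][0]

# RMSE reported by the training summary (what I printed above).
summary_rmse = sub.stages[-1].summary.rootMeanSquaredError

# RMSE recomputed with the same evaluator on held-out data
# (`test` is a hypothetical held-out DataFrame).
held_out_rmse = evaluator.evaluate(sub.transform(test))

print(summary_rmse, held_out_rmse)
```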
[1]:
https://spark.apache.org/docs/latest/api/python/_modules/pyspark/ml/tuning.html#CrossValidator