[ https://issues.apache.org/jira/browse/SPARK-18704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15721448#comment-15721448 ]
Nick Pentreath commented on SPARK-18704:
----------------------------------------
Yeah, I like this idea. I've also been finding that tying the metrics back to
the params that generated them is currently very painful, which limits the
usefulness of the cross-validator.
I think a summary class returning a {{DataFrame}} would make the most sense,
e.g. for stats and visualization. Do you propose to work on it?
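
To make the "tie metrics back to params" point concrete, here is a rough sketch in plain Python rather than Spark code. The function name summarize and the row layout are hypothetical; the values are the first four rows of the table in the issue description, and treating a lower metric as better is an assumption.

```python
# Hypothetical stand-ins for a parameter grid and its averaged metrics.
# Values are copied from the first four rows of the table in the issue.
param_maps = [
    {"elasticNetParam": 0.0, "fitIntercept": True, "regParam": 0.1},
    {"elasticNetParam": 0.0, "fitIntercept": True, "regParam": 0.01},
    {"elasticNetParam": 0.0, "fitIntercept": False, "regParam": 0.1},
    {"elasticNetParam": 0.0, "fitIntercept": False, "regParam": 0.01},
]
avg_metrics = [
    9.747795248932505,
    9.751942357398603,
    9.71727627087487,
    9.721149803723822,
]


def summarize(param_maps, metrics):
    """Pair each parameter map with its metric, one flat row per combination."""
    if len(param_maps) != len(metrics):
        raise ValueError("expected one metric per parameter map")
    return [dict(params, metric=m) for params, m in zip(param_maps, metrics)]


rows = summarize(param_maps, avg_metrics)

# Assumption: the metric is an error measure, so lower is better.
best = min(rows, key=lambda row: row["metric"])
```

Each row carries its parameters and its metric together, so sorting, filtering, or plotting no longer depends on matching two parallel arrays by index.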
> CrossValidator should preserve more tuning statistics
> -----------------------------------------------------
>
> Key: SPARK-18704
> URL: https://issues.apache.org/jira/browse/SPARK-18704
> Project: Spark
> Issue Type: Improvement
> Components: ML
> Reporter: yuhao yang
> Priority: Minor
>
> Currently CrossValidator trains (k folds * paramMaps) different models
> during the training process, yet it only passes the average metrics to
> CrossValidatorModel. As a result, some important information, such as the
> variance of the metric for the same paramMap, cannot be retrieved, and users
> cannot be sure whether the chosen k is appropriate. Since CrossValidator is
> relatively expensive, we probably want to get the most from the tuning
> process.
> Just want to see if this sounds good. In my opinion, this can be done either
> by passing a metrics matrix to the CrossValidatorModel, or by introducing a
> CrossValidatorSummary. I would vote for introducing a TuningSummary class,
> which can also be used by TrainValidationSplit. In the summary we can present
> better statistics for the tuning process. Something like a DataFrame:
> +---------------+------------+--------+-----------------+
> |elasticNetParam|fitIntercept|regParam|metrics          |
> +---------------+------------+--------+-----------------+
> |0.0            |true        |0.1     |9.747795248932505|
> |0.0            |true        |0.01    |9.751942357398603|
> |0.0            |false       |0.1     |9.71727627087487 |
> |0.0            |false       |0.01    |9.721149803723822|
> |0.5            |true        |0.1     |9.719358515436005|
> |0.5            |true        |0.01    |9.748121645368501|
> |0.5            |false       |0.1     |9.687771328829479|
> |0.5            |false       |0.01    |9.717304811419261|
> |1.0            |true        |0.1     |9.696769467196487|
> |1.0            |true        |0.01    |9.744325276259957|
> |1.0            |false       |0.1     |9.665822167122172|
> |1.0            |false       |0.01    |9.713484065511892|
> +---------------+------------+--------+-----------------+
> Using the DataFrame, users can better understand the effect of different
> parameters.
> Another thing we should improve is to include the paramMaps in the
> CrossValidatorModel (or TrainValidationSplitModel) to allow meaningful
> serialization. Keeping only the metrics without ParamMaps does not really
> help model reuse.
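
As a rough illustration of what the issue asks for, the sketch below keeps a per-fold metrics matrix, derives the mean and spread for each paramMap, and stores the params next to the metrics so the result survives serialization. It is plain Python, not Spark code; summarize_folds, the field names, and the numbers are all made up for the example.

```python
import json
from statistics import mean, pstdev


def summarize_folds(param_maps, fold_metrics):
    """Summarize a metrics matrix where fold_metrics[i][j] is the metric
    of param_maps[j] on fold i (layout is an assumption of this sketch)."""
    rows = []
    for j, params in enumerate(param_maps):
        per_fold = [fold[j] for fold in fold_metrics]
        rows.append({
            "params": params,            # kept alongside the metrics
            "foldMetrics": per_fold,     # raw values, one per fold
            "mean": mean(per_fold),      # what is exposed as the average today
            "stddev": pstdev(per_fold),  # population std. dev. across folds
        })
    return rows


# Made-up numbers for a 3-fold run over two parameter maps.
param_maps = [{"regParam": 0.1}, {"regParam": 0.01}]
fold_metrics = [
    [9.0, 9.5],
    [10.0, 9.5],
    [11.0, 9.5],
]

summary = summarize_folds(param_maps, fold_metrics)

# Because each row carries its own params, the summary can be written out
# and read back without losing the link between params and metrics.
serialized = json.dumps(summary)
```

In this toy data the first paramMap has the higher spread across folds while the second is perfectly stable, which is exactly the kind of difference that an average alone hides.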