[
https://issues.apache.org/jira/browse/SPARK-18704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15719499#comment-15719499
]
yuhao yang commented on SPARK-18704:
------------------------------------
An implementation of the tuning summary is available at
https://github.com/hhbyyh/spark/tree/tuningsummary/mllib/src/main/scala/org/apache/spark/ml/tuning
for anyone interested.
> CrossValidator should preserve more tuning statistics
> -----------------------------------------------------
>
> Key: SPARK-18704
> URL: https://issues.apache.org/jira/browse/SPARK-18704
> Project: Spark
> Issue Type: Improvement
> Components: ML
> Reporter: yuhao yang
> Priority: Minor
>
> Currently CrossValidator trains (k folds * paramMaps) different models
> during the training process, yet it passes only the average metrics to
> CrossValidatorModel. As a result, important information such as the variance
> across folds for the same paramMap cannot be retrieved, and users cannot tell
> whether the chosen k is appropriate. Since CrossValidator is relatively
> expensive, we probably want to get the most from the tuning process.
> Just want to see if this sounds good. In my opinion, this can be done either
> by passing a metrics matrix to the CrossValidatorModel or by introducing a
> CrossValidatorSummary. I would vote for introducing a TuningSummary class,
> which can also be used by TrainValidationSplit. In the summary we can present
> better statistics for the tuning process, something like a DataFrame:
> +---------------+------------+--------+-----------------+
> |elasticNetParam|fitIntercept|regParam|metrics          |
> +---------------+------------+--------+-----------------+
> |0.0            |true        |0.1     |9.747795248932505|
> |0.0            |true        |0.01    |9.751942357398603|
> |0.0            |false       |0.1     |9.71727627087487 |
> |0.0            |false       |0.01    |9.721149803723822|
> |0.5            |true        |0.1     |9.719358515436005|
> |0.5            |true        |0.01    |9.748121645368501|
> |0.5            |false       |0.1     |9.687771328829479|
> |0.5            |false       |0.01    |9.717304811419261|
> |1.0            |true        |0.1     |9.696769467196487|
> |1.0            |true        |0.01    |9.744325276259957|
> |1.0            |false       |0.1     |9.665822167122172|
> |1.0            |false       |0.01    |9.713484065511892|
> +---------------+------------+--------+-----------------+
> Using the DataFrame, users can better understand the effect of different
> parameters.
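
To make the "metrics matrix" idea in the description concrete, here is a minimal sketch in plain Python. It is not Spark code and not the implementation linked above. The function name summarize_tuning, the column names avgMetric, stdMetric and foldMetrics, and all numbers are hypothetical and chosen only for illustration. It assumes a k x m matrix with one row per fold and one column per paramMap, and shows how keeping it allows a per-paramMap spread to be reported next to the average.

```python
from statistics import mean, pstdev


def summarize_tuning(param_maps, fold_metrics):
    """Return one summary row per param map.

    param_maps:   list of dicts, one per candidate parameter combination.
    fold_metrics: k rows (one per fold), each holding one metric per param map.
    """
    if any(len(row) != len(param_maps) for row in fold_metrics):
        raise ValueError("each fold needs exactly one metric per param map")

    rows = []
    for j, params in enumerate(param_maps):
        # Column j of the matrix: the metric of this param map on every fold.
        per_fold = [row[j] for row in fold_metrics]
        rows.append({
            **params,
            "avgMetric": mean(per_fold),    # what is kept today
            "stdMetric": pstdev(per_fold),  # lost when only averages are kept
            "foldMetrics": per_fold,
        })
    return rows


# Hypothetical values for illustration only: 3 folds x 2 param maps.
param_maps = [{"regParam": 0.1}, {"regParam": 0.01}]
fold_metrics = [
    [9.0, 10.0],
    [10.0, 10.0],
    [11.0, 10.0],
]

for row in summarize_tuning(param_maps, fold_metrics):
    print(row)
```

In this made-up input both param maps have the same average metric, but only the first one varies across folds, which is the kind of difference that averages alone cannot show.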