[jira] [Commented] (SPARK-18704) CrossValidator should preserve more tuning statistics

Nick Pentreath (JIRA) Sun, 04 Dec 2016 23:14:11 -0800

    [ 
https://issues.apache.org/jira/browse/SPARK-18704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15721448#comment-15721448
 ]


Nick Pentreath commented on SPARK-18704:
----------------------------------------

Yeah, I like this idea. I've also been finding that tying the metrics back to 
the params that generated them is very painful currently, limiting the 
usefulness of the cross-validator.

I think a summary class returning a {{DataFrame}} would make the most sense - 
e.g. for stats and visualization. Do you propose to work on it? 

> CrossValidator should preserve more tuning statistics
> -----------------------------------------------------
>
>                 Key: SPARK-18704
>                 URL: https://issues.apache.org/jira/browse/SPARK-18704
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML
>            Reporter: yuhao yang
>            Priority: Minor
>
> Currently CrossValidator trains (k folds * paramMaps) different models 
> during the training process, yet it only passes the average metrics to 
> CrossValidatorModel. As a result, important information such as the variance 
> of the metric for the same paramMap cannot be retrieved, and users cannot be 
> sure whether the chosen k is appropriate. Since CrossValidator is relatively 
> expensive, we probably want to get the most out of the tuning process.
> Just want to see if this sounds good. In my opinion, this can be done either 
> by passing a metrics matrix to the CrossValidatorModel or by introducing a 
> CrossValidatorSummary. I would vote for introducing the TunningSummary class, 
> which can also be used by TrainValidationSplit. In the summary we can present 
> better statistics for the tuning process, something like a DataFrame:
> +---------------+------------+--------+-----------------+
> |elasticNetParam|fitIntercept|regParam|metrics          |
> +---------------+------------+--------+-----------------+
> |0.0            |true        |0.1     |9.747795248932505|
> |0.0            |true        |0.01    |9.751942357398603|
> |0.0            |false       |0.1     |9.71727627087487 |
> |0.0            |false       |0.01    |9.721149803723822|
> |0.5            |true        |0.1     |9.719358515436005|
> |0.5            |true        |0.01    |9.748121645368501|
> |0.5            |false       |0.1     |9.687771328829479|
> |0.5            |false       |0.01    |9.717304811419261|
> |1.0            |true        |0.1     |9.696769467196487|
> |1.0            |true        |0.01    |9.744325276259957|
> |1.0            |false       |0.1     |9.665822167122172|
> |1.0            |false       |0.01    |9.713484065511892|
> +---------------+------------+--------+-----------------+
> Using the DataFrame, users can better understand the effect of different 
> parameters.
> Another thing we should improve is to include the paramMaps in the 
> CrossValidatorModel (or TrainValidationSplitModel) to allow meaningful 
> serialization. Keeping only the metrics without ParamMaps does not really 
> help model reuse.
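
To make the "metrics matrix" idea above concrete, here is a minimal sketch in plain Python (no Spark dependency) of how per-fold metrics could be reduced to one summary row per paramMap. The function name `summarize_tuning`, the column names `avgMetric` and `stdMetric`, and the sample numbers are all hypothetical; none of them come from the Spark API or from this issue.

```python
from statistics import mean, pstdev


def summarize_tuning(param_maps, fold_metrics):
    """Build one summary row per param map (hypothetical helper).

    param_maps:   list of dicts, e.g. {"regParam": 0.1}
    fold_metrics: fold_metrics[i][j] is the metric of param map j on fold i
    """
    rows = []
    for j, params in enumerate(param_maps):
        # Collect the metric of this param map across all k folds.
        per_fold = [fold[j] for fold in fold_metrics]
        row = dict(params)                    # keep the params next to the metrics
        row["avgMetric"] = mean(per_fold)     # what is passed on today
        row["stdMetric"] = pstdev(per_fold)   # the spread that is currently lost
        rows.append(row)
    return rows


# Made-up example: 2 param maps evaluated on k = 3 folds.
param_maps = [{"regParam": 0.1}, {"regParam": 0.01}]
fold_metrics = [
    [9.5, 9.8],
    [9.75, 9.75],
    [10.0, 9.7],
]
summary = summarize_tuning(param_maps, fold_metrics)
for row in summary:
    print(row)
```

In this made-up data both param maps have the same average metric but a different spread across folds, which is exactly the kind of information that the average alone hides. Because each row carries its params, such a list of rows could also be turned into a DataFrame or saved together with the model.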



