[ https://issues.apache.org/jira/browse/SPARK-1227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14204713#comment-14204713 ]

Martin Jaggi commented on SPARK-1227:
-------------------------------------

Actually this is still relevant, as looking at the training objective value is 
the only way to tell whether a chosen model has been properly trained or not. 
Without this, the user cannot know whether a resulting bad test error should 
be blamed on poor training of a good model, or on choosing a wrong model (e.g. 
a wrong regularization parameter). Both happen often. Or imagine a deep net, 
where the same question is even harder to tell apart.
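
As a concrete illustration, here is a minimal sketch (my own helper, not an 
existing MLlib function) of evaluating the L2-regularized logistic-loss 
objective for a trained model in a single pass over the data:

{code:scala}
import org.apache.spark.mllib.classification.LogisticRegressionModel
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

// Sketch only: (1/n) * sum_i log(1 + exp(-y_i * (w.x_i + b)))
//              + (regParam / 2) * ||w||^2
def logisticObjective(
    data: RDD[LabeledPoint],
    model: LogisticRegressionModel,
    regParam: Double): Double = {
  val w = model.weights.toArray
  val b = model.intercept
  val n = data.count()
  val totalLoss = data.map { p =>
    val margin = p.features.toArray.zip(w).map { case (x, wi) => x * wi }.sum + b
    val y = if (p.label > 0.5) 1.0 else -1.0   // MLlib uses 0/1 labels
    val z = -y * margin
    // numerically stable log(1 + exp(z))
    if (z > 0) z + math.log1p(math.exp(-z)) else math.log1p(math.exp(z))
  }.sum()
  val regTerm = 0.5 * regParam * w.map(wi => wi * wi).sum
  totalLoss / n + regTerm
}
{code}

Exactly this kind of function (one per loss/regularizer pair) is what should 
live next to the corresponding models.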

Now that MLlib is starting to offer different training algorithms for the same 
models (e.g. SGD and L-BFGS), and also different ways of distributing 
training, the training objective would definitely be useful for comparing 
algorithms (or also when comparing different step-size regimes, such as here: 
https://issues.apache.org/jira/browse/SPARK-3942 )
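
With such a helper, the two optimizers could be compared directly on the same 
model class, e.g. (a sketch, assuming the {{logisticObjective}} helper above 
and the current 1.1-era APIs):

{code:scala}
import org.apache.spark.mllib.classification.{LogisticRegressionWithLBFGS, LogisticRegressionWithSGD}
import org.apache.spark.mllib.optimization.SquaredL2Updater
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

val training: RDD[LabeledPoint] = ???   // your training data
val regParam = 0.1

// SGD with explicit L2 regularization
val sgd = new LogisticRegressionWithSGD()
sgd.optimizer
  .setNumIterations(100)
  .setRegParam(regParam)
  .setUpdater(new SquaredL2Updater)
val sgdModel = sgd.run(training)

// L-BFGS (uses SquaredL2Updater by default)
val lbfgs = new LogisticRegressionWithLBFGS()
lbfgs.optimizer
  .setNumIterations(100)
  .setRegParam(regParam)
val lbfgsModel = lbfgs.run(training)

println(s"SGD objective:    ${logisticObjective(training, sgdModel, regParam)}")
println(s"L-BFGS objective: ${logisticObjective(training, lbfgsModel, regParam)}")
{code}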

Maybe the nicest way around this would be to provide a proper *benchmarking 
suite* for regression and classification, which could be used to judge 
different algorithms in this respect, and also when newly contributed 
algorithms need to be compared for efficiency.

This is also related to the currently awkward way of passing around the 
regularizer values (which are part of the training optimization objective) 
through the updater function, which is handled quite differently in SGD as 
compared to L-BFGS; see the issue here:
https://issues.apache.org/jira/browse/SPARK-2505
and the Spark L-BFGS implementation:
https://github.com/apache/spark/pull/353
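
To illustrate the awkwardness: the only way to get at the regularization value 
through the current {{Updater}} API is as a by-product of a weight update. A 
sketch (abusing {{compute}} with step size 0 and a zero gradient, so the 
weights stay unchanged):

{code:scala}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.optimization.SquaredL2Updater

// Updater.compute returns (newWeights, regValue). With stepSize = 0 and a
// zero gradient the weights are untouched, so this extracts just the
// regularization term 0.5 * regParam * ||w||^2 for a given weight vector.
val updater = new SquaredL2Updater
val weights = Vectors.dense(0.5, -1.0, 2.0)
val zeroGradient = Vectors.zeros(weights.size)
val (_, regValue) = updater.compute(weights, zeroGradient, 0.0, 1, 0.1)
println(s"L2 regularization value: $regValue")
{code}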

Irrespective of the training error, the classifier methods would also benefit 
from adding the test accuracy percentage as a function (see the current code 
examples, where this still has to be calculated manually, as it is not yet 
implemented in {{BinaryClassificationMetrics}}).
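
A sketch of what currently has to be done by hand for each model:

{code:scala}
import org.apache.spark.mllib.classification.LogisticRegressionModel
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

// Test accuracy, computed manually since BinaryClassificationMetrics does
// not provide it: the fraction of points whose predicted label matches.
def accuracy(model: LogisticRegressionModel, test: RDD[LabeledPoint]): Double = {
  val correct = test.filter(p => model.predict(p.features) == p.label).count()
  correct.toDouble / test.count()
}
{code}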

> Diagnostics for Classification&Regression
> -----------------------------------------
>
>                 Key: SPARK-1227
>                 URL: https://issues.apache.org/jira/browse/SPARK-1227
>             Project: Spark
>          Issue Type: Improvement
>          Components: MLlib
>            Reporter: Martin Jaggi
>            Assignee: Martin Jaggi
>
> Currently, the attained objective function value is not computed (for 
> efficiency reasons, as one evaluation requires a full pass through the data).
> For diagnostics and for comparing different algorithms, we should 
> nevertheless provide this as a separate function (one map-reduce pass).
> Doing this requires the loss and regularizer functions themselves, not only 
> their gradients (which are currently in the Gradient class). How about adding 
> the new function directly on the corresponding models in classification/* and 
> regression/*? Any thoughts?


