Github user jkbradley commented on the pull request:
https://github.com/apache/spark/pull/4677#issuecomment-75148803
The train + run combination is a legacy of how algorithms were originally
called: via static methods (train) that took a list of arguments. As
algorithms gained more parameters, we moved towards builder patterns, where
you build a learner with the parameters/settings you want and then call run.
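Roughly, the two call styles look like the sketch below (the learner and its
parameters are made up for illustration, not MLlib's actual API):

```scala
// Hypothetical learner, used only to contrast the two call styles.
class MyLearner(var stepSize: Double = 0.1, var numIterations: Int = 100) {
  def setStepSize(s: Double): this.type = { stepSize = s; this }
  def setNumIterations(n: Int): this.type = { numIterations = n; this }

  // Builder style: configure the instance, then call run on the data.
  def run(data: Seq[(Double, Array[Double])]): String =
    s"model(stepSize=$stepSize, numIterations=$numIterations, n=${data.size})"
}

object MyLearner {
  // Legacy style: a static method whose argument list grows with every new setting.
  def train(data: Seq[(Double, Array[Double])],
            stepSize: Double,
            numIterations: Int): String =
    new MyLearner(stepSize, numIterations).run(data)
}

object CallStyles extends App {
  val data = Seq((1.0, Array(0.5, 1.2)), (0.0, Array(-0.3, 0.7)))

  // Legacy static-method style: every setting is an argument to train.
  println(MyLearner.train(data, stepSize = 0.05, numIterations = 200))

  // Builder style: set only what you need, then call run.
  println(new MyLearner().setStepSize(0.05).setNumIterations(200).run(data))
}
```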
About Classification, you're getting at a problem inherent in many ML
methods: We often learn by optimizing surrogate losses. I.e., we may care
about 0/1 accuracy, not about the logistic loss, but we'll still learn using
the logistic loss since the math & optimization work out nicely for it. I'd
recommend:
* For the tests, switching to Regression manually seems reasonable.
* To help users:
  * Add a note in the docs that the validation error need not decrease
    monotonically, and recommend that users inspect the validation curve
    manually.
  * Maybe allow negative tolerances. That would let learning run a bit
    longer, but would still stop early once the algorithm really started
    overfitting.
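To make the surrogate-loss point concrete, here is a tiny standalone sketch
using the standard textbook definitions (nothing from this PR): two models can
have identical 0/1 loss on a point while the log loss the optimizer actually
sees differs a lot, which is why the validation curve need not track the
metric users care about.

```scala
object SurrogateLoss {
  // 0/1 loss: 1 if the thresholded prediction disagrees with the label.
  def zeroOneLoss(label: Double, prob: Double): Double =
    if ((if (prob >= 0.5) 1.0 else 0.0) != label) 1.0 else 0.0

  // Logistic (log) loss: the surrogate the optimizer actually minimizes.
  def logLoss(label: Double, prob: Double): Double =
    -(label * math.log(prob) + (1.0 - label) * math.log(1.0 - prob))

  def main(args: Array[String]): Unit = {
    val label = 1.0
    // Both predicted probabilities classify this point correctly, so the
    // 0/1 loss is identical, but the log loss differs by a factor of ~60.
    for (prob <- Seq(0.55, 0.99)) {
      println(f"p=$prob%.2f  0/1 loss=${zeroOneLoss(label, prob)}%.1f  " +
        f"log loss=${logLoss(label, prob)}%.4f")
    }
  }
}
```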
One more thought: It might be best to compare with the best validation
error found so far, rather than the previous iteration's value. That would
prevent slow and steady overfitting.
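Concretely, something like the sketch below (names are placeholders, not the
code in this PR): keep the best validation error seen so far and stop once the
current iteration fails to beat it by the tolerance; a negative tolerance then
just means tolerating small regressions before giving up.

```scala
object EarlyStopping {
  /**
   * Sketch: stop boosting when the current iteration fails to improve on the
   * best validation error seen so far by at least `validationTol`.
   * Returns (best iteration, iteration at which training stopped).
   * `validationError(i)` stands in for evaluating the ensemble after i
   * boosting iterations on a held-out validation set.
   */
  def runWithEarlyStopping(maxIterations: Int,
                           validationTol: Double,
                           validationError: Int => Double): (Int, Int) = {
    var bestError = validationError(1)
    var bestIteration = 1
    var i = 2
    while (i <= maxIterations) {
      val currentError = validationError(i)
      // Compare against the best error so far, not just the previous
      // iteration, so slow-and-steady overfitting still triggers the stop.
      // A negative validationTol tolerates small regressions before stopping.
      if (bestError - currentError < validationTol) {
        return (bestIteration, i)
      }
      if (currentError < bestError) {
        bestError = currentError
        bestIteration = i
      }
      i += 1
    }
    (bestIteration, maxIterations)
  }

  def main(args: Array[String]): Unit = {
    // Fake validation curve: improves, plateaus, then slowly overfits.
    val curve = Array(0.50, 0.40, 0.35, 0.34, 0.345, 0.36, 0.38, 0.41)
    val eval = (i: Int) => curve(i - 1)
    println(runWithEarlyStopping(curve.length, 0.0, eval))   // (4,5): stops at the plateau
    println(runWithEarlyStopping(curve.length, -0.05, eval)) // (4,8): runs longer, same best model
  }
}
```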
Thanks for the updates!