GitHub user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/4677#issuecomment-75148803
  
    The train + run combination is a legacy thing, where algorithms were 
originally called using static methods (train) with a list of arguments.  As 
algorithms gained more arguments, we moved towards builder patterns, where you 
build a learner with the parameters/settings you want and then call run.
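
    For illustration only, here is a rough sketch of the two calling styles 
(not code from this PR; it assumes `data` is an RDD[LabeledPoint] that has 
already been loaded):

    ```scala
    import org.apache.spark.mllib.classification.LogisticRegressionWithSGD

    // Legacy style: a static train() method with a positional argument list.
    val modelA = LogisticRegressionWithSGD.train(data, 100)  // 100 iterations

    // Builder style: configure the learner first, then call run().
    val lr = new LogisticRegressionWithSGD()
    lr.optimizer
      .setNumIterations(100)
      .setRegParam(0.01)
    val modelB = lr.run(data)
    ```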
    
    About Classification, you're getting at a problem inherent in many ML 
methods: we often learn by optimizing surrogate losses.  That is, we may care 
about 0/1 accuracy rather than the logistic loss, but we still learn with the 
logistic loss since the math & optimization work out nicely for it (see the 
small loss sketch after the list below).  I'd recommend:
    * For the tests, switching to Regression manually seems reasonable.
    * To help users:
      * Add a note in the docs about validation error not changing 
monotonically.  Recommend to users that they inspect the validation curve 
manually.
      * Maybe allow negative tolerances.  That would let learning run for a bit 
longer, but would still stop early when the algorithm really started 
overfitting.
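
    To make the distinction concrete, a small sketch of the two losses (my 
notation, not code from this PR), assuming binary labels y in {-1, +1} and a 
raw score f(x):

    ```scala
    // Logistic (surrogate) loss: smooth and convex, hence easy to optimize.
    def logisticLoss(y: Double, score: Double): Double =
      math.log1p(math.exp(-y * score))

    // 0/1 loss: what the validation error actually measures.  It need not
    // decrease monotonically just because the surrogate loss does.
    def zeroOneLoss(y: Double, score: Double): Double =
      if (y * score > 0) 0.0 else 1.0
    ```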
    
    One more thought: It might be best to compare with the best validation 
error found so far, rather than the previous iteration's value.  That would 
prevent slow and steady overfitting.
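
    To make that concrete, a rough sketch of such a stopping rule (not the 
PR's implementation; the names are made up), combining the tolerance idea 
above with the best-so-far comparison:

    ```scala
    // Stop once the latest validation error fails to improve on the best
    // error seen so far by at least `tol`.  With tol = 0 we stop as soon as
    // the error exceeds the best so far; a negative tol tolerates small
    // regressions and only stops once overfitting is clearly under way.
    def shouldStop(validationErrors: Seq[Double], tol: Double): Boolean = {
      if (validationErrors.size < 2) {
        false
      } else {
        val best = validationErrors.init.min
        val current = validationErrors.last
        (best - current) < tol
      }
    }
    ```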
    
    Thanks for the updates!

