GitHub user srowen commented on the pull request:

    https://github.com/apache/spark/pull/372#issuecomment-40027762
  
    There are two components to why I think this might want to change.
    
    First is just the type-safety issue, which is the same reason one would use 
an enum instead of int in Java. Encoding categoricals as continuous values 
invites, maybe unwittingly, invalid operations, like finding the distance 
between "apple" and "orange". Or trying to apply linear regression as a 
classifier. I suppose it also means there must be a translation layer, yes, in 
all cases.
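
    As a rough sketch of the type-safety point (the names Fruit, Apple, Orange 
and CategoricalSketch are made up for illustration and are not part of MLlib), 
encoding categories as Doubles lets a meaningless distance compile, while a 
distinct categorical type rejects it:

```scala
// Hypothetical types for illustration only; none of these exist in MLlib.
sealed trait Fruit
case object Apple extends Fruit
case object Orange extends Fruit

object CategoricalSketch {
  // Distance is only meaningful for numeric values.
  def distance(a: Double, b: Double): Double = math.abs(a - b)

  // Categories encoded as Doubles: the compiler accepts a meaningless operation.
  val appleCode: Double = 0.0
  val orangeCode: Double = 1.0
  val nonsense: Double = distance(appleCode, orangeCode)

  // With a distinct type, the same call is rejected at compile time:
  // distance(Apple, Orange)  // does not compile: a Fruit is not a Double

  // Equality is still a meaningful operation on categories.
  val a: Fruit = Apple
  val b: Fruit = Orange
  val same: Boolean = a == b
}
```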
    
    DecisionTreeModel makes it a little more concrete. I imagine it should have 
both the RegressionModel and ClassificationModel traits eventually? Both traits 
want a predict method that returns Double, OK. This gets a little dicier since 
it's not just a classifier method returning Doubles-that-are-labels, but a 
single method that at other times returns Doubles-that-are-numbers.
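
    To make that concrete, here is a minimal sketch. The two traits below are 
simplified stand-ins with assumed signatures, not the real MLlib definitions, 
and TreeModelSketch is hypothetical. One predict method satisfies both traits, 
and nothing in its type says whether the Double is a label or a number:

```scala
// Simplified stand-ins for the MLlib traits; these signatures are assumptions.
trait RegressionModel { def predict(features: Array[Double]): Double }
trait ClassificationModel { def predict(features: Array[Double]): Double }

// Hypothetical tree model: a single predict method implements both traits.
class TreeModelSketch(isClassifier: Boolean)
    extends RegressionModel with ClassificationModel {
  override def predict(features: Array[Double]): Double =
    if (isClassifier) {
      // The result is a label ID that merely happens to be stored in a Double.
      if (features.sum > 1.0) 1.0 else 0.0
    } else {
      // The result is a real-valued estimate (here, the mean of the features).
      features.sum / features.length
    }
}

object TreeModelDemo {
  val x: Array[Double] = Array(1.0, 2.0)
  // Same method, same return type, two different meanings.
  val label: Double = new TreeModelSketch(isClassifier = true).predict(x)
  val value: Double = new TreeModelSketch(isClassifier = false).predict(x)
}
```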
    
    The AUC-related PR raises a second point. How would you return a 
distribution over labels? That's a fairly sensible thing to do, esp. for things 
like random forests. A single Double can't encode that; it would have to be at 
least a bunch of (Double,Double) pairs or something. That PR proceeds by just 
making a second 
set of methods. In any event that kind of change also needs an API change, of a 
different kind.
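
    One possible shape for such a method, purely as an assumption on my part 
(ProbabilisticClassifier, predictDistribution and VotingForestSketch are 
invented names, not the API in that PR or in MLlib), would return (label, 
probability) pairs and derive the single-label prediction from them:

```scala
// Hypothetical API shape; not the one used in the AUC PR or in MLlib.
trait ProbabilisticClassifier {
  // Each pair is (label, probability); the probabilities sum to 1.
  def predictDistribution(features: Array[Double]): Seq[(Double, Double)]

  // The single-label prediction is the most probable label.
  def predict(features: Array[Double]): Double =
    predictDistribution(features)
      .reduce((best, next) => if (next._2 > best._2) next else best)
      ._1
}

// Toy forest: each "tree" votes for a label, and the vote shares form the
// distribution. Assumes at least one tree.
class VotingForestSketch(trees: Seq[Array[Double] => Double])
    extends ProbabilisticClassifier {
  def predictDistribution(features: Array[Double]): Seq[(Double, Double)] = {
    val votes = trees.map(tree => tree(features))
    votes.groupBy(identity).toSeq.map { case (label, sameVotes) =>
      (label, sameVotes.size.toDouble / votes.size)
    }
  }
}

object ForestDemo {
  // Two trees vote for label 1.0 and one votes for label 0.0.
  val forest = new VotingForestSketch(Seq(
    (_: Array[Double]) => 1.0,
    (_: Array[Double]) => 1.0,
    (_: Array[Double]) => 0.0
  ))
  val dist: Map[Double, Double] = forest.predictDistribution(Array(0.5)).toMap
  val label: Double = forest.predict(Array(0.5))
}
```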
    
    From the PMML angle -- you could note that PMML's abstraction treats 
categoricals as a different type from numerics. If those models were to be 
supported, there would have to be some translation layer on top, within MLlib, 
to emulate the abstraction at the same level. So then maybe that's a reason to 
just make it MLlib's abstraction too?
    
    OK enough of all that, it's a big tangent. That's the argument I have for 
at least deferring setting this in stone.

