GitHub user srowen commented on the pull request:
https://github.com/apache/spark/pull/372#issuecomment-40027762
There are two components to why I think this might want to change.
First is just the type-safety issue, which is the same reason one would use
an enum instead of int in Java. Encoding categoricals as continuous values
invites, maybe unwittingly, invalid operations, like finding the distance
between "apple" and "orange". Or trying to apply linear regression as a
classifier. I suppose it also means there must be a translation layer, yes, in
all cases.
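To make the type-safety point concrete, here is a minimal sketch. The `Fruit` type and the `encode`/`decode` helpers are hypothetical names made up for illustration, not anything in MLlib. It only shows that Double-encoded labels happily accept arithmetic, while a sealed type forces an explicit translation layer.

```scala
object LabelTypes {
  // Hypothetical categorical type for illustration; not part of MLlib.
  sealed trait Fruit
  case object Apple extends Fruit
  case object Orange extends Fruit

  // Labels encoded as Doubles: the compiler accepts arithmetic on them,
  // even though "distance between apple and orange" has no meaning.
  val appleCode: Double = 0.0
  val orangeCode: Double = 1.0
  val meaninglessDistance: Double = math.abs(appleCode - orangeCode)

  // The translation layer between typed labels and Double codes.
  def encode(f: Fruit): Double = f match {
    case Apple  => 0.0
    case Orange => 1.0
  }

  // Decoding can fail: most Doubles are not valid label codes.
  def decode(d: Double): Option[Fruit] = d match {
    case 0.0 => Some(Apple)
    case 1.0 => Some(Orange)
    case _   => None
  }
}
```

With the typed version, something like `Apple - Orange` simply does not compile, which is the same guarantee an enum gives over an int in Java.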
DecisionTreeModel makes it a little more concrete. I imagine it should have
both the RegressionModel and ClassificationModel traits eventually? Both traits
want a predict method that returns Double, OK. This gets a little dicier, since
it's then not just a classifier method returning Doubles-that-are-labels, but a
single method that at other times returns Doubles-that-are-numbers.
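A rough sketch of that situation, under loud assumptions: the two traits below are simplified stand-ins (plain `Array[Double]` features) rather than the real MLlib signatures, and `TreeModel` with its `isClassifier` flag and toy prediction rule is entirely hypothetical.

```scala
object PredictAmbiguity {
  // Simplified stand-ins for the two model traits; not the real MLlib signatures.
  trait ClassificationModel { def predict(features: Array[Double]): Double }
  trait RegressionModel { def predict(features: Array[Double]): Double }

  // Hypothetical tree model that mixes in both traits.
  class TreeModel(isClassifier: Boolean)
      extends ClassificationModel with RegressionModel {
    // One method satisfies both traits, so the static type cannot say
    // whether the returned Double is a label or a number.
    override def predict(features: Array[Double]): Double =
      if (isClassifier) {
        if (features.sum > 1.0) 1.0 else 0.0 // a Double that is a label
      } else {
        features.sum / features.length // a Double that is a number
      }
  }
}
```

Both `new TreeModel(true)` and `new TreeModel(false)` type-check as either trait, so a caller has to know out of band how to interpret the result.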
The AUC-related PR raises a second point. How would you return a
distribution over labels? That's a fairly sensible thing to do, esp. for things
like random forests. A single Double can't encode that; the result would at least
have to be a bunch of (Double,Double) pairs or something. That PR proceeds by just making a second
set of methods. In any event that kind of change also needs an API change, of a
different kind.
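For illustration only, one possible shape for such a return value. The `predictDistribution` name and the idea of feeding it per-tree votes are assumptions of this sketch, not the API of the AUC-related PR or of MLlib.

```scala
object LabelDistribution {
  // Hypothetical method: turns per-tree votes (Double-encoded labels) into
  // (label, probability) pairs, sorted by label.
  def predictDistribution(votes: Seq[Double]): Seq[(Double, Double)] = {
    val total = votes.size.toDouble
    votes
      .groupBy(identity)
      .toSeq
      .map { case (label, vs) => (label, vs.size / total) }
      .sortBy(_._1)
  }

  // A single-Double predict can be derived from the distribution by taking
  // the most probable label. Assumes votes is non-empty.
  def predict(votes: Seq[Double]): Double =
    predictDistribution(votes).maxBy(_._2)._1
}
```

The single-label `predict` falls out of the distribution but not the other way around, which is why returning a distribution needs a different return type and not just another Double.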
From the PMML angle -- you could note that the abstraction in PMML thinks
of categoricals as a different type from numeric. If those models were to be
supported there would have to be some translating layer on top, within MLlib,
to emulate the abstraction at the same level. So then maybe that's a reason to
just make it MLlib's abstraction too?
OK enough of all that, it's a big tangent. That's the argument I have for
at least deferring setting this in stone.