Github user jkbradley commented on the pull request:
https://github.com/apache/spark/pull/3637#issuecomment-66368865
@srowen @Lewuathe Continuing the above inline discussion...
Question: Should the typed interface be public?
New proposal: Hide the typed interface of Estimators. Leave the typed
interface of Transformers exposed.
Argument:
* The typed interface loses metadata which SchemaRDD can (but does not yet)
store.
* E.g., for Classifiers, it is good to know the number of classes to
predict, which features are categorical, and the number of categories for each
categorical feature. The current typed train() methods do not have this info;
to pass in this info, we'll need either (a) extra parameters in train(), which
would give Classifiers a different train() signature from other Estimators, or
(b) extra embedded parameters in Classifiers, which would be ignored when using
the fit(SchemaRDD) interface. Neither option sounds good to me.
* We could use a typed interface with stronger typing for features, but
that would still not cover metadata like # classes / categories.
* This metadata is important for training, but it is not important for
testing. We would just need to make sure that Vectors passed to predict()
methods have the same feature order as was used for training.
* I would guess the typed interface would be most useful for Models. This
is based on my assumptions that:
* Models will be kept for longer and might have predict() methods called
multiple times, including on individual instances, and
* Models might need typed APIs for efficiency if used in production.
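
To make the proposal concrete, here is a rough sketch in plain Scala. Every name in it (FeatureMetadata, Dataset, MajorityClassifier, MajorityClassModel) is a hypothetical stand-in and not the API in this PR: Dataset plays the role of a SchemaRDD that already stores metadata, and Array[Double] stands in for a Vector.

```scala
// Hypothetical stand-ins for illustration only; none of these are Spark classes.

// Metadata the typed train() methods currently have no way to receive.
final case class FeatureMetadata(
    numClasses: Int,
    categoricalArity: Map[Int, Int]) // feature index -> number of categories

// Stand-in for a SchemaRDD that carries metadata alongside the rows.
final case class Dataset(
    rows: Seq[(Double, Array[Double])], // (label, features)
    metadata: FeatureMetadata)

trait Model {
  // Typed interface stays public: cheap to call on a single instance.
  def predict(features: Array[Double]): Double
}

trait Estimator[M <: Model] {
  // Only the dataset-based interface is public; metadata travels with the data,
  // so every Estimator keeps the same signature.
  def fit(dataset: Dataset): M
}

// Toy model: always predicts the most frequent training label.
final class MajorityClassModel(val numClasses: Int, majority: Double) extends Model {
  override def predict(features: Array[Double]): Double = majority
}

final class MajorityClassifier extends Estimator[MajorityClassModel] {
  override def fit(dataset: Dataset): MajorityClassModel = {
    require(dataset.rows.nonEmpty, "need at least one training row")
    // Group rows by label and pick the label with the most rows.
    val majority = dataset.rows.groupBy(_._1).maxBy(_._2.size)._1
    // numClasses comes from the dataset metadata, not from extra parameters.
    new MajorityClassModel(dataset.metadata.numClasses, majority)
  }
}
```

With this shape, the Estimator needs no extra train() parameters for the number of classes or categories, while predict() can still be called directly on individual instances.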
What do you think?