Github user srowen commented on the pull request:
https://github.com/apache/spark/pull/3637#issuecomment-66430592
So, I may not be 100% up to speed with the new API and these changes, so my
comments may be a bit off, but:
An Estimator makes a Model. To make a model, you need "raw data" and its
interpretation, if you will. A LabeledPoint is "raw data". That alone is not
sufficient to train a Classifier (Estimator). Yes, this extra info has to come
from somewhere.
I agree that SchemaRDD contains, or could contain, or could be made to
deduce, this extra interpretation, so the SchemaRDD API makes sense to me.
If LabeledPoint is to remain the "raw data", given the conversation here,
then the interpretation has to be passed as parameters or something. I think
you still need it for testing, right? You still need to know what the raw
data means. Or is it assumed that the built Classifier / Model stores this
info?
This is sort of a rehash of the same exchange we just had, in that the
question is caused by the input data abstraction not really containing all the
input -- the metadata comes along separately. Which could be OK but yes it
means this question pops up somewhere else in the API.
Yes, a Model may be able to remember the metadata and accept raw
LabeledPoints in the future. You just have to make sure you are feeding raw
LabeledPoints that use the same metadata, but that's a given no matter how you
design this.
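
To make the "Model remembers the metadata" idea concrete, here is a minimal
sketch in plain Python. It is not the Spark API: every name in it
(FeatureMetadata, MajorityClassEstimator, MajorityClassModel) is hypothetical,
and the "classifier" is a trivial majority-class predictor chosen only to keep
the example short.

```python
from collections import Counter
from dataclasses import dataclass
from typing import FrozenSet, List, Sequence


@dataclass(frozen=True)
class LabeledPoint:
    """Raw data only: a label and a feature vector, with no interpretation."""
    label: float
    features: Sequence[float]


@dataclass(frozen=True)
class FeatureMetadata:
    """Hypothetical container for the interpretation of the raw features."""
    num_features: int
    categorical: FrozenSet[int]  # indices of features to treat as categorical


class MajorityClassModel:
    """Toy model that keeps the metadata it was trained with."""

    def __init__(self, prediction: float, metadata: FeatureMetadata):
        self.prediction = prediction
        self.metadata = metadata

    def predict(self, point: LabeledPoint) -> float:
        # Only the vector width can be checked here. Whether the caller's
        # features carry the same meaning as the training data cannot be
        # verified from a raw LabeledPoint.
        if len(point.features) != self.metadata.num_features:
            raise ValueError("feature vector width differs from training data")
        return self.prediction


class MajorityClassEstimator:
    """Toy estimator: needs the raw data and its interpretation to fit."""

    def fit(self, data: List[LabeledPoint],
            metadata: FeatureMetadata) -> MajorityClassModel:
        most_common_label = Counter(p.label for p in data).most_common(1)[0][0]
        return MajorityClassModel(most_common_label, metadata)
```

Under these assumptions the metadata is supplied once, at fit time, and
prediction takes raw LabeledPoints only. The stored metadata lets the model
reject obviously mismatched input, but feeding points that use the same
interpretation remains the caller's responsibility.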
To answer the question: given the choice, I'd hide the typed API, I
suppose. I think the typed API has to take some other values to contain
metadata like the type of features, etc. These could be more parameters, then?
It kind of overloads the meaning of parameters, since they look like they are
intended to be hyperparameters. But it's not crazy.
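
A small sketch of what that overloading looks like, again in plain Python
with hypothetical names (TreeParams, depth_grid) rather than anything from
the PR. One parameter object carries both a hyperparameter and a piece of
data metadata, and a tuning helper has to know which is which.

```python
from dataclasses import dataclass, field, replace
from typing import Dict, List


@dataclass(frozen=True)
class TreeParams:
    """Hypothetical parameter bundle mixing two kinds of values."""
    # Hyperparameter: something a user or a grid search would vary.
    max_depth: int = 5
    # Data metadata riding along as a parameter:
    # feature index -> number of categories. This describes the input,
    # so it should stay fixed across every candidate model.
    categorical_features: Dict[int, int] = field(default_factory=dict)


def depth_grid(base: TreeParams, depths: List[int]) -> List[TreeParams]:
    """Build tuning candidates that vary only the hyperparameter."""
    return [replace(base, max_depth=d) for d in depths]
```

Nothing in the type distinguishes the two fields, which is the overloading in
question: code that searches over parameters must be told separately that
categorical_features is not a tuning knob.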
Transformations: these feel like they could meaningfully operate on raw
data, so the typed API makes sense to me and could be public now.