[GitHub] spark pull request: [SPARK-4789] [mllib] Standardize ML Prediction...

srowen Wed, 10 Dec 2014 02:18:06 -0800

Github user srowen commented on the pull request:

    https://github.com/apache/spark/pull/3637#issuecomment-66430592
  
    So, I may not be 100% up to speed with the new API and these changes, so my 
comments may be a bit off, but:
    
    An Estimator makes a Model. To make a model, you need "raw data" and its 
interpretation, if you will. A LabeledPoint is "raw data". That alone is not 
sufficient to train a Classifier (Estimator). Yes, this extra info has to come 
from somewhere.
    
    I agree that SchemaRDD contains, or could contain, or could be made to 
deduce, this extra interpretation, so the SchemaRDD API makes sense to me.
    
    If LabeledPoint is to remain the "raw data", given the conversation here, 
then the metadata has to be passed as parameters or something similar. I think 
you still need these for testing, right? You still need to know what the raw 
data means. Or is it assumed that the built Classifier / Model stores this info?
    
    This is sort of a rehash of the same exchange we just had, in that the 
question is caused by the input data abstraction not really containing all the 
input -- the metadata comes along separately. Which could be OK but yes it 
means this question pops up somewhere else in the API.
    
    Yes, a Model may be able to remember the metadata and accept raw 
LabeledPoints in the future. You just have to make sure you are feeding raw 
LabeledPoints that use the same metadata, but that's a given no matter how you 
design this.
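    
    As a rough illustration of that idea (not the Spark API; every name below, 
including FeatureMetadata, MajorityClassifier and MajorityModel, is 
hypothetical), here is a minimal sketch in which the metadata is passed 
separately at fit time, the Model keeps it, and prediction then accepts raw 
feature tuples while checking them against what was remembered:

```python
from collections import Counter
from dataclasses import dataclass
from typing import FrozenSet, Sequence, Tuple

# All names here are hypothetical; this is not the Spark MLlib API.


@dataclass(frozen=True)
class LabeledPoint:
    """Raw data only: a label and a feature vector, with no interpretation."""
    label: float
    features: Tuple[float, ...]


@dataclass(frozen=True)
class FeatureMetadata:
    """The interpretation of the raw data, carried separately from it."""
    num_features: int
    categorical: FrozenSet[int]  # indices of categorical features


@dataclass(frozen=True)
class MajorityModel:
    """A trivial model that remembers the metadata it was trained with."""
    metadata: FeatureMetadata
    prediction: float

    def predict(self, features: Sequence[float]) -> float:
        # The model can only check what it remembers, e.g. the vector length.
        if len(features) != self.metadata.num_features:
            raise ValueError("features do not match the training metadata")
        return self.prediction


class MajorityClassifier:
    """Estimator: raw data plus metadata in, Model out."""

    def fit(self, data: Sequence[LabeledPoint],
            metadata: FeatureMetadata) -> MajorityModel:
        if not data:
            raise ValueError("cannot fit on empty data")
        for point in data:
            if len(point.features) != metadata.num_features:
                raise ValueError("training point does not match metadata")
        # Predict the most frequent label seen during training.
        majority_label, _ = Counter(p.label for p in data).most_common(1)[0]
        return MajorityModel(metadata=metadata, prediction=majority_label)
```

    Note that the length check is about all the model can verify on its own; 
whether a given index still means the same thing is up to the caller.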
    
    To answer the question: I'd hide the typed API, I suppose. I think the 
typed API has to take some other values to contain metadata like the type of 
features, etc. These could be more parameters, then? That kind of overloads the 
meaning, since the parameters look like they are intended to be 
hyperparameters. But it's not crazy.
    
    Transformations: these feel like they could meaningfully operate on raw 
data, so a typed API makes sense to me and could be public now.



