[jira] [Commented] (SPARK-5995) Make ML Prediction Developer APIs public

Joseph K. Bradley (JIRA) Mon, 20 Apr 2015 18:07:11 -0700

    [ https://issues.apache.org/jira/browse/SPARK-5995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14504090#comment-14504090 ]


Joseph K. Bradley commented on SPARK-5995:
------------------------------------------

I just updated the design doc linked above with a new section "Post-Part 1 
Assessment" detailing a few issues.

> Make ML Prediction Developer APIs public
> ----------------------------------------
>
>                 Key: SPARK-5995
>                 URL: https://issues.apache.org/jira/browse/SPARK-5995
>             Project: Spark
>          Issue Type: Sub-task
>          Components: ML
>    Affects Versions: 1.3.0
>            Reporter: Joseph K. Bradley
>            Assignee: Joseph K. Bradley
>
> Previously, some Developer APIs were added to spark.ml for classification and 
> regression to make it easier to add new algorithms and models: [SPARK-4789].
> There are ongoing discussions about the best design of the API.  This JIRA is 
> to continue that discussion and try to finalize those Developer APIs so that 
> they can be made public.
> Please see [this design doc from SPARK-4789|https://docs.google.com/document/d/1BH9el33kBX8JiDdgUJXdLW14CA2qhTCWIG46eXZVoJs]
> for details on the original API design.
> Some issues under debate:
> * Should there be strongly typed APIs for fit()?
> * Should the strongly typed API for transform() be public (vs. protected)?
> * What transformation methods should the API make developers implement for 
> classification?  (See details below.)
> * Should there be a way to transform a single Row (instead of only 
> DataFrames)?
> More on "What transformation methods should the API make developers implement 
> for classification?":
> * Goals:
> ** Optimize transform: Make it fast, and make it output only the desired 
> columns.
> ** Easy development
> ** Support Classifier, Regressor, and ProbabilisticClassifier
> * (Current approach) Developers implement a predictX method for each output 
> column X. They may override transform() to optimize speed.
> ** Pros: predictX is easy to understand.
> ** Cons: An optimized transform() is annoying to write.
> * Developers implement more basic transformation methods, such as 
> features2raw, raw2pred, raw2prob.
> ** Pros: Abstract classes may implement optimized transform().
> ** Cons: Different types of predictors require different methods:
> *** Predictor and Regressor: features2pred
> *** Classifier: features2raw, raw2pred
> *** ProbabilisticClassifier: raw2prob
> * Developers implement a single predict() method which takes parameters for 
> what columns to output (returning tuple or some type with None for missing 
> values).  Abstract classes take the outputs they want and put them into 
> columns.
> ** Pros: Developers only write 1 method and can optimize it as much as they 
> want.  It could be more optimized than the previous 2 options; e.g., if 
> LogisticRegressionModel only wants the prediction, then it never has to 
> construct intermediate results such as the vector of raw predictions.
> ** Cons: predict() will have a different signature for different 
> abstractions, based on the possible output columns.
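
To make the trade-off between the last two options concrete, here is a minimal, Spark-free sketch in Scala. It is not the spark.ml API: the names SketchProbabilisticClassifier, SketchBinaryLogistic, Outputs, and transformRow are hypothetical, and only features2raw, raw2pred, raw2prob, and predict() echo names from the description above. The abstract class shows the "basic transformation methods" option, where shared code composes the steps; the predict() method shows the "single method" option, where the developer is told which outputs are wanted and can skip intermediate results.

```scala
// Hypothetical sketch only: none of these types exist in spark.ml.
object PredictionApiSketch {
  type Features = Array[Double]

  // "Basic transformation methods" option: developers implement the small steps,
  // and the abstract class composes them once for every model.
  abstract class SketchProbabilisticClassifier {
    def features2raw(features: Features): Array[Double]
    def raw2prob(raw: Array[Double]): Array[Double]

    // Default rule: the predicted label is the index of the largest raw score.
    def raw2pred(raw: Array[Double]): Double = raw.indexOf(raw.max).toDouble

    // Shared per-row transform: the raw vector is computed once and reused,
    // and the probability is only computed when it is requested.
    def transformRow(features: Features, wantProb: Boolean): (Double, Option[Array[Double]]) = {
      val raw = features2raw(features)
      (raw2pred(raw), if (wantProb) Some(raw2prob(raw)) else None)
    }
  }

  // "Single predict()" option: outputs that were not requested come back as None.
  final case class Outputs(
      prediction: Option[Double],
      raw: Option[Array[Double]],
      probability: Option[Array[Double]])

  class SketchBinaryLogistic(weights: Array[Double], intercept: Double)
      extends SketchProbabilisticClassifier {

    private def margin(features: Features): Double =
      weights.zip(features).map { case (w, x) => w * x }.sum + intercept

    private def sigmoid(m: Double): Double = 1.0 / (1.0 + math.exp(-m))

    override def features2raw(features: Features): Array[Double] = {
      val m = margin(features)
      Array(-m, m)
    }

    override def raw2prob(raw: Array[Double]): Array[Double] = {
      val p1 = sigmoid(raw(1))
      Array(1.0 - p1, p1)
    }

    // When only the prediction is wanted, no raw or probability vector is allocated.
    def predict(
        features: Features,
        wantPrediction: Boolean,
        wantRaw: Boolean,
        wantProb: Boolean): Outputs = {
      val m = margin(features)
      Outputs(
        prediction = if (wantPrediction) Some(if (m > 0) 1.0 else 0.0) else None,
        raw = if (wantRaw) Some(Array(-m, m)) else None,
        probability = if (wantProb) { val p1 = sigmoid(m); Some(Array(1.0 - p1, p1)) } else None
      )
    }
  }
}
```

In this sketch, transformRow always builds the raw vector, even when only the prediction is needed, whereas predict() works directly from the scalar margin. The cost of the second style shows up in the signature: a regressor-like abstraction would need a different set of flags and a different Outputs type, which is the "different signature for different abstractions" concern listed above.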



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

