[ 
https://issues.apache.org/jira/browse/SPARK-5995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14343644#comment-14343644
 ] 

Joseph K. Bradley edited comment on SPARK-5995 at 3/2/15 7:38 PM:
------------------------------------------------------------------

Pinging all people who commented on 
[https://github.com/apache/spark/pull/3637]:  [~sparks], [~shivaram], 
[~lewuathe], [~srowen], [~tomerk], [~prudenko], [~mengxr]

If you have further thoughts about what other changes would make it easier for 
developers to write new algorithms in spark.ml, please discuss them here!  I'll 
mull this over for a while before making a PR.  Currently, the main change I'm 
planning is the option from the issue description: "Developers implement 
more basic transformation methods, such as features2raw, raw2pred, raw2prob."  
But if there are other useful changes, please say so, even if they involve 
removing some of the abstractions or functionality introduced in my previous PR.
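
To make the planned change concrete, here is a rough sketch of how the 
features2raw / raw2pred / raw2prob split could look.  It is plain Scala with no 
Spark dependency: Array[Double] stands in for a feature Vector, and the names 
Wanted, Output, transformOne, ClassifierSketch, ProbabilisticClassifierSketch 
and ToyLogistic are hypothetical, not part of spark.ml.  The point is only that 
the abstract class can own one shared transform that computes the raw scores 
once and derives just the requested columns.

```scala
// Hypothetical sketch only: none of these types exist in spark.ml.
object Features2RawSketch {
  // Which output columns the caller asked for (illustrative names).
  final case class Wanted(raw: Boolean = false, prob: Boolean = false, pred: Boolean = false)

  // One row of output; a column that was not requested stays None.
  final case class Output(
      raw: Option[Array[Double]] = None,
      prob: Option[Array[Double]] = None,
      pred: Option[Double] = None)

  trait ClassifierSketch {
    // Developers implement this: features -> one raw score per class.
    def features2raw(features: Array[Double]): Array[Double]

    // Default: the predicted label is the index of the largest raw score.
    def raw2pred(raw: Array[Double]): Double = raw.indexOf(raw.max).toDouble
  }

  trait ProbabilisticClassifierSketch extends ClassifierSketch {
    // Developers implement this: raw scores -> class probabilities.
    def raw2prob(raw: Array[Double]): Array[Double]

    // Shared per-row transform: raw scores are computed once and reused.
    final def transformOne(features: Array[Double], wanted: Wanted): Output = {
      val raw = features2raw(features)
      Output(
        raw = if (wanted.raw) Some(raw) else None,
        prob = if (wanted.prob) Some(raw2prob(raw)) else None,
        pred = if (wanted.pred) Some(raw2pred(raw)) else None)
    }
  }

  // Toy binary logistic model that only fills in the basic methods.
  final class ToyLogistic(weights: Array[Double], intercept: Double)
      extends ProbabilisticClassifierSketch {
    def features2raw(features: Array[Double]): Array[Double] = {
      val margin = weights.zip(features).map { case (w, x) => w * x }.sum + intercept
      Array(-margin, margin)
    }

    def raw2prob(raw: Array[Double]): Array[Double] = {
      val p1 = 1.0 / (1.0 + math.exp(-raw(1)))
      Array(1.0 - p1, p1)
    }
  }
}
```

Note that in this sketch the raw vector is always built, even when only the 
prediction is requested, which is the overhead the single-predict() option in 
the description tries to avoid.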

Thanks in advance!


> Make ML Prediction Developer APIs public
> ----------------------------------------
>
>                 Key: SPARK-5995
>                 URL: https://issues.apache.org/jira/browse/SPARK-5995
>             Project: Spark
>          Issue Type: Sub-task
>          Components: ML
>    Affects Versions: 1.3.0
>            Reporter: Joseph K. Bradley
>            Assignee: Joseph K. Bradley
>
> Previously, some Developer APIs were added to spark.ml for classification and 
> regression to make it easier to add new algorithms and models: [SPARK-4789].  
> There are ongoing discussions about the best design of the API.  This JIRA is 
> to continue that discussion and try to finalize those Developer APIs so that 
> they can be made public.
> Please see [this design doc from SPARK-4789 | 
> https://docs.google.com/document/d/1BH9el33kBX8JiDdgUJXdLW14CA2qhTCWIG46eXZVoJs]
>  for details on the original API design.
> Some issues under debate:
> * Should there be strongly typed APIs for fit()?
> * Should the strongly typed API for transform() be public (vs. protected)?
> * What transformation methods should the API make developers implement for 
> classification?  (See details below.)
> * Should there be a way to transform a single Row (instead of only 
> DataFrames)?
> More on "What transformation methods should the API make developers implement 
> for classification?":
> * Goals:
> ** Optimize transform: Make it fast, and make it output only the desired 
> columns.
> ** Easy development
> ** Support Classifier, Regressor, and ProbabilisticClassifier
> * (currently) Developers implement predictX methods for each output column X. 
>  They may override transform() to optimize speed.
> ** Pros: predictX is easy to understand.
> ** Cons: An optimized transform() is annoying to write.
> * Developers implement more basic transformation methods, such as 
> features2raw, raw2pred, raw2prob.
> ** Pros: Abstract classes may implement optimized transform().
> ** Cons: Different types of predictors require different methods:
> *** Predictor and Regressor: features2pred
> *** Classifier: features2raw, raw2pred
> *** ProbabilisticClassifier: raw2prob
> * Developers implement a single predict() method which takes parameters for 
> which columns to output (returning a tuple or some type with None for missing 
> values).  Abstract classes take the outputs they want and put them into 
> columns.
> ** Pros: Developers only write 1 method and can optimize it as much as they 
> want.  It could be more optimized than the previous 2 options; e.g., if 
> LogisticRegressionModel only wants the prediction, then it never has to 
> construct intermediate results such as the vector of raw predictions.
> ** Cons: predict() will have a different signature for different 
> abstractions, based on the possible output columns.
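
As an illustration of the last option (a single predict() with optional 
outputs), here is a minimal sketch in plain Scala with no Spark dependency.  
Array[Double] stands in for a feature Vector, and Wanted, Output, 
ProbabilisticClassifierSketch and ToyLogistic are hypothetical names, not 
spark.ml APIs.  It shows the pro listed above: when only the prediction is 
requested, no intermediate vectors are constructed.

```scala
// Hypothetical sketch only: none of these types exist in spark.ml.
object SinglePredictSketch {
  // Which output columns the caller asked for (illustrative names).
  final case class Wanted(raw: Boolean = false, prob: Boolean = false, pred: Boolean = false)

  // One row of output; a column that was not requested is None.
  final case class Output(
      raw: Option[Array[Double]],
      prob: Option[Array[Double]],
      pred: Option[Double])

  trait ProbabilisticClassifierSketch {
    // The single method a developer writes. A plain Classifier would need a
    // different signature (no prob), which is the con listed above.
    def predict(features: Array[Double], wanted: Wanted): Output
  }

  // Toy binary logistic model.
  final class ToyLogistic(weights: Array[Double], intercept: Double)
      extends ProbabilisticClassifierSketch {
    def predict(features: Array[Double], wanted: Wanted): Output = {
      val margin = weights.zip(features).map { case (w, x) => w * x }.sum + intercept

      // The label depends only on the sign of the margin, so the raw and
      // probability vectors are allocated only when explicitly requested.
      val pred = if (wanted.pred) Some(if (margin > 0.0) 1.0 else 0.0) else None
      val raw = if (wanted.raw) Some(Array(-margin, margin)) else None
      val prob =
        if (wanted.prob) {
          val p1 = 1.0 / (1.0 + math.exp(-margin))
          Some(Array(1.0 - p1, p1))
        } else None

      Output(raw, prob, pred)
    }
  }
}
```

An abstract class would call predict() with the Wanted flags derived from its 
output-column parameters and copy each non-empty field into the matching column.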



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
