[
https://issues.apache.org/jira/browse/SPARK-5995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14504090#comment-14504090
]
Joseph K. Bradley commented on SPARK-5995:
------------------------------------------
I just updated the design doc linked in the issue description with a new
section, "Post-Part 1 Assessment," detailing a few issues.
> Make ML Prediction Developer APIs public
> ----------------------------------------
>
> Key: SPARK-5995
> URL: https://issues.apache.org/jira/browse/SPARK-5995
> Project: Spark
> Issue Type: Sub-task
> Components: ML
> Affects Versions: 1.3.0
> Reporter: Joseph K. Bradley
> Assignee: Joseph K. Bradley
>
> Previously, some Developer APIs were added to spark.ml for classification and
> regression to make it easier to add new algorithms and models: [SPARK-4789].
> There are ongoing discussions about the best design of the API. This JIRA is
> to continue that discussion and try to finalize those Developer APIs so that
> they can be made public.
> Please see [this design doc from SPARK-4789 |
> https://docs.google.com/document/d/1BH9el33kBX8JiDdgUJXdLW14CA2qhTCWIG46eXZVoJs]
> for details on the original API design.
> Some issues under debate:
> * Should there be strongly typed APIs for fit()?
> * Should the strongly typed API for transform() be public (vs. protected)?
> * What transformation methods should the API make developers implement for
> classification? (See details below.)
> * Should there be a way to transform a single Row (instead of only
> DataFrames)?
> More on "What transformation methods should the API make developers implement
> for classification?":
> * Goals:
> ** Optimize transform: Make it fast, and make it output only the desired
> columns.
> ** Easy development
> ** Support Classifier, Regressor, and ProbabilisticClassifier
> * (Current approach) Developers implement predictX methods for each output column X.
> They may override transform() to optimize speed.
> ** Pros: predictX is easy to understand.
> ** Cons: An optimized transform() is annoying to write.
> * Developers implement more basic transformation methods, such as
> features2raw, raw2pred, raw2prob.
> ** Pros: Abstract classes may implement optimized transform().
> ** Cons: Different types of predictors require different methods:
> *** Predictor and Regressor: features2pred
> *** Classifier: features2raw, raw2pred
> *** ProbabilisticClassifier: raw2prob
> * Developers implement a single predict() method which takes parameters for
> what columns to output (returning tuple or some type with None for missing
> values). Abstract classes take the outputs they want and put them into
> columns.
> ** Pros: Developers only write 1 method and can optimize it as much as they
> want. It could be more optimized than the previous 2 options; e.g., if
> LogisticRegressionModel only wants the prediction, then it never has to
> construct intermediate results such as the vector of raw predictions.
> ** Cons: predict() will have a different signature for different
> abstractions, based on the possible output columns.
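
To make the trade-off between the last two options concrete, here is a rough sketch in plain Python (not Spark code, and not the actual spark.ml API). It assumes a binary logistic regression model; the `Outputs` type, the `want_*` flags, and `transform_row` are hypothetical names invented for illustration, while `features2raw`, `raw2prob`, `raw2pred`, and `predict` follow the method names in the description above.

```python
import math
from typing import List, NamedTuple, Optional, Sequence


class Outputs(NamedTuple):
    """Hypothetical result type: None marks an output column that was not requested."""
    raw: Optional[List[float]] = None
    prob: Optional[List[float]] = None
    pred: Optional[float] = None


class BasicMethodsLogReg:
    """'Basic transformation methods' option: the developer writes small building blocks."""

    def __init__(self, weights: Sequence[float], intercept: float) -> None:
        self.weights = list(weights)
        self.intercept = intercept

    def features2raw(self, features: Sequence[float]) -> List[float]:
        margin = sum(w * x for w, x in zip(self.weights, features)) + self.intercept
        return [-margin, margin]  # one raw score per class

    def raw2prob(self, raw: Sequence[float]) -> List[float]:
        p1 = 1.0 / (1.0 + math.exp(-raw[1]))
        return [1.0 - p1, p1]

    def raw2pred(self, raw: Sequence[float]) -> float:
        return 1.0 if raw[1] > raw[0] else 0.0

    def transform_row(self, features: Sequence[float], want_raw: bool = False,
                      want_prob: bool = False, want_pred: bool = True) -> Outputs:
        # What an abstract class could supply (assumption): compute raw once and
        # reuse it. The raw vector is always built, even if only pred is wanted.
        raw = self.features2raw(features)
        return Outputs(
            raw=raw if want_raw else None,
            prob=self.raw2prob(raw) if want_prob else None,
            pred=self.raw2pred(raw) if want_pred else None,
        )


class SinglePredictLogReg:
    """'Single predict()' option: one method decides what to compute."""

    def __init__(self, weights: Sequence[float], intercept: float) -> None:
        self.weights = list(weights)
        self.intercept = intercept

    def predict(self, features: Sequence[float], want_raw: bool = False,
                want_prob: bool = False, want_pred: bool = True) -> Outputs:
        margin = sum(w * x for w, x in zip(self.weights, features)) + self.intercept
        # Intermediate results are constructed only when requested.
        raw = [-margin, margin] if want_raw else None
        prob = None
        if want_prob:
            p1 = 1.0 / (1.0 + math.exp(-margin))
            prob = [1.0 - p1, p1]
        pred = (1.0 if margin > 0 else 0.0) if want_pred else None
        return Outputs(raw=raw, prob=prob, pred=pred)
```

In this sketch both classes return the same values for the same request. The difference is that `SinglePredictLogReg.predict` never allocates the raw vector when only the prediction is wanted, at the cost of a signature that would have to change for each abstraction's set of output columns.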