[ 
https://issues.apache.org/jira/browse/SPARK-3702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14153763#comment-14153763
 ] 

Joseph K. Bradley commented on SPARK-3702:
------------------------------------------

Thanks for taking a close look!

* Abstraction of Multilabel
  Things definitely get more complex with multiple labels, and the best way to 
handle them is not clear to me.  I agree it would not make sense to have a 
separate type for each combination of label types.  Perhaps 
the abstraction should be MultilabelEstimator, which can predict any 
combination of categories and/or real values.
** Note: It should not be a list of Estimators since proper multilabel 
prediction would do joint prediction, rather than predicting each label 
separately.
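
A minimal sketch of what such an abstraction could look like. Every name here (Label, Categorical, Continuous, MultilabelModel, MostFrequentLabelVector) is hypothetical and not part of MLlib, and plain Scala collections stand in for RDDs and MLlib vectors. The point is only that predict returns all labels from one call, so an implementation is free to predict them jointly.

```scala
// Hypothetical sketch: none of these types exist in MLlib.

// A label is either a category or a real value.
sealed trait Label
final case class Categorical(category: Int) extends Label
final case class Continuous(value: Double) extends Label

// A fitted model returns every label for an instance from a single call,
// so an implementation can model dependencies between labels.
trait MultilabelModel {
  def predict(features: Vector[Double]): Seq[Label]
}

// One estimator type covers any combination of categorical and real labels.
trait MultilabelEstimator {
  def fit(data: Seq[(Vector[Double], Seq[Label])]): MultilabelModel
}

// Trivial baseline: always predict the most frequent complete label vector
// seen in training. Treating the label vector as one unit is the simplest
// possible form of joint prediction.
object MostFrequentLabelVector extends MultilabelEstimator {
  def fit(data: Seq[(Vector[Double], Seq[Label])]): MultilabelModel = {
    require(data.nonEmpty, "training data must be non-empty")
    val mostCommon: Seq[Label] = data.groupBy(_._2).maxBy(_._2.size)._1
    new MultilabelModel {
      def predict(features: Vector[Double]): Seq[Label] = mostCommon
    }
  }
}
```

A list of independent per-label estimators could not express this baseline, because it depends on how often whole label vectors occur together.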

* Model-based vs. memory-based
  Would these two concepts affect the public API?  I don't think they would, 
but do you have an example of why there should be a shared abstract class?
** For k-nearest-neighbors, I think the same Classifier and Classifier.Model 
abstraction would work.  The Classifier would ideally compute some nice data 
structure for finding nearest neighbors, and the Model would store that data 
structure (or the original dataset for a very basic implementation).
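
A rough sketch of the "very basic implementation" case, assuming a hypothetical Classifier trait with a nested Classifier.Model. These signatures are illustrative only, not the proposed MLlib API, and a local Seq stands in for a distributed dataset. The model simply stores the training data and scans it at prediction time.

```scala
// Hypothetical interfaces, for illustration only.
trait Classifier {
  def train(data: Seq[(Double, Array[Double])]): Classifier.Model
}

object Classifier {
  trait Model {
    def predict(features: Array[Double]): Double
  }
}

// Memory-based model: keeps the original (label, features) pairs.
final class KNNModel(k: Int, data: Seq[(Double, Array[Double])])
    extends Classifier.Model {

  private def squaredDistance(a: Array[Double], b: Array[Double]): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  def predict(features: Array[Double]): Double = {
    // Brute-force scan: sort all points by distance and keep the k closest.
    val neighbors =
      data.sortBy { case (_, x) => squaredDistance(x, features) }.take(k)
    // Majority vote over the neighbors' labels.
    neighbors.groupBy(_._1).maxBy(_._2.size)._1
  }
}

// The learner has nothing to fit here; a better version would build a
// nearest-neighbor index and hand that to the model instead.
final class KNNClassifier(k: Int) extends Classifier {
  require(k > 0, "k must be positive")

  def train(data: Seq[(Double, Array[Double])]): Classifier.Model = {
    require(data.nonEmpty, "training data must be non-empty")
    new KNNModel(k, data)
  }
}
```

From the caller's side this looks the same as a model-based learner: train returns a Model, and the Model predicts.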

* Model vs. Estimator Abstraction
  I think you're bringing up an important point about public vs. developer 
interfaces.  Here's what I mean:
** Public interfaces: For most users, the functionality is the most important 
aspect.  E.g., most users need to know they are using a Classifier, regardless 
of whether it is a DecisionTree or a GLM.
** Developer (private[mllib]) interfaces: For developers, abstractions such as 
DecisionTree and GLM are very important.
** Proposal: As part of the "Standardize MLlib interfaces" work, I hope to first 
clarify the public interfaces and decide which interfaces need to be exposed.  
As needed, we can then work on improving the developer interfaces for specific 
groups of algorithms.
*** For this, the [JIRA on clarifying GLM interfaces | 
https://issues.apache.org/jira/browse/SPARK-3251] seems like an important one, 
but it may be blocked by updates to the public MLlib API.

Does that sound reasonable?

With respect to traits vs. abstract classes, I agree it may be good to define 
the lightweight public interfaces as traits as much as possible.
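
One way the public vs. developer split could look, as a sketch under stated assumptions: the enclosing object mllib stands in for the real package so the example is self-contained, and the names Classifier, GLM, and LinearClassifier are hypothetical rather than existing MLlib classes. The public interface is a lightweight trait, and the shared implementation lives in a private[mllib] abstract class.

```scala
// An object is used in place of a package so that private[mllib] can be
// shown in a single self-contained snippet.
object mllib {

  // Public interface: all most users need to know about.
  trait Classifier {
    def predict(features: Array[Double]): Double
  }

  // Developer interface: shared implementation for linear models,
  // hidden from users outside mllib.
  private[mllib] abstract class GLM(weights: Array[Double], intercept: Double) {
    protected def margin(features: Array[Double]): Double = {
      require(features.length == weights.length, "dimension mismatch")
      weights.zip(features).map { case (w, x) => w * x }.sum + intercept
    }
  }

  // Concrete algorithm: reuses the developer abstraction and exposes
  // only the public trait's functionality.
  final class LinearClassifier(weights: Array[Double], intercept: Double)
      extends GLM(weights, intercept) with Classifier {
    def predict(features: Array[Double]): Double =
      if (margin(features) > 0.0) 1.0 else 0.0
  }
}
```

User code would then depend on mllib.Classifier alone, which leaves the GLM hierarchy free to change without affecting the public API.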

I am almost done with the initial prototype code and will post it soon.

> Standardize MLlib classes for learners, models
> ----------------------------------------------
>
>                 Key: SPARK-3702
>                 URL: https://issues.apache.org/jira/browse/SPARK-3702
>             Project: Spark
>          Issue Type: Sub-task
>          Components: MLlib
>            Reporter: Joseph K. Bradley
>            Assignee: Joseph K. Bradley
>            Priority: Blocker
>
> Summary: Create a class hierarchy for learning algorithms and the models 
> those algorithms produce.
> Goals:
> * give intuitive structure to API, both for developers and for generated 
> documentation
> * support meta-algorithms (e.g., boosting)
> * support generic functionality (e.g., evaluation)
> * reduce code duplication across classes
> [Design doc for class hierarchy | 
> https://docs.google.com/document/d/1I-8PD0DSLEZzzXURYZwmqAFn_OMBc08hgDL1FZnVBmw/]



