[ https://issues.apache.org/jira/browse/SPARK-29212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16938465#comment-16938465 ]

Maciej Szymkiewicz commented on SPARK-29212:
--------------------------------------------

[~podongfeng] First of all, thank you for creating this ticket.

 

??Would you like to help work on this???

If we can reach some consensus here, sure. It's only a dozen lines of code anyway.

However, personally I am more interested in the general direction. As I tried 
to point out (not sure how successfully), the current resolution of 
SPARK-28985, despite its negligible impact as such, sets a dangerous precedent 
and breaks with existing ML API conventions. At this point I still cannot tell 
whether that happened by accident or is intentional (not being actively 
involved in the development process, I get the impression that the latter 
might be true).
  

> Add common classes without using JVM backend
> --------------------------------------------
>
>                 Key: SPARK-29212
>                 URL: https://issues.apache.org/jira/browse/SPARK-29212
>             Project: Spark
>          Issue Type: Sub-task
>          Components: ML, PySpark
>    Affects Versions: 3.0.0
>            Reporter: zhengruifeng
>            Priority: Major
>
> copied from [https://github.com/apache/spark/pull/25776].
>  
> Maciej's *Concern*:
> *Use cases for public ML type hierarchy*
>  * Add Python-only Transformer implementations:
>  ** I am a Python user and want to implement a pure Python ML classifier 
> without providing a JVM backend.
>  ** I want this classifier to be meaningfully positioned in the existing type 
> hierarchy.
>  ** However, I have access only to the high-level classes ({{Estimator}}, 
> {{Model}}, {{MLReader}} / {{MLReadable}}).
>  * Run-time parameter validation for both user-defined (see above) and 
> existing class hierarchies:
>  ** I am a library developer who provides functions that are meaningful only 
> for specific categories of {{Estimators}} - here, classifiers.
>  ** I want to validate that a user-passed argument is indeed a classifier:
>  *** For built-in objects, using the "private" type hierarchy is not really 
> satisfying (actually, what is the rationale behind making it "private"? If 
> the goal is Scala API parity, and the Scala counterparts are public, 
> shouldn't these be too?).
>  *** For user-defined objects I can:
>  **** Use duck typing (on {{setRawPredictionCol}} for a classifier, on 
> {{numClasses}} for a classification model), but that is hardly satisfying 
> (see the sketch after this list).
>  **** Provide a parallel non-abstract type hierarchy ({{Classifier}}, 
> {{PythonClassifier}} and so on) and require users to implement such 
> interfaces. That, however, would require separate logic for checking 
> built-in and user-provided classes.
>  **** Provide a parallel abstract type hierarchy, register all existing 
> built-in classes, and require users to do the same.
> Clearly these are not satisfying solutions, as they require either defensive 
> programming or reinventing the same functionality for different 3rd-party 
> APIs.
>  * Static type checking:
>  ** I am either an end user or a library developer and want to use PEP 484 
> annotations to indicate components that require a classifier or a 
> classification model.
>  ** Currently I can provide only imprecise annotations, [such 
> as|https://github.com/zero323/pyspark-stubs/blob/dd5cfc9ef1737fc3ccc85c247c5116eaa4b9df18/third_party/3/pyspark/ml/classification.pyi#L241]:
> {code:python}
> def setClassifier(self, value: Estimator[M]) -> OneVsRest: ...
> {code}
> or try to narrow things down using structural subtyping:
> {code:python}
> class Classifier(Protocol, Estimator[M]):
>     def setRawPredictionCol(self, value: str) -> Classifier: ...
> 
> class ClassificationModel(Protocol, Model):
>     def setRawPredictionCol(self, value: str) -> Model: ...
>     def numClasses(self) -> int: ...
> {code}
>  
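> A minimal sketch of the two run-time workarounds above (all names are 
> illustrative, not actual PySpark API):
> {code:python}
> from typing import Protocol, runtime_checkable
> 
> # Workaround 1: duck typing. Fragile, since any object that happens to
> # expose the right method name passes, classifier or not.
> def requires_classifier_duck(stage):
>     if not hasattr(stage, "setRawPredictionCol"):
>         raise TypeError("expected a classifier-like stage")
> 
> # Workaround 2: structural subtyping. Plays well with static checkers,
> # but isinstance() still only verifies method names, not semantics.
> @runtime_checkable
> class ClassifierLike(Protocol):
>     def setRawPredictionCol(self, value: str): ...
> 
> def requires_classifier_proto(stage):
>     if not isinstance(stage, ClassifierLike):
>         raise TypeError("expected a classifier-like stage")
> {code}
>  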
> Maciej's *Proposal*:
> The Python ML hierarchy should reflect the Scala hierarchy first (@srowen), i.e.
> {code:python}
> class ClassifierParams: ...
> 
> class Predictor(Estimator, PredictorParams):
>     def setLabelCol(self, value): ...
>     def setFeaturesCol(self, value): ...
>     def setPredictionCol(self, value): ...
> 
> class Classifier(Predictor, ClassifierParams):
>     def setRawPredictionCol(self, value): ...
> 
> class PredictionModel(Model, PredictorParams):
>     def setFeaturesCol(self, value): ...
>     def setPredictionCol(self, value): ...
>     def numFeatures(self): ...
>     def predict(self, value): ...
> {code}
> and JVM interop should extend from this hierarchy, i.e.
> {code:python}
> class JavaPredictionModel(PredictionModel): ...
> {code}
> In other words, it should be consistent with the existing approach, where we 
> have ABCs reflecting the Scala API ({{Transformer}}, {{Estimator}}, {{Model}}) 
> and so on, and Java* variants are their subclasses.
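> Under that proposal, the first use case above reduces to plain subclassing. A 
> hypothetical sketch, assuming the proposed public {{Classifier}} ABC existed 
> (it does not today):
> {code:python}
> from pyspark.ml.classification import Classifier  # proposed public ABC
> 
> class PurePythonClassifier(Classifier):
>     """Python-only classifier; no JVM backend required."""
>     def _fit(self, dataset):
>         # pure-Python training logic would go here
>         raise NotImplementedError
> 
> # Run-time validation then becomes a single isinstance() check that
> # works for built-in and user-defined classifiers alike:
> def validate_classifier(value):
>     if not isinstance(value, Classifier):
>         raise TypeError("value must be a Classifier")
>     return value
> {code}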



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
