zero323 commented on issue #25776: [SPARK-28985][PYTHON][ML] Add common classes 
(JavaPredictor/JavaClassificationModel/JavaProbabilisticClassifier) in PYTHON
URL: https://github.com/apache/spark/pull/25776#issuecomment-533488999
 
 
   @zhengruifeng 
   
   **Use cases for public ML type hierarchy**
   
   - Add Python-only Transformet implementations:
     - I am Python user and want to implement pure Python ML classifier without 
providing JVM backend.
     - I want this classifier to be meaningfully positioned in the existing 
type hierarchy.
     - However I have access only to high level classes (`Estimator`, `Model`, 
`MLReader` / `MLReadable`).
   
   - Run time parameter validation for both user defined (see above) and 
existing class hierarchy, 
     - I am a library developer who provides functions that are meaningful only 
for specific categories of `Estimators` - here classifiers.
     - I want to validate that user passed argument is indeed a classifier:
       - For built-in objects using "private" type hierarchy is not really 
satisfying (actually, what is the rationale behind making it "private"? If the 
goal is Scala API parity, and Scala counterparts are public, shouldn't these be 
too?).
      - For user defined objects I can:
        - Use duck typing (on `setRawPredictionCol` for classifier, on 
`numClasses` for classification model) but it hardly satisfying.
        - Provide parallel non-abstract type hierarchy (`Classifier` or 
`PythonClassifier` and so on) and require users to implement such interfaces. 
That however would require separate logic for checking for built-in and and 
user-provided classes.
        - Provide parallel abstract type hierarchy, register all existing 
built-in classes and require users to do the same. 
   
       Clearly these are not satisfying solutions as they require either 
defensive programming or reinventing the same functionality for different 3rd 
party APIs.
   
   - Static type checking
      - I am either end user or library developer and want to use PEP-484 
annotations to indicate components that require classifier or classification 
model.
      - Currently I can provide only imprecise annotations, [such 
as](https://github.com/zero323/pyspark-stubs/blob/dd5cfc9ef1737fc3ccc85c247c5116eaa4b9df18/third_party/3/pyspark/ml/classification.pyi#L241)
   
         ```python
         def setClassifier(self, value: Estimator[M]) -> OneVsRest: ...
         ```
   
        or try to narrow things down using structural subtyping:
      
        ```python
        class Classifier(Protocol, Estimator[M]):
            def setRawPredictionCol(self, value: str) -> Classifier: ...
   
        class Classifier(Protocol, Model):
            def setRawPredictionCol(self, value: str) -> Model: ...
            def numClasses(self) -> int: ...
        ```
       
   **State before this PR**
   
   As describe above. None of the described scenarios is really supported and 
verbose and repetitive workarounds are required.
   
   **Impact of this PR**
   
   > how newly add common classes in this PR affects the end users to implement 
their own hierarchy
   
   Strictly speaking this PR doesn't doesn't change anything.  However it
   
   - Solidifies and extends existing, IMHO highly insufficient structure, with 
it's "virtually private" classes and subordination to JVM interop. This can 
make any future improvements less likely.
   - Is a lost opportunity and, at least to some extent, fails to fulfill its 
own goal to _"have the same structure as Scala"_
   
   
   **Proposal**
   
   Python ML hierarchy should reflect Scala hierarchy first (@srowen), i.e.
   
   ```python
   class ClassifierParams: ...
   
   class Predictor(Estimator,PredictorParams):
       def setLabelCol(self, value): ...
       def setFeaturesCol(self, value): ...
       def setPredictionCol(self, value): ...
   
   class Classifier(Predictor, ClassifierParams):
       def setRawPredictionCol(self, value): ...
   
   class PredictionModel(Model,PredictorParams):
       def setFeaturesCol
       def setPredictionCol
       def numFeatures
       def predict
   ```
   
   and JVM interop should extend from this hierarchy, i.e.
   
   ```python
   class JavaPredictionModel(PredictionModel): ...
   ```

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to