zero323 commented on issue #25776: [SPARK-28985][PYTHON][ML] Add common classes (JavaPredictor/JavaClassificationModel/JavaProbabilisticClassifier) in PYTHON URL: https://github.com/apache/spark/pull/25776#issuecomment-533488999 @zhengruifeng **Use cases for public ML type hierarchy** - Add Python-only Transformet implementations: - I am Python user and want to implement pure Python ML classifier without providing JVM backend. - I want this classifier to be meaningfully positioned in the existing type hierarchy. - However I have access only to high level classes (`Estimator`, `Model`, `MLReader` / `MLReadable`). - Run time parameter validation for both user defined (see above) and existing class hierarchy, - I am a library developer who provides functions that are meaningful only for specific categories of `Estimators` - here classifiers. - I want to validate that user passed argument is indeed a classifier: - For built-in objects using "private" type hierarchy is not really satisfying (actually, what is the rationale behind making it "private"? If the goal is Scala API parity, and Scala counterparts are public, shouldn't these be too?). - For user defined objects I can: - Use duck typing (on `setRawPredictionCol` for classifier, on `numClasses` for classification model) but it hardly satisfying. - Provide parallel non-abstract type hierarchy (`Classifier` or `PythonClassifier` and so on) and require users to implement such interfaces. That however would require separate logic for checking for built-in and and user-provided classes. - Provide parallel abstract type hierarchy, register all existing built-in classes and require users to do the same. Clearly these are not satisfying solutions as they require either defensive programming or reinventing the same functionality for different 3rd party APIs. - Static type checking - I am either end user or library developer and want to use PEP-484 annotations to indicate components that require classifier or classification model. - Currently I can provide only imprecise annotations, [such as](https://github.com/zero323/pyspark-stubs/blob/dd5cfc9ef1737fc3ccc85c247c5116eaa4b9df18/third_party/3/pyspark/ml/classification.pyi#L241) ```python def setClassifier(self, value: Estimator[M]) -> OneVsRest: ... ``` or try to narrow things down using structural subtyping: ```python class Classifier(Protocol, Estimator[M]): def setRawPredictionCol(self, value: str) -> Classifier: ... class Classifier(Protocol, Model): def setRawPredictionCol(self, value: str) -> Model: ... def numClasses(self) -> int: ... ``` **State before this PR** As describe above. None of the described scenarios is really supported and verbose and repetitive workarounds are required. **Impact of this PR** > how newly add common classes in this PR affects the end users to implement their own hierarchy Strictly speaking this PR doesn't doesn't change anything. However it - Solidifies and extends existing, IMHO highly insufficient structure, with it's "virtually private" classes and subordination to JVM interop. This can make any future improvements less likely. - Is a lost opportunity and, at least to some extent, fails to fulfill its own goal to _"have the same structure as Scala"_ **Proposal** Python ML hierarchy should reflect Scala hierarchy first (@srowen), i.e. ```python class ClassifierParams: ... class Predictor(Estimator,PredictorParams): def setLabelCol(self, value): ... def setFeaturesCol(self, value): ... def setPredictionCol(self, value): ... class Classifier(Predictor, ClassifierParams): def setRawPredictionCol(self, value): ... class PredictionModel(Model,PredictorParams): def setFeaturesCol def setPredictionCol def numFeatures def predict ``` and JVM interop should extend from this hierarchy, i.e. ```python class JavaPredictionModel(PredictionModel): ... ```
---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [email protected] With regards, Apache Git Services --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
