Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12066#discussion_r61619308
  
    --- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/Classifier.scala ---
    @@ -62,6 +92,14 @@ abstract class Classifier[
       /** @group setParam */
       def setRawPredictionCol(value: String): E = set(rawPredictionCol, 
value).asInstanceOf[E]
     
    +  override def fit(dataset: Dataset[_]): M = {
    +    // This handles a few items such as schema validation.
    +    // Developers only need to implement train().
    +    transformSchema(dataset.schema, logging = true)
    +    copyValues(train(dataset).setParent(this))
    +      .setPredictionMetadata(generatePredictionMetadata(dataset.schema))
    --- End diff --
    
    The scenario I'd like to address is the following:
    * Training data has no metadata for label column
    * Classifier.transformSchema should make a best-effort at adding metadata.  
In this case, it will note the predictionCol is a Nominal value and set its 
name.
    * During training, the algorithm identifies the number of classes from the 
data.
    * The ClassificationModel knows the number of classes, so it should set 
that in the transform and transformSchema calls.
    This happens with NaiveBayes, for example.
    
    Are you referring to the fact that ClassificationModel.transformSchema 
calls ClassifierParams.validateAndTransformSchema, and the 
validateAndTransformSchema method does not take numClasses as an argument?  We 
can modify validateAndTransformSchema to take an optional numClasses argument.
    
    validateAndTransformSchema is technically a public API (unfortunately), so 
we could:
    * deprecate the current validateAndTransformSchema
    * make a new copy which takes the numClasses argument.  This copy could be 
made ```private[classification]```.
    * switch all uses to the new copy
    
    I do think this work is in the scope of this PR; I don't see much benefit 
to adding metadata if we don't make a best-effort attempt, and I don't think it 
will be too much extra code.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to