Github user avulanov commented on the pull request:

    https://github.com/apache/spark/pull/1484#issuecomment-51033011
  
    @mengxr 
    1.  Do I understand correctly that you propose that
`fit(dataset: RDD[LabeledPoint])` should compute feature scores according to
the feature selection algorithm, and `transform(dataset: RDD[LabeledPoint])`
should return the filtered dataset?
    2.  Such an interface seems to allow misuse when someone calls 
`transform` before `fit`. In some sense it is similar to calling `predict` 
before actually learning the model. The MLlib classification models avoid 
this by means of the `ClassificationModel` interface, which has `predict` 
only. Each individual classifier has a companion object that does the 
training and returns a model instance. I like this approach more because it 
is less error-prone from the user's perspective, but it is a little bit 
implicit from the developer's perspective (you need to know that you have to 
implement a factory). Long story short, why not seal `fit` inside the 
constructor or inside the object?
    ```
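    import org.apache.spark.mllib.regression.LabeledPoint
    import org.apache.spark.rdd.RDD

    trait FeatureSelector extends Serializable {
      def transform(dataset: RDD[LabeledPoint]): RDD[LabeledPoint]
    }

    // EITHER: fit inside the constructor
    class ChiSquaredFeatureSelector(dataset: RDD[LabeledPoint], numFeatures: Int)
      extends FeatureSelector {
      // perform chi-squared computations on the constructor's `dataset` here...
      override def transform(dataset: RDD[LabeledPoint]): RDD[LabeledPoint] =
        ??? // return `dataset` restricted to the selected features
    }

    // OR (like in classification models): fit inside the companion object.
    // This is an alternative definition of the same class, not an addition.
    class ChiSquaredFeatureSelector extends FeatureSelector {
      private def fit(dataset: RDD[LabeledPoint], numFeatures: Int): Unit =
        ??? // perform chi-squared computations...
      override def transform(dataset: RDD[LabeledPoint]): RDD[LabeledPoint] =
        ??? // return `dataset` restricted to the selected features
    }
    object ChiSquaredFeatureSelector {
      // The only public way to train: returns an already fitted selector.
      def fit(dataset: RDD[LabeledPoint], numFeatures: Int): ChiSquaredFeatureSelector = {
        val chi = new ChiSquaredFeatureSelector
        chi.fit(dataset, numFeatures) // the companion may call the private method
        chi
      }
    }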
    ```
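
For illustration only, here is a minimal runnable sketch of the second
variant (training in the companion object). It makes several assumptions that
are not part of the proposal: a hypothetical `Point` case class and plain
`Seq` stand in for `LabeledPoint` and `RDD` so that it runs without Spark, the
name `TopKSelector` is made up, and per-feature variance is used as a
placeholder score instead of the chi-squared statistic.

```scala
// Hypothetical stand-in for LabeledPoint so the sketch runs without Spark.
final case class Point(label: Double, features: Vector[Double])

trait FeatureSelector extends Serializable {
  def transform(dataset: Seq[Point]): Seq[Point]
}

// The constructor is private, so an instance can only be obtained from
// TopKSelector.fit; calling transform before fit is impossible.
final class TopKSelector private (val selected: Vector[Int]) extends FeatureSelector {
  // Keep only the selected feature indices; labels are left unchanged.
  override def transform(dataset: Seq[Point]): Seq[Point] =
    dataset.map(p => p.copy(features = selected.map(i => p.features(i))))
}

object TopKSelector {
  // Placeholder score: population variance of one feature column.
  // This is NOT the chi-squared statistic.
  private def variance(xs: Seq[Double]): Double = {
    val mean = xs.sum / xs.size
    xs.map(x => (x - mean) * (x - mean)).sum / xs.size
  }

  // Scores every feature and returns a selector that is already fitted.
  def fit(dataset: Seq[Point], numFeatures: Int): TopKSelector = {
    require(dataset.nonEmpty, "dataset must not be empty")
    val dim = dataset.head.features.size
    val scores = (0 until dim).map(i => (i, variance(dataset.map(_.features(i)))))
    val top = scores
      .sortWith(_._2 > _._2) // highest score first
      .take(numFeatures)
      .map(_._1)
      .sorted                // keep the original feature order
      .toVector
    new TopKSelector(top)
  }
}
```

With this shape, `TopKSelector.fit(data, 2).transform(data)` is the only way
to reach `transform`, which is the property argued for in point 2.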

