Github user avulanov commented on the pull request:
https://github.com/apache/spark/pull/1484#issuecomment-51033011
@mengxr
1. Do I understand correctly that you propose that `fit(dataset:
RDD[LabeledPoint])` should compute feature scores according to the feature
selection algorithm, and `transform(dataset: RDD[LabeledPoint])` should return
the filtered dataset?
2. It seems that such an interface allows misuse: someone can call
`transform` before `fit`. In a sense, this is similar to calling `predict`
before actually learning the model. The MLlib classification models avoid
this by means of the `ClassificationModel` interface, which has `predict`
only. Each individual classifier has a companion object that does the
training and returns an instance of the model. I like this approach more
because it is less error-prone from the user's perspective, although it is a
little implicit from the developer's perspective (you need to know that you
have to implement a factory). Long story short, why not seal `fit` inside the
constructor or inside the companion object?
```
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

trait FeatureSelector extends Serializable {
  def transform(dataset: RDD[LabeledPoint]): RDD[LabeledPoint]
}

// EITHER: do the fitting inside the constructor
class ChiSquaredFeatureSelector(dataset: RDD[LabeledPoint], numFeatures: Int)
  extends FeatureSelector {
  // perform chi squared computations...
  // implement transform
  override def transform(dataset: RDD[LabeledPoint]): RDD[LabeledPoint] = ???
}

// OR (like in classification models): do the fitting inside the companion object
class ChiSquaredFeatureSelector extends FeatureSelector {
  private def fit(dataset: RDD[LabeledPoint], numFeatures: Int): Unit = ???
  // implement transform
  override def transform(dataset: RDD[LabeledPoint]): RDD[LabeledPoint] = ???
}

object ChiSquaredFeatureSelector {
  def fit(dataset: RDD[LabeledPoint], numFeatures: Int): ChiSquaredFeatureSelector = {
    val chi = new ChiSquaredFeatureSelector
    chi.fit(dataset, numFeatures)
    chi
  }
}
```
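
To make the user-facing side of the second option concrete, here is a minimal
sketch that runs without Spark. Everything in it is an assumption made for
illustration: `Point` stands in for `LabeledPoint`, `Seq` stands in for `RDD`,
and `VarianceSelector` scores features by plain variance as a placeholder for
the chi-squared computation. The only point is that a selector instance can be
obtained solely through `fit`, so `transform` cannot be called on an unfitted
object.

```scala
// Hypothetical stand-in for LabeledPoint so the sketch runs without Spark.
final case class Point(label: Double, features: Vector[Double])

trait FeatureSelector extends Serializable {
  def transform(dataset: Seq[Point]): Seq[Point]
}

// The constructor is private, so the only way to get an instance is `fit`.
final class VarianceSelector private (val selected: Vector[Int])
  extends FeatureSelector {
  // Keep only the selected feature indices in every point.
  override def transform(dataset: Seq[Point]): Seq[Point] =
    dataset.map(p => p.copy(features = selected.map(i => p.features(i))))
}

object VarianceSelector {
  // Scores each feature by its variance (placeholder for chi-squared) and
  // keeps the indices of the numFeatures highest-scoring features.
  def fit(dataset: Seq[Point], numFeatures: Int): VarianceSelector = {
    require(dataset.nonEmpty, "dataset must not be empty")
    val n = dataset.size.toDouble
    val dim = dataset.head.features.size
    val scores = (0 until dim).map { j =>
      val column = dataset.map(_.features(j))
      val mean = column.sum / n
      column.map(x => (x - mean) * (x - mean)).sum / n
    }
    val top = scores.zipWithIndex.sortBy(-_._1).take(numFeatures).map(_._2)
    new VarianceSelector(top.sorted.toVector)
  }
}

object Demo {
  val data = Seq(
    Point(1.0, Vector(0.0, 5.0, 1.0)),
    Point(0.0, Vector(0.0, 1.0, 3.0)),
    Point(1.0, Vector(0.0, 9.0, 2.0))
  )
  // Training and construction happen in one step.
  val selector = VarianceSelector.fit(data, numFeatures = 2)
  val filtered = selector.transform(data)
}
```

With this shape, `new VarianceSelector(...)` does not compile outside the
companion object, so user code has no way to reach `transform` without going
through `fit` first.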