GitHub user mengxr commented on the pull request:
https://github.com/apache/spark/pull/1484#issuecomment-51086836
@avulanov I have the same concern about calling `transform` before `fit`.
There are two options: 1) throw an error, or 2) fit on the same dataset and then
transform (`fit_transform` in scikit-learn). I don't have a strong preference for
either one.
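
To make the two options concrete, here is a minimal sketch on plain Scala collections rather than RDDs. `StatefulSelector` and its placeholder `fit` (which simply keeps the first `numFeatures` columns and is not a chi-squared test) are hypothetical names invented for illustration, not anything in the PR:

```scala
// Hypothetical stateful selector, only to contrast the two options.
class StatefulSelector(numFeatures: Int) {
  // Indices chosen by fit; None until fit has been called.
  private var selected: Option[Array[Int]] = None

  // Placeholder fit: keeps the first numFeatures indices (NOT chi-squared).
  def fit(data: Seq[Array[Double]]): this.type = {
    require(data.nonEmpty, "cannot fit on an empty dataset")
    selected = Some(Array.range(0, math.min(numFeatures, data.head.length)))
    this
  }

  // Option 1: throw an error if transform is called before fit.
  def transform(data: Seq[Array[Double]]): Seq[Array[Double]] = selected match {
    case Some(idx) => data.map(row => idx.map(i => row(i)))
    case None      => throw new IllegalStateException("transform called before fit")
  }

  // Option 2: fit on the given dataset, then transform it.
  def fitTransform(data: Seq[Array[Double]]): Seq[Array[Double]] =
    fit(data).transform(data)
}
```

With option 1 the mistake surfaces at runtime as an exception; with option 2 the caller never sees an unfitted state, but the same data is always used for both steps.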
I'd like to add another candidate to the ones you proposed:
~~~
class ChiSquaredFeatureSelection {
  def fit(dataset: RDD[LabeledPoint], numFeatures: Int): ChiSquaredFeatureSelector
}

class ChiSquaredFeatureSelector {
  def transform(dataset: RDD[LabeledPoint]): RDD[LabeledPoint]
}
~~~
We can discuss the class hierarchy later since it is not user-facing.
A problem with all the candidates here is that we cannot apply the same
transformation to `RDD[Vector]`, which is required for prediction. I'm thinking
about something like the following:
~~~
class ChiSquaredFeatureSelection {
  def fit[T <: Vectorized with Labeled](dataset: RDD[T], numFeatures: Int): ChiSquaredFeatureSelector
}

class ChiSquaredFeatureSelector {
  def transform[T <: Vectorized](dataset: RDD[T]): RDD[T]
}
~~~
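
A rough, self-contained sketch of how the generic version might behave, using `Seq` in place of `RDD` so it runs without Spark. Everything here is an assumption made for illustration: the `Vectorized` and `Labeled` traits, the `withFeatures` method (one possible way to return the same element type `T`), the example case classes, and the scoring rule, which ranks features by absolute covariance with the label as a stand-in for the chi-squared statistic:

```scala
// Hypothetical traits; neither is an existing MLlib type.
trait Vectorized[Self] {
  def features: Array[Double]
  // Assumed helper so transform can rebuild an element of the same type.
  def withFeatures(newFeatures: Array[Double]): Self
}
trait Labeled { def label: Double }

// Training-time element: has both a label and features.
final case class LabeledExample(label: Double, features: Array[Double])
    extends Vectorized[LabeledExample] with Labeled {
  def withFeatures(f: Array[Double]): LabeledExample = copy(features = f)
}

// Prediction-time element: features only.
final case class UnlabeledExample(features: Array[Double])
    extends Vectorized[UnlabeledExample] {
  def withFeatures(f: Array[Double]): UnlabeledExample = copy(features = f)
}

class FeatureSelector(val selected: Array[Int]) {
  // Works for any Vectorized element, labeled or not.
  def transform[T <: Vectorized[T]](dataset: Seq[T]): Seq[T] =
    dataset.map(p => p.withFeatures(selected.map(i => p.features(i))))
}

object FeatureSelection {
  // Needs labels, so T must be both Vectorized and Labeled.
  def fit[T <: Vectorized[T] with Labeled](dataset: Seq[T], numFeatures: Int): FeatureSelector = {
    require(dataset.nonEmpty, "cannot fit on an empty dataset")
    val n = dataset.size.toDouble
    val dim = dataset.head.features.length
    val meanLabel = dataset.map(_.label).sum / n
    // Stand-in score: |covariance(feature j, label)|, NOT the chi-squared statistic.
    val scores = Array.tabulate(dim) { j =>
      val meanJ = dataset.map(_.features(j)).sum / n
      math.abs(dataset.map(p => (p.features(j) - meanJ) * (p.label - meanLabel)).sum / n)
    }
    // Keep the numFeatures highest-scoring indices, in ascending index order.
    val top = scores.zipWithIndex.sortWith((a, b) => a._1 > b._1).take(numFeatures).map(_._2)
    new FeatureSelector(top.sorted)
  }
}
```

The selector returned by `FeatureSelection.fit` on a `Seq[LabeledExample]` can then be applied to a `Seq[UnlabeledExample]` at prediction time, and the result keeps the input's element type.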