Github user avulanov commented on the pull request:
https://github.com/apache/spark/pull/1484#issuecomment-51168631
@mengxr
1) I also have concerns regarding the two options mentioned. Throwing an
error means having a method that fails even when it is called with valid
parameters. Calling `fit` inside `transform` raises the question of what a
subsequent `fit` call would do.
2) Could you explain how an upper bound like `[T <: Vectorized with
Labeled]` could be implemented? `LabeledPoint` is a case class with no class
hierarchy or traits.
3) It seems that all implementations of `transform` will do the same thing:
filter features by index. I propose implementing such a filter; it would also
solve the problem of filtering both `LabeledPoint` and `Vector`:
```scala
trait FeatureFilter {
  val indices: Set[Int]
  def transform(data: RDD[LabeledPoint]): RDD[LabeledPoint] =
    data.map { lp => new LabeledPoint(lp.label, Compress(lp.features, indices)) }
  def transform(data: RDD[Vector]): RDD[Vector] =
    data.map { v => Compress(v, indices) }
}

object Compress {
  def apply(features: Vector, indices: Set[Int]): Vector = {
    // keep only the values whose index is in the given set
    val (values, _) = features.toArray.zipWithIndex
      .filter { case (value, index) => indices.contains(index) }
      .unzip
    Vectors.dense(values.toArray)
  }
}

class ChiSquaredFeatureSelection(data: RDD[LabeledPoint], numFeatures: Int)
    extends FeatureFilter {
  // compute chi-squared statistics and select the feature indices to keep
  val indices: Set[Int] = {....}
}
```
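For illustration, the index-filtering step at the heart of `Compress` can be sketched without any Spark dependencies; plain Scala arrays stand in for MLlib's `Vector`, and the object and method names here are hypothetical:

```scala
// Minimal sketch of the Compress logic, assuming features are a plain
// Array[Double]: keep only the values whose index is in the given set,
// preserving their original order.
object CompressSketch {
  def compress(features: Array[Double], indices: Set[Int]): Array[Double] =
    features.zipWithIndex
      .filter { case (_, index) => indices.contains(index) }
      .map(_._1)

  def main(args: Array[String]): Unit = {
    val v = Array(1.0, 2.0, 3.0, 4.0)
    // keep features 0 and 2
    println(compress(v, Set(0, 2)).mkString(","))  // prints "1.0,3.0"
  }
}
```

The order-preserving `filter`/`unzip` pattern matters: the selected values must stay in their original positions relative to each other so that downstream consumers can map them back to the chosen indices.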