Github user mpjlu commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14597#discussion_r75314112

    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/ChiSqSelector.scala ---
    @@ -189,11 +228,35 @@ class ChiSqSelector @Since("1.3.0") (
        */
       @Since("1.3.0")
       def fit(data: RDD[LabeledPoint]): ChiSqSelectorModel = {
    -    val indices = Statistics.chiSqTest(data)
    -      .zipWithIndex.sortBy { case (res, _) => -res.statistic }
    -      .take(numTopFeatures)
    -      .map { case (_, indices) => indices }
    -      .sorted
    +    chiSqTestResult = Statistics.chiSqTest(data)
    +    selectorType match {
    +      case ChiSqSelectorType.KBest => selectKBest(numTopFeatures)
    +      case ChiSqSelectorType.Percentile => selectPercentile(percentile)
    +      case ChiSqSelectorType.Fpr => selectFpr(alpha)
    +      case _ => throw new Exception("Unknown ChiSqSelector Type")
    +    }
    +  }
    +
    +  @Since("2.1.0")
    +  def selectKBest(value: Int): ChiSqSelectorModel = {
    --- End diff --

    Thanks. What I am suggesting is a way for one selector to generate
    multiple models from a single fit. The ChiSqTestResult is computed in
    fit(DataFrame) and we want to reuse it, so the goal is to fit only once
    and still be able to generate several models.

    As I understand your proposal, suppose we want a KBest model with the
    default numTopFeatures. We can do this:

        val selector = new ChiSqSelector()     // (1)
        val model = selector.fit(dataframe)    // (2)

    Then the user wants a KBest model with only the top 20 features, and
    can do this:

        selector.setNumTopFeatures(20)   // the same selector as (1)

    After that call, the model from (2) selects the top 20 features,
    because it can use the parameters of the selector.

    Is my understanding of your proposal right? Thanks very much.
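    To make the fit-once idea concrete, here is a minimal sketch in plain
    Scala with no Spark dependency. All names in it (FeatureTest,
    SelectorModel, CachedSelector) are hypothetical stand-ins and not the
    classes in this pull request. It also assumes that the percentile is
    given as a fraction between 0 and 1 and that the Fpr rule keeps
    features whose p-value is below alpha; neither is confirmed by the
    diff above.

```scala
// Hypothetical stand-in for one feature's chi-squared test result
// (not Spark's ChiSqTestResult).
final case class FeatureTest(statistic: Double, pValue: Double)

// Hypothetical stand-in for ChiSqSelectorModel: the selected feature indices, ascending.
final case class SelectorModel(selectedFeatures: Vector[Int])

// Holds test results that were computed once, so several selections can reuse them.
final class CachedSelector(results: Vector[FeatureTest]) {
  private val indexed: Vector[(FeatureTest, Int)] = results.zipWithIndex

  // Keep the k features with the largest test statistic.
  def selectKBest(k: Int): SelectorModel = {
    val ranked = indexed.sortWith((a, b) => a._1.statistic > b._1.statistic)
    SelectorModel(ranked.take(k).map(_._2).sorted)
  }

  // Assumption: fraction is in [0, 1] and is applied to the feature count.
  def selectPercentile(fraction: Double): SelectorModel =
    selectKBest((results.size * fraction).toInt)

  // Assumption: keep features whose p-value is strictly below alpha.
  def selectFpr(alpha: Double): SelectorModel =
    SelectorModel(indexed.collect { case (r, i) if r.pValue < alpha => i })
}

object CachedSelectorDemo {
  // Made-up numbers, for illustration only.
  val results: Vector[FeatureTest] = Vector(
    FeatureTest(12.0, 0.001),
    FeatureTest(0.5, 0.48),
    FeatureTest(7.3, 0.007),
    FeatureTest(3.1, 0.08)
  )

  // The expensive step (the chi-squared test) happens once, before construction.
  val selector = new CachedSelector(results)

  // Several models derived from the same cached results.
  val top1: SelectorModel = selector.selectKBest(1)
  val half: SelectorModel = selector.selectPercentile(0.5)
  val significant: SelectorModel = selector.selectFpr(0.1)
}
```

    In this sketch each select call returns a new, independent model, so
    an earlier model is not changed when a later one is created. That is
    different from the reading above, in which model (2) changes after a
    setter is called on the selector.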