Github user mpjlu commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14597#discussion_r75314112
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/ChiSqSelector.scala ---
    @@ -189,11 +228,35 @@ class ChiSqSelector @Since("1.3.0") (
        */
       @Since("1.3.0")
       def fit(data: RDD[LabeledPoint]): ChiSqSelectorModel = {
    -    val indices = Statistics.chiSqTest(data)
    -      .zipWithIndex.sortBy { case (res, _) => -res.statistic }
    -      .take(numTopFeatures)
    -      .map { case (_, indices) => indices }
    -      .sorted
    +    chiSqTestResult = Statistics.chiSqTest(data)
    +    selectorType match {
    +      case ChiSqSelectorType.KBest => selectKBest(numTopFeatures)
    +      case ChiSqSelectorType.Percentile => selectPercentile(percentile)
    +      case ChiSqSelectorType.Fpr => selectFpr(alpha)
    +      case _ => throw new Exception("Unknown ChiSqSelector Type")
    +    }
    +  }
    +
    +  @Since("2.1.0")
    +  def selectKBest(value: Int): ChiSqSelectorModel = {
    --- End diff --
    
    Thanks. What I am suggesting is a way for one selector to generate multiple
    models after fitting only once. We want to reuse the ChiSqTestResult, which
    is generated in fit(DataFrame), so the selector should be fit a single time
    and then be able to produce several models from that result.

    As I understand your proposal, suppose we want to generate a KBest model
    with the default numTopFeatures. We can do this:
    val selector = new ChiSqSelector()          (1)
    val model = selector.fit(dataframe)         (2)
    Then the user wants to generate a KBest model with only the top 20 features.
    The user can do this:
    selector.setNumTopFeatures(20)
    Here the selector is the same one as in (1), and the model from (2) now has
    the top 20 features, because it can use the parameters of the selector.
    Is my understanding of your proposal right?

    Thanks very much.
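
The fit-once, many-models idea in this comment can be sketched without Spark. This is only an illustration under stated assumptions: `FeatureTest`, `SketchModel`, and `SketchSelector` are hypothetical stand-ins rather than Spark classes, and `fit` here takes precomputed per-feature results in place of calling `Statistics.chiSqTest(data)`. Only the ranking logic in `selectKBest` mirrors the lines removed in the diff.

```scala
// Hypothetical stand-in for one feature's chi-squared test result.
final case class FeatureTest(statistic: Double, pValue: Double)

// Hypothetical stand-in for ChiSqSelectorModel: the selected feature indices.
final case class SketchModel(selectedFeatures: Vector[Int])

class SketchSelector {
  // Filled in by fit(); None until fit() has been called.
  private var cached: Option[Seq[FeatureTest]] = None

  // Assumption: in the real selector this is where the expensive
  // Statistics.chiSqTest(data) call would run, exactly once.
  def fit(results: Seq[FeatureTest]): this.type = {
    cached = Some(results)
    this
  }

  // Builds a model from the cached results without running the test again.
  def selectKBest(k: Int): SketchModel = {
    val results = cached.getOrElse(
      throw new IllegalStateException("call fit() before selecting features"))
    val indices = results.zipWithIndex
      .sortBy { case (res, _) => -res.statistic } // largest statistic first
      .take(k)
      .map { case (_, index) => index }
      .sorted
    SketchModel(indices.toVector)
  }
}
```

With this shape, a caller would fit once and then call `selectKBest(50)` and `selectKBest(20)` on the same selector to get two independent models, so an earlier model is not changed when a parameter is set later.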
    