Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/9965#discussion_r45916646
  
    --- Diff: docs/ml-features.md ---
    @@ -1949,3 +1949,52 @@ output.select("features", "label").show()
     {% endhighlight %}
     </div>
     </div>
    +
    +## ChiSqSelector
    +
    +`ChiSqSelector` stands for Chi-Squared feature selection. It operates on labeled data with
    +categorical features. ChiSqSelector orders features based on a
    +[Chi-Squared test of independence](https://en.wikipedia.org/wiki/Chi-squared_test)
    +from the class, and then filters (selects) the top features which the class label depends on the
    +most. This is akin to yielding the features with the most predictive power.
    +
    +**Examples**
    +
    +Assume that we have a DataFrame with the columns `id`, `features`, and `clicked`:
    +
    +~~~
    +id | features              | clicked
    +---|-----------------------|---------
    + 7 | [0.0, 0.0, 18.0, 1.0] | 1.0
    + 8 | [0.0, 1.0, 12.0, 0.0] | 0.0
    + 9 | [1.0, 0.0, 15.0, 0.1] | 0.0
    +~~~
    +
    +If we use `ChiSqSelector` with a `numTopFeatures = 1`, then according to our label `clicked` the
    +last column of our `features` is the result:
    --- End diff --
    
    "last column of our `features` is the result" --> "last column in our `features` is chosen as the most useful feature"
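
    For reference, the ranking described in the diff can be sketched in plain Python. This is only an illustration of a Pearson Chi-Squared statistic computed per feature column against the label; it is not Spark's implementation, and the helper name `chi_squared_statistic` is made up for this example.

```python
from collections import Counter


def chi_squared_statistic(values, labels):
    """Pearson Chi-Squared statistic of one categorical feature vs. the label.

    Hypothetical helper for illustration only; not part of Spark.
    """
    n = len(values)
    observed = Counter(zip(values, labels))  # contingency table counts
    value_totals = Counter(values)           # row totals
    label_totals = Counter(labels)           # column totals
    stat = 0.0
    for value, value_count in value_totals.items():
        for label, label_count in label_totals.items():
            expected = value_count * label_count / n
            stat += (observed[(value, label)] - expected) ** 2 / expected
    return stat


# The three rows from the example table: (features, clicked).
rows = [
    ([0.0, 0.0, 18.0, 1.0], 1.0),
    ([0.0, 1.0, 12.0, 0.0], 0.0),
    ([1.0, 0.0, 15.0, 0.1], 0.0),
]
labels = [label for _, label in rows]
columns = list(zip(*(features for features, _ in rows)))

# One score per feature column; a higher score means stronger dependence.
scores = [chi_squared_statistic(column, labels) for column in columns]

# Column indices ordered by descending score (Python's sort is stable,
# so tied columns keep their original order).
ranked = sorted(range(len(scores)), key=lambda i: -scores[i])
print(scores, ranked)
```

    On this three-row table the last two columns get the same statistic, so which of them ends up as the single top feature depends on how ties are broken. It may be worth checking that the example output in the doc matches what the selector actually returns.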

