[GitHub] spark pull request: [MLLIB] SPARK-5912 Programming guide for featu...

jkbradley Mon, 23 Feb 2015 10:42:04 -0800

Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4709#discussion_r25188678
  
    --- Diff: docs/mllib-feature-extraction.md ---
    @@ -375,3 +375,55 @@ data2 = labels.zip(normalizer2.transform(features))
     {% endhighlight %}
     </div>
     </div>
    +
    +## Feature selection
    +[Feature selection](http://en.wikipedia.org/wiki/Feature_selection) allows selecting the most relevant features for use in model construction. The number of features to select can be determined using the validation set. Feature selection is usually applied on sparse data, for example in text classification. Feature selection reduces the size of the vector space and, in turn, the complexity of any subsequent operation with vectors.
    +
    +### ChiSqSelector
    +ChiSqSelector stands for Chi-Squared feature selection. It operates on the labeled data. ChiSqSelector orders categorical features based on their values of Chi-Squared test on independence from class and filters (selects) top given features.
    +
    +#### Model Fitting
    +
    +[`ChiSqSelector`](api/scala/index.html#org.apache.spark.mllib.feature.ChiSqSelector) has the
    +following parameters in the constructor:
    +
    +* `numTopFeatures` number of top features that selector will select (filter).
    +
    +We provide a [`fit`](api/scala/index.html#org.apache.spark.mllib.feature.ChiSqSelector) method in
    +`ChiSqSelector` which can take an input of `RDD[LabeledPoint]` with categorical features, learn the summary statistics, and then
    +return a model which can transform the input dataset into the reduced feature space.
    +
    +This model implements [`VectorTransformer`](api/scala/index.html#org.apache.spark.mllib.feature.VectorTransformer)
    +which can apply the Chi-Squared feature selection on a `Vector` to produce a reduced `Vector` or on
    +an `RDD[Vector]` to produce a reduced `RDD[Vector]`.
    +
    +Note that the model that performs actual feature filtering can be instantiated independently with array of feature indices that has to be sorted ascending.
    +
    +#### Example
    +
    +The following example shows the basic use of ChiSqSelector.
    +
    +<div class="codetabs">
    +<div data-lang="scala">
    +{% highlight scala %}
    +import org.apache.spark.SparkContext._
    +import org.apache.spark.mllib.linalg.Vectors
    +import org.apache.spark.mllib.regression.LabeledPoint
    +import org.apache.spark.mllib.util.MLUtils
    +
    +// load some data in libsvm format, each point is in the range 0..255
    +val data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")
    +// discretize data in 16 equal bins
    +val discretizedData = data.map { lp =>
    +  LabeledPoint(lp.label, Vectors.dense(lp.features.toArray.map { x => x / 16 } ) )
    +}
    +// create ChiSqSelector that will select 50 features
    +val selector = new ChiSqSelector(50)
    +// create ChiSqSelector model
    +val transformer = selector.fit(discretizedData)
    +// filter top 50 features
    +val filteredData = transformer.transform(discretizedData)
    --- End diff --
    
    Since transform() takes an RDD[Vector] rather than an RDD[LabeledPoint], you'll need to map the data to its features first, and then zip the transformed features back with the labels.
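
    A minimal sketch of that change, following the same labels.zip(...) pattern the Normalizer example earlier in this file uses. It assumes `sc` is an existing SparkContext (as in spark-shell). The `.floor` call and the `labels` and `features` names are my own additions for illustration and are not part of the diff:

    ```scala
    import org.apache.spark.mllib.feature.ChiSqSelector
    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.regression.LabeledPoint
    import org.apache.spark.mllib.util.MLUtils

    // `sc` is assumed to be an existing SparkContext (e.g. the one spark-shell provides).
    val data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")

    // Assumption: floor after dividing by 16 so each feature takes one of 16 discrete values.
    val discretizedData = data.map { lp =>
      LabeledPoint(lp.label, Vectors.dense(lp.features.toArray.map(x => (x / 16).floor)))
    }

    // Fit the selector on the labeled, discretized points.
    val selector = new ChiSqSelector(50)
    val transformer = selector.fit(discretizedData)

    // transform() works on RDD[Vector], so split labels and features first.
    val labels = discretizedData.map(_.label)
    val features = discretizedData.map(_.features)

    // Zip the reduced feature vectors back with their labels.
    val filteredData = labels.zip(transformer.transform(features)).map {
      case (label, selected) => LabeledPoint(label, selected)
    }
    ```

    Zipping is safe here because the transformed RDD is derived from `features` by a map, so both sides keep the same partitioning and element order. Calling the single-Vector transform inside one `map` over the labeled points would be an alternative that avoids the zip.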



