GitHub user jkbradley opened a pull request:
https://github.com/apache/spark/pull/3000
[SPARK-4081] [mllib] DatasetIndexer
This introduces a DatasetIndexer class which does the following:
* fit(): collect statistics about how many values each feature in a dataset
(RDD[Vector]) can take
* getCategoricalFeatureIndexes(): use the statistics to choose (a) which
features should be treated as categorical vs. continuous and (b) 0-based
indices for categorical feature values
* transform(): use the result from getCategoricalFeatureIndexes() to
re-index categorical feature values
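To make the three steps concrete, here is a minimal sketch on plain Scala collections rather than RDD[Vector]. It is not the code in this PR: the object name `IndexerSketch`, the `Row` alias, the function signatures, the "at most maxCategories distinct values" rule, and the sorted ordering of category indices are all assumptions made for illustration.

```scala
object IndexerSketch {
  // Hypothetical stand-in for a dense Vector: one Double per feature.
  type Row = Array[Double]

  // fit(): collect the set of distinct values observed for each feature.
  def fit(data: Seq[Row]): Map[Int, Set[Double]] = {
    val numFeatures = data.headOption.map(_.length).getOrElse(0)
    (0 until numFeatures).map(j => j -> data.map(_(j)).toSet).toMap
  }

  // Choose categorical features (assumed rule: at most maxCategories distinct
  // values) and assign each of their values a 0-based index in sorted order.
  def categoricalFeatureIndexes(
      stats: Map[Int, Set[Double]],
      maxCategories: Int): Map[Int, Map[Double, Int]] =
    stats.collect {
      case (j, values) if values.size <= maxCategories =>
        val sorted = values.toArray
        java.util.Arrays.sort(sorted)
        j -> sorted.zipWithIndex.toMap
    }

  // transform(): replace categorical values by their indices and leave
  // continuous features untouched. An unseen categorical value throws,
  // since unknown values are not handled in this sketch.
  def transform(data: Seq[Row], indexes: Map[Int, Map[Double, Int]]): Seq[Row] =
    data.map { row =>
      row.zipWithIndex.map { case (v, j) =>
        indexes.get(j).map(m => m(v).toDouble).getOrElse(v)
      }
    }
}
```

With four rows whose first feature takes the values 0.0, 2.0, 0.0, 5.0 and whose second feature takes four distinct values, calling `categoricalFeatureIndexes(stats, 3)` in this sketch marks only the first feature as categorical, and `transform` rewrites 2.0 to 1.0 and 5.0 to 2.0 while leaving the second feature unchanged.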
Currently, this kind of re-indexing is done ad hoc (e.g., for
labels in DecisionTreeRunner). This PR attempts to standardize it.
The basic usage pattern is:
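```
val myData1: RDD[Vector] = ...
val myData2: RDD[Vector] = ...
// maxCategories bounds how many distinct values a categorical feature may have
val datasetIndexer = new DatasetIndexer(maxCategories)
// Collect per-feature value statistics, then re-index the same data
datasetIndexer.fit(myData1)
val indexedData1: RDD[Vector] = datasetIndexer.transform(myData1)
datasetIndexer.fit(myData2)
val indexedData2: RDD[Vector] = datasetIndexer.transform(myData2)
// Retrieve the indexing chosen for the categorical features
val categoricalFeaturesInfo: Map[Double, Int] =
  datasetIndexer.getCategoricalFeatureIndexes()
```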
Design notes:
* This maintains sparsity in vectors by ensuring that categorical feature
value 0.0 gets index 0.
* This does not yet support transforming data with new (unknown)
categorical feature values. That can be added later.
* This does not take advantage of sparsity in the input during fit(); it
could be more efficient when given SparseVectors.
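The first design note can be sketched as follows. This is an illustration only, not the PR's implementation: `SparsitySketch`, the `Sparse` case class, and both function signatures are hypothetical. The idea is that if value 0.0 is always assigned index 0, the entries a sparse vector does not store (implicit zeros) are already correct after re-indexing, so only the stored entries need rewriting.

```scala
object SparsitySketch {
  // Hypothetical minimal sparse vector: only nonzero entries are stored.
  final case class Sparse(size: Int, indices: Array[Int], values: Array[Double])

  // Assign 0-based indices to one feature's distinct values,
  // forcing 0.0 -> 0 whenever 0.0 is among them.
  def indexValues(distinct: Set[Double]): Map[Double, Int] = {
    val rest = distinct.filter(_ != 0.0).toArray
    java.util.Arrays.sort(rest)
    val ordered = if (distinct.contains(0.0)) 0.0 +: rest else rest
    ordered.zipWithIndex.toMap
  }

  // Re-index a sparse vector by rewriting only its stored entries.
  // Features without an entry in `indexes` are treated as continuous.
  def transform(v: Sparse, indexes: Map[Int, Map[Double, Int]]): Sparse = {
    val newValues = v.indices.zip(v.values).map { case (j, x) =>
      indexes.get(j).map(m => m(x).toDouble).getOrElse(x)
    }
    v.copy(values = newValues)
  }
}
```

Note that a plain sorted ordering would give a negative value index 0 and push 0.0 to a nonzero index, which would force every implicit zero to be materialized; placing 0.0 first avoids that.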
CC: @mengxr @manishamde @codedeft This should be helpful for
DecisionTree and RandomForest.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/jkbradley/spark indexer
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/3000.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #3000
----
commit 827518d072dc03d621c4915873468248d2925cc2
Author: Joseph K. Bradley <[email protected]>
Date: 2014-10-23T17:35:42Z
working on DatasetIndexer
commit faa0ea71f5a44b9dc8fd4a6c7dc1f7674ca32772
Author: Joseph K. Bradley <[email protected]>
Date: 2014-10-27T18:08:16Z
partly done with DatasetIndexerSuite
commit 15cc344bc6b7bef36fb81fb542ffb15d914cf7fe
Author: Joseph K. Bradley <[email protected]>
Date: 2014-10-27T23:07:55Z
Merge remote-tracking branch 'upstream/master' into indexer
commit a2957b536ea25150a74507ebc6fda69230762a35
Author: Joseph K. Bradley <[email protected]>
Date: 2014-10-27T23:08:14Z
DatasetIndexer now passes tests
commit 228fac6aec115beda8af15526b79f77f2a74023a
Author: Joseph K. Bradley <[email protected]>
Date: 2014-10-28T17:27:49Z
Added another test for DatasetIndexer
commit a27e3b55629f0c8cee50cc6ddb2fde609fc0330c
Author: Joseph K. Bradley <[email protected]>
Date: 2014-10-29T02:47:33Z
Merge remote-tracking branch 'upstream/master' into indexer
commit b9c43feb374584ebeee37f678b895844dc388e0d
Author: Joseph K. Bradley <[email protected]>
Date: 2014-10-29T18:44:19Z
DatasetIndexer now maintains sparsity in SparseVector
commit fc781bdd5325e2a746b99d50d669de07351954fe
Author: Joseph K. Bradley <[email protected]>
Date: 2014-10-29T19:02:52Z
Merge remote-tracking branch 'upstream/master' into indexer
----