[GitHub] spark pull request: [SPARK-4081] [mllib] DatasetIndexer

jkbradley Wed, 29 Oct 2014 12:11:37 -0700

GitHub user jkbradley opened a pull request:

    https://github.com/apache/spark/pull/3000


    [SPARK-4081] [mllib] DatasetIndexer

    This introduces a DatasetIndexer class which does the following:
    * fit(): collect statistics about how many values each feature in a dataset
      (RDD[Vector]) can take
    * getCategoricalFeatureIndexes(): use the statistics to choose (a) which
      features should be treated as categorical vs. continuous and (b) 0-based
      indices for categorical feature values
    * transform(): use the result from getCategoricalFeatureIndexes() to
      re-index categorical feature values
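    
    The three steps above can be illustrated without Spark. The following is a
    minimal sketch, not the code in this PR: plain Scala collections stand in
    for RDD[Vector], and the names IndexerSketch, fit, categoricalFeatureIndexes
    and transform are hypothetical. It assumes that a feature counts as
    categorical when it has at most maxCategories distinct values, that indices
    follow sorted value order with 0.0 moved to the front, and that the result
    is a nested map from feature index to (value -> index).
    
    ```scala
    // Sketch only: Seq[Array[Double]] stands in for RDD[Vector]. All names and
    // the "at most maxCategories distinct values" rule are assumptions.
    object IndexerSketch {
      /** For each feature, collect the distinct values seen in the data. */
      def fit(data: Seq[Array[Double]]): Array[Set[Double]] = {
        val numFeatures = data.head.length
        Array.tabulate(numFeatures)(j => data.map(row => row(j)).toSet)
      }
    
      /** Features with at most maxCategories distinct values are treated as
        * categorical. Each one gets a value -> 0-based index map in which
        * 0.0, if present, always maps to index 0. */
      def categoricalFeatureIndexes(
          stats: Array[Set[Double]],
          maxCategories: Int): Map[Int, Map[Double, Int]] = {
        stats.zipWithIndex.collect {
          case (values, j) if values.size <= maxCategories =>
            val (zeros, rest) = values.toSeq.sorted.partition(_ == 0.0)
            j -> (zeros ++ rest).zipWithIndex.toMap
        }.toMap
      }
    
      /** Re-index categorical values; continuous features pass through. */
      def transform(
          data: Seq[Array[Double]],
          indexes: Map[Int, Map[Double, Int]]): Seq[Array[Double]] =
        data.map(_.zipWithIndex.map { case (v, j) =>
          indexes.get(j) match {
            case Some(valueToIndex) => valueToIndex(v).toDouble
            case None => v
          }
        })
    }
    ```
    
    With maxCategories = 3, a feature taking the values {0.0, 2.0, 5.0} would be
    re-indexed to {0.0, 1.0, 2.0}, while a feature with four or more distinct
    values would be left unchanged.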
    
    Currently, this kind of functionality is done on an ad-hoc basis (e.g., for
    labels in DecisionTreeRunner).  This attempts to standardize it.
    
    The basic usage pattern is:
    ```
    val myData1: RDD[Vector] = ...
    val myData2: RDD[Vector] = ...
    val datasetIndexer = new DatasetIndexer(maxCategories)
    datasetIndexer.fit(myData1)
    val indexedData1: RDD[Vector] = datasetIndexer.transform(myData1)
    datasetIndexer.fit(myData2)
    val indexedData2: RDD[Vector] = datasetIndexer.transform(myData2)
    val categoricalFeaturesInfo: Map[Double, Int] = datasetIndexer.getCategoricalFeatureIndexes()
    ```
    
    Design notes:
    * This maintains sparsity in vectors by ensuring that categorical feature
      value 0.0 gets index 0.
    * This does not yet support transforming data with new (unknown)
      categorical feature values.  That can be added later.
    * This does not take advantage of sparsity in the input during fit(); it
      could be more efficient when given SparseVectors.
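    
    To see why mapping 0.0 to index 0 preserves sparsity, here is a small
    sketch under stated assumptions. SparseVec and reindexSparse are
    hypothetical names, not MLlib's SparseVector API, and the nested map shape
    (feature index to value -> index) is also an assumption. Because every
    categorical map sends 0.0 to 0, the implicit zeros of a sparse vector are
    already correct, so only the explicitly stored entries need to be visited.
    
    ```scala
    // Sketch only: a sparse vector modeled as parallel arrays of indices and
    // stored values. Hypothetical type, not org.apache.spark.mllib.linalg.
    final case class SparseVec(size: Int, indices: Array[Int], values: Array[Double])
    
    object SparseSketch {
      /** Re-index only the stored entries. Assumes each categorical map sends
        * 0.0 to 0, so unstored (zero) entries need no change. A value that is
        * missing from a map throws NoSuchElementException, mirroring the note
        * that unknown categorical values are not yet supported. */
      def reindexSparse(
          v: SparseVec,
          indexes: Map[Int, Map[Double, Int]]): SparseVec = {
        val newValues = v.indices.zip(v.values).map { case (j, x) =>
          indexes.get(j) match {
            case Some(valueToIndex) => valueToIndex(x).toDouble
            case None => x // continuous feature: keep the value as-is
          }
        }
        v.copy(values = newValues)
      }
    }
    ```
    
    The output keeps the same indices array as the input, so the number of
    stored entries never grows.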
    
    CC: @mengxr  @manishamde  @codedeft  This should be helpful for
    DecisionTree and RandomForest.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/jkbradley/spark indexer

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/3000.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #3000
    
----
commit 827518d072dc03d621c4915873468248d2925cc2
Author: Joseph K. Bradley <[email protected]>
Date:   2014-10-23T17:35:42Z

    working on DatasetIndexer

commit faa0ea71f5a44b9dc8fd4a6c7dc1f7674ca32772
Author: Joseph K. Bradley <[email protected]>
Date:   2014-10-27T18:08:16Z

    partly done with DatasetIndexerSuite

commit 15cc344bc6b7bef36fb81fb542ffb15d914cf7fe
Author: Joseph K. Bradley <[email protected]>
Date:   2014-10-27T23:07:55Z

    Merge remote-tracking branch 'upstream/master' into indexer

commit a2957b536ea25150a74507ebc6fda69230762a35
Author: Joseph K. Bradley <[email protected]>
Date:   2014-10-27T23:08:14Z

    DatasetIndexer now passes tests

commit 228fac6aec115beda8af15526b79f77f2a74023a
Author: Joseph K. Bradley <[email protected]>
Date:   2014-10-28T17:27:49Z

    Added another test for DatasetIndexer

commit a27e3b55629f0c8cee50cc6ddb2fde609fc0330c
Author: Joseph K. Bradley <[email protected]>
Date:   2014-10-29T02:47:33Z

    Merge remote-tracking branch 'upstream/master' into indexer

commit b9c43feb374584ebeee37f678b895844dc388e0d
Author: Joseph K. Bradley <[email protected]>
Date:   2014-10-29T18:44:19Z

    DatasetIndexer now maintains sparsity in SparseVector

commit fc781bdd5325e2a746b99d50d669de07351954fe
Author: Joseph K. Bradley <[email protected]>
Date:   2014-10-29T19:02:52Z

    Merge remote-tracking branch 'upstream/master' into indexer

----

