Joseph K. Bradley created SPARK-4081:
----------------------------------------

             Summary: Categorical feature indexing
                 Key: SPARK-4081
                 URL: https://issues.apache.org/jira/browse/SPARK-4081
             Project: Spark
          Issue Type: New Feature
          Components: MLlib
    Affects Versions: 1.1.0
            Reporter: Joseph K. Bradley
            Priority: Minor


DecisionTree and RandomForest require that categorical features and labels be 
indexed 0, 1, 2, ....  There is currently no code to help with indexing a 
dataset.  This is a proposal for a helper class that computes these indices and 
also decides which features to treat as categorical.

Proposed functionality:
* This helps convert a dataset of vectors with unknown feature types into a 
dataset with some continuous features and some categorical features. The choice 
between continuous and categorical is based upon a maxCategories parameter.
* This can also map categorical feature values to 0-based indices.

Usage:
{code}
val myData1: RDD[Vector] = ...
val myData2: RDD[Vector] = ...
val datasetIndexer = new DatasetIndexer(maxCategories)
// Fit on myData1, then index it.
datasetIndexer.fit(myData1)
val indexedData1: RDD[Vector] = datasetIndexer.transform(myData1)
// Fit on myData2, then index it.
datasetIndexer.fit(myData2)
val indexedData2: RDD[Vector] = datasetIndexer.transform(myData2)
// Retrieve the categorical features info computed by the indexer.
val categoricalFeaturesInfo: Map[Int, Int] =
  datasetIndexer.getCategoricalFeaturesInfo()
{code}
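
To make the intended behavior more concrete, here is a minimal, Spark-free sketch of what fit and transform might do. Everything in it is an assumption for illustration, not the proposed implementation: the class name SimpleDatasetIndexer is hypothetical, rows are plain Array[Double] instead of RDD[Vector], a feature is assumed to be categorical when it has at most maxCategories distinct values, and indices are assumed to follow the sorted order of the original values.

```scala
// Hypothetical sketch of the indexing logic; not the proposed DatasetIndexer itself.
final class SimpleDatasetIndexer(maxCategories: Int) {
  // feature index -> (original value -> 0-based category index)
  private var categoryMaps: Map[Int, Map[Double, Int]] = Map.empty

  /** Collect distinct values per feature and decide which features are categorical. */
  def fit(data: Seq[Array[Double]]): Unit = {
    require(data.nonEmpty, "data must not be empty")
    val numFeatures = data.head.length
    categoryMaps = (0 until numFeatures).flatMap { j =>
      val distinctValues = data.map(row => row(j)).distinct.sorted
      // Assumption: categorical means at most maxCategories distinct values.
      if (distinctValues.size <= maxCategories) Some(j -> distinctValues.zipWithIndex.toMap)
      else None
    }.toMap
  }

  /** Replace categorical values with their indices; continuous features pass through.
    * A categorical value that was not seen during fit throws NoSuchElementException. */
  def transform(data: Seq[Array[Double]]): Seq[Array[Double]] =
    data.map { row =>
      row.zipWithIndex.map { case (value, j) =>
        categoryMaps.get(j).map(m => m(value).toDouble).getOrElse(value)
      }
    }

  /** feature index -> number of categories. */
  def getCategoricalFeaturesInfo: Map[Int, Int] =
    categoryMaps.map { case (j, m) => j -> m.size }
}

// Feature 0 has 3 distinct values (categorical); feature 1 has 4 (continuous).
val data = Seq(Array(10.0, 0.5), Array(20.0, 1.5), Array(10.0, 2.5), Array(30.0, 3.5))
val indexer = new SimpleDatasetIndexer(maxCategories = 3)
indexer.fit(data)
val indexed = indexer.transform(data)
val info = indexer.getCategoricalFeaturesInfo
```

With maxCategories = 3, feature 0 is re-indexed (10.0, 20.0, 30.0 become 0.0, 1.0, 2.0) while feature 1 is left unchanged. How unseen values, sparse vectors, and repeated calls to fit should behave is left open here.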



