[ https://issues.apache.org/jira/browse/SPARK-4081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14392055#comment-14392055 ]
Joseph K. Bradley commented on SPARK-4081:
------------------------------------------

Note: I am adding this to the spark.ml API since it is friendlier for feature transformers; I will not add it to the spark.mllib API.

> Categorical feature indexing
> ----------------------------
>
>                 Key: SPARK-4081
>                 URL: https://issues.apache.org/jira/browse/SPARK-4081
>             Project: Spark
>          Issue Type: New Feature
>          Components: MLlib
>    Affects Versions: 1.1.0
>            Reporter: Joseph K. Bradley
>            Assignee: Joseph K. Bradley
>            Priority: Minor
>
> DecisionTree and RandomForest require that categorical features and labels be
> indexed 0,1,2.... There is currently no code to aid with indexing a dataset.
> This is a proposal for a helper class for computing indices (and also
> deciding which features to treat as categorical).
> Proposed functionality:
> * This helps process a dataset of unknown vectors into a dataset with some
> continuous features and some categorical features. The choice between
> continuous and categorical is based upon a maxCategories parameter.
> * This can also map categorical feature values to 0-based indices.
> Usage:
> {code}
> val myData1: RDD[Vector] = ...
> val myData2: RDD[Vector] = ...
> val datasetIndexer = new DatasetIndexer(maxCategories)
> datasetIndexer.fit(myData1)
> val indexedData1: RDD[Vector] = datasetIndexer.transform(myData1)
> datasetIndexer.fit(myData2)
> val indexedData2: RDD[Vector] = datasetIndexer.transform(myData2)
> val categoricalFeaturesInfo: Map[Double, Int] =
>   datasetIndexer.getCategoricalFeatureIndexes()
> {code}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
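
For readers who want to see the proposed behavior concretely, here is a minimal, Spark-free sketch of the two steps the description names: choosing categorical features by a maxCategories threshold, and mapping their values to 0-based indices. It is not the DatasetIndexer from the proposal. The object name CategoricalIndexingSketch, its method names, the use of Seq[Array[Double]] in place of RDD[Vector], and the choice to assign indices in ascending order of value are all assumptions made for illustration. The summary map is keyed by feature index (Map[Int, Int]), which is also an assumption and differs from the Map[Double, Int] in the quoted usage.

```scala
// Hypothetical sketch only: plain Scala collections stand in for RDD[Vector],
// and none of these names come from Spark or from the proposal.
object CategoricalIndexingSketch {

  /** For each categorical feature index: original value -> 0-based category index. */
  type CategoryMaps = Map[Int, Map[Double, Int]]

  /** A feature is treated as categorical if it has at most maxCategories distinct values. */
  def fit(data: Seq[Array[Double]], maxCategories: Int): CategoryMaps = {
    require(data.nonEmpty, "need at least one vector")
    val numFeatures = data.head.length
    require(data.forall(_.length == numFeatures), "all vectors must have the same length")
    (0 until numFeatures).flatMap { j =>
      // Assumption: indices are assigned in ascending order of the original values.
      val distinctValues = data.map(v => v(j)).distinct.sortWith(_ < _)
      if (distinctValues.size <= maxCategories) Some(j -> distinctValues.zipWithIndex.toMap)
      else None
    }.toMap
  }

  /** Replace categorical values by their indices; continuous features pass through unchanged. */
  def transform(data: Seq[Array[Double]], maps: CategoryMaps): Seq[Array[Double]] =
    data.map { v =>
      v.zipWithIndex.map { case (x, j) =>
        // Throws NoSuchElementException for a categorical value not seen in fit.
        maps.get(j).map(m => m(x).toDouble).getOrElse(x)
      }
    }

  /** Feature index -> number of categories, for the categorical features only. */
  def categoricalFeaturesInfo(maps: CategoryMaps): Map[Int, Int] =
    maps.map { case (j, m) => j -> m.size }
}
```

With four vectors (0.0, 10.5), (2.0, 3.7), (5.0, 8.1), (2.0, 1.2) and maxCategories = 3, feature 0 has three distinct values and is indexed as 0.0 -> 0, 2.0 -> 1, 5.0 -> 2, while feature 1 has four distinct values and is left continuous. A value that appears only at transform time is not handled here; a real implementation would need a policy for that case.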