Hello Wei,
Thanks, I should have checked the data. My data has this format: |col1|col2|col3|label|
so it looks like I cannot use VectorIndexer directly (it accepts a Vector column). I am guessing what I should do is something like this (given I have a few categorical features):

val assembler = new VectorAssembler()
  .setInputCols(inputData.columns.filter(_ != "label"))
  .setOutputCol("features")
val transformedData = assembler.transform(inputData)
val featureIndexer = new VectorIndexer()
  .setInputCol("features")
  .setOutputCol("indexedFeatures")
  .setMaxCategories(5) // features with > 5 distinct values are treated as continuous.
  .fit(transformedData)

?
Apologies for the basic question, but the last time I worked on an ML project I was using Spark 1.x.

kr
 marco

On Dec 16, 2017 1:24 PM, "Weichen Xu" <weichen...@databricks.com> wrote:

> Hi, Marco,
>
> val data = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")
>
> The data now includes a feature column with the name "features":
>
> val featureIndexer = new VectorIndexer()
>   .setInputCol("features")   // <------ Here specify the "features" column to index.
>   .setOutputCol("indexedFeatures")
>
> Thanks.
>
> On Sat, Dec 16, 2017 at 6:26 AM, Marco Mistroni <mmistr...@gmail.com> wrote:
>
>> Hi all,
>> I am trying to run a sample decision tree, following the examples here (for MLlib):
>>
>> https://spark.apache.org/docs/latest/ml-classification-regression.html#decision-tree-classifier
>>
>> The example seems to use a VectorIndexer; however, I am missing something.
>> How does the featureIndexer know which columns are features?
>> Isn't there something missing? Or is the featureIndexer able to figure out
>> by itself which columns of the DataFrame are features?
>>
>> val labelIndexer = new StringIndexer()
>>   .setInputCol("label")
>>   .setOutputCol("indexedLabel")
>>   .fit(data)
>> // Automatically identify categorical features, and index them.
>> val featureIndexer = new VectorIndexer()
>>   .setInputCol("features")
>>   .setOutputCol("indexedFeatures")
>>   .setMaxCategories(4) // features with > 4 distinct values are treated as continuous.
>>   .fit(data)
>>
>> Using this code I am getting back this exception:
>>
>> Exception in thread "main" java.lang.IllegalArgumentException: Field "features" does not exist.
>>   at org.apache.spark.sql.types.StructType$$anonfun$apply$1.apply(StructType.scala:266)
>>   at org.apache.spark.sql.types.StructType$$anonfun$apply$1.apply(StructType.scala:266)
>>   at scala.collection.MapLike$class.getOrElse(MapLike.scala:128)
>>   at scala.collection.AbstractMap.getOrElse(Map.scala:59)
>>   at org.apache.spark.sql.types.StructType.apply(StructType.scala:265)
>>   at org.apache.spark.ml.util.SchemaUtils$.checkColumnType(SchemaUtils.scala:40)
>>   at org.apache.spark.ml.feature.VectorIndexer.transformSchema(VectorIndexer.scala:141)
>>   at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:74)
>>   at org.apache.spark.ml.feature.VectorIndexer.fit(VectorIndexer.scala:118)
>>
>> What am I missing?
>>
>> w/kindest regards,
>>
>> marco
>>
>>
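
For reference, the approach discussed in this thread (assembling raw columns into a "features" Vector column before VectorIndexer can index it) can be sketched end to end as below. This is an illustration, not code from the thread: the toy DataFrame, the column names col1-col3/label, and the local SparkSession setup are all assumptions for the example.

```scala
import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler, VectorIndexer}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("vi-example").getOrCreate()
import spark.implicits._

// Toy DataFrame matching the |col1|col2|col3|label| layout described above.
val inputData = Seq(
  (1.0, 0.0, 10.5, "a"),
  (2.0, 1.0, 11.0, "b"),
  (1.0, 0.0, 12.3, "a")
).toDF("col1", "col2", "col3", "label")

// 1. Assemble every non-label column into a single Vector column named "features".
val assembler = new VectorAssembler()
  .setInputCols(inputData.columns.filter(_ != "label"))
  .setOutputCol("features")
val assembled = assembler.transform(inputData)

// 2. Index the string label column (VectorIndexer handles only the feature vector).
val labelIndexer = new StringIndexer()
  .setInputCol("label")
  .setOutputCol("indexedLabel")
  .fit(assembled)

// 3. VectorIndexer can now find a "features" column of Vector type,
//    avoiding the "Field \"features\" does not exist" error from the thread.
val featureIndexer = new VectorIndexer()
  .setInputCol("features")
  .setOutputCol("indexedFeatures")
  .setMaxCategories(5) // features with > 5 distinct values are treated as continuous
  .fit(assembled)

val indexed = featureIndexer.transform(labelIndexer.transform(assembled))
```

The key point, as Weichen notes, is that the libsvm loader produces a "features" column automatically, whereas a plain columnar DataFrame needs the VectorAssembler step first.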