Hello Wei
Thanks, I should have checked the data.
My data has this format:
|col1|col2|col3|label|

so it looks like I cannot use VectorIndexer directly (it expects a Vector
column).
I am guessing what I should do is something like this (given I have a few
categorical features):

val assembler = new VectorAssembler()
  .setInputCols(inputData.columns.filter(_ != "label")) // column name is lowercase "label"
  .setOutputCol("features")

val transformedData = assembler.transform(inputData)

val featureIndexer = new VectorIndexer()
  .setInputCol("features")
  .setOutputCol("indexedFeatures")
  .setMaxCategories(5) // features with > 5 distinct values are treated as continuous
  .fit(transformedData)

?
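As a side note on the maxCategories comment above: VectorIndexer treats a feature as categorical only when its number of distinct values is at most maxCategories; otherwise it leaves it as continuous. A minimal plain-Scala sketch of that decision rule (no Spark required; the object and feature values below are hypothetical, just for illustration):

```scala
// Sketch of the per-feature rule VectorIndexer applies:
// a feature is indexed as categorical iff distinct-value count <= maxCategories.
object MaxCategoriesRule {
  def isCategorical(values: Seq[Double], maxCategories: Int): Boolean =
    values.distinct.size <= maxCategories

  def main(args: Array[String]): Unit = {
    val maxCategories = 5
    val fewLevels  = Seq(0.0, 1.0, 0.0, 1.0, 2.0)            // 3 distinct values
    val manyLevels = Seq(0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7)  // 7 distinct values
    println(isCategorical(fewLevels, maxCategories))  // categorical
    println(isCategorical(manyLevels, maxCategories)) // continuous
  }
}
```

So with setMaxCategories(5), a column holding only a handful of category codes gets indexed, while anything with more than 5 distinct values passes through unchanged.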
Apologies for the basic question, but the last time I worked on an ML project
I was using Spark 1.x.

kr
 marco

On Dec 16, 2017 1:24 PM, "Weichen Xu" <weichen...@databricks.com> wrote:

> Hi, Marco,
>
> val data = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")
>
> The loaded data now includes a feature column named "features".
>
> val featureIndexer = new VectorIndexer()
>   .setInputCol("features")   // <------ here, specify the "features" column to index
>   .setOutputCol("indexedFeatures")
>
>
> Thanks.
>
>
> On Sat, Dec 16, 2017 at 6:26 AM, Marco Mistroni <mmistr...@gmail.com>
> wrote:
>
>> Hi all,
>> I am trying to run a sample decision tree, following the examples here (for
>> MLlib):
>>
>> https://spark.apache.org/docs/latest/ml-classification-regression.html#decision-tree-classifier
>>
>> The example seems to use a VectorIndexer, however I am missing something.
>> How does the featureIndexer know which columns are features?
>> Isn't there something missing? Or is the featureIndexer able to figure out
>> by itself which columns of the DataFrame are features?
>>
>> val labelIndexer = new StringIndexer()
>>   .setInputCol("label")
>>   .setOutputCol("indexedLabel")
>>   .fit(data)
>>
>> // Automatically identify categorical features, and index them.
>> val featureIndexer = new VectorIndexer()
>>   .setInputCol("features")
>>   .setOutputCol("indexedFeatures")
>>   .setMaxCategories(4) // features with > 4 distinct values are treated as continuous
>>   .fit(data)
>>
>> Using this code I am getting back this exception:
>>
>> Exception in thread "main" java.lang.IllegalArgumentException: Field "features" does not exist.
>>         at org.apache.spark.sql.types.StructType$$anonfun$apply$1.apply(StructType.scala:266)
>>         at org.apache.spark.sql.types.StructType$$anonfun$apply$1.apply(StructType.scala:266)
>>         at scala.collection.MapLike$class.getOrElse(MapLike.scala:128)
>>         at scala.collection.AbstractMap.getOrElse(Map.scala:59)
>>         at org.apache.spark.sql.types.StructType.apply(StructType.scala:265)
>>         at org.apache.spark.ml.util.SchemaUtils$.checkColumnType(SchemaUtils.scala:40)
>>         at org.apache.spark.ml.feature.VectorIndexer.transformSchema(VectorIndexer.scala:141)
>>         at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:74)
>>         at org.apache.spark.ml.feature.VectorIndexer.fit(VectorIndexer.scala:118)
>>
>> What am I missing?
>>
>> w/kindest regards,
>>
>>  marco
>>
>>
>
