[ https://issues.apache.org/jira/browse/SPARK-12965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Joshua Taylor updated SPARK-12965:
----------------------------------
    Description: 
The setInputCol() method doesn't seem to resolve column names in the same way that other methods do. E.g., given a DataFrame df, {{df.col("`a.b`")}} will return a column. On a StringIndexer indexer, {{indexer.setInputCol("`a.b`")}} leads to an indexer where fitting and transforming seem to have no effect. Running the following code produces:

{noformat}
+---+---+--------+
|a.b|a_b|a_bIndex|
+---+---+--------+
|foo|foo|     0.0|
|bar|bar|     1.0|
+---+---+--------+
{noformat}

but I think it should have another column, {{abIndex}}, with the same contents as a_bIndex.

{code}
import java.util.Arrays;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.ml.feature.StringIndexer;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SQLContext;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

public class SparkMLDotColumn {
    public static void main(String[] args) {
        // Get the contexts
        SparkConf conf = new SparkConf()
                .setMaster("local[*]")
                .setAppName("test")
                .set("spark.ui.enabled", "false");
        JavaSparkContext sparkContext = new JavaSparkContext(conf);
        SQLContext sqlContext = new SQLContext(sparkContext);

        // Create a schema with a single string column named "a.b"
        StructType schema = new StructType(new StructField[] {
                DataTypes.createStructField("a.b", DataTypes.StringType, false)
        });

        // Create a two-row RDD and a DataFrame from it
        List<Row> rows = Arrays.asList(RowFactory.create("foo"), RowFactory.create("bar"));
        JavaRDD<Row> rdd = sparkContext.parallelize(rows);
        DataFrame df = sqlContext.createDataFrame(rdd, schema);

        // Copy "a.b" into a dot-free column; indexing the copy works as expected
        df = df.withColumn("a_b", df.col("`a.b`"));

        StringIndexer indexer0 = new StringIndexer();
        indexer0.setInputCol("a_b");
        indexer0.setOutputCol("a_bIndex");
        df = indexer0.fit(df).transform(df);

        // Indexing via the backtick-escaped original name appears to have no effect
        StringIndexer indexer1 = new StringIndexer();
        indexer1.setInputCol("`a.b`");
        indexer1.setOutputCol("abIndex");
        df = indexer1.fit(df).transform(df);

        df.show();
    }
}
{code}
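As a possible workaround until setInputCol() resolves backtick-escaped names, dotted columns could be renamed up front so the indexer never sees a name that needs escaping. The sketch below is illustrative only (it is not part of the attached reproducer, and the {{sanitized}} variable name is made up); it relies on {{DataFrame.withColumnRenamed}}, which takes the existing name without backticks:

{code}
// Workaround sketch (assumption, not from the original report): rename any
// column whose name contains a dot so StringIndexer can reference it directly.
DataFrame sanitized = df;
for (String name : df.columns()) {
    if (name.contains(".")) {
        // withColumnRenamed takes the plain existing name, no backticks needed
        sanitized = sanitized.withColumnRenamed(name, name.replace('.', '_'));
    }
}

// The renamed column can now be indexed like any other column
StringIndexer indexer = new StringIndexer();
indexer.setInputCol("a_b");        // renamed form of "a.b"
indexer.setOutputCol("a_bIndex");
sanitized = indexer.fit(sanitized).transform(sanitized);
sanitized.show();
{code}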
was:
The setInputCol() method doesn't seem to resolve column names in the same way that other methods do. E.g., given a DataFrame df, {{df.col("`a.b`")}} will return a column. On a StringIndexer indexer, {{indexer.setInputCol("`a.b`")}} leads to an indexer where fitting and transforming seem to have no effect. Running the attached code produces:

{noformat}
+---+---+--------+
|a.b|a_b|a_bIndex|
+---+---+--------+
|foo|foo|     0.0|
|bar|bar|     1.0|
+---+---+--------+
{noformat}

but I think it should have another column, {{abIndex}}, with the same contents as a_bIndex.


> Indexer setInputCol() doesn't resolve column names like DataFrame.col()
> ------------------------------------------------------------------------
>
>                 Key: SPARK-12965
>                 URL: https://issues.apache.org/jira/browse/SPARK-12965
>             Project: Spark
>          Issue Type: Bug
>          Components: ML, Spark Core
>    Affects Versions: 1.6.0
>            Reporter: Joshua Taylor
>         Attachments: SparkMLDotColumn.java
>

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org