[ https://issues.apache.org/jira/browse/SPARK-12965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Joshua Taylor updated SPARK-12965:
----------------------------------
    Description: 
The setInputCol() method doesn't seem to resolve column names in the same way that other methods do. E.g., given a DataFrame df, {{df.col("`a.b`")}} will return a column. On a StringIndexer indexer, {{indexer.setInputCol("`a.b`")}} leads to an indexer where fitting and transforming seem to have no effect. Running the following code produces:

{noformat}
+---+---+--------+
|a.b|a_b|a_bIndex|
+---+---+--------+
|foo|foo|     0.0|
|bar|bar|     1.0|
+---+---+--------+
{noformat}

but I think it should have another column, {{abIndex}}, with the same contents as a_bIndex.

{code}
import java.util.Arrays;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.ml.feature.StringIndexer;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SQLContext;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

public class SparkMLDotColumn {
    public static void main(String[] args) {
        // Get the contexts
        SparkConf conf = new SparkConf()
                .setMaster("local[*]")
                .setAppName("test")
                .set("spark.ui.enabled", "false");
        JavaSparkContext sparkContext = new JavaSparkContext(conf);
        SQLContext sqlContext = new SQLContext(sparkContext);

        // Create a schema with a single string column named "a.b"
        StructType schema = new StructType(new StructField[] {
                DataTypes.createStructField("a.b", DataTypes.StringType, false)
        });

        // Create a two-row RDD and a DataFrame from it
        List<Row> rows = Arrays.asList(RowFactory.create("foo"), RowFactory.create("bar"));
        JavaRDD<Row> rdd = sparkContext.parallelize(rows);
        DataFrame df = sqlContext.createDataFrame(rdd, schema);

        // Copy "a.b" into a dot-free column; indexing the copy works as expected
        df = df.withColumn("a_b", df.col("`a.b`"));

        StringIndexer indexer0 = new StringIndexer();
        indexer0.setInputCol("a_b");
        indexer0.setOutputCol("a_bIndex");
        df = indexer0.fit(df).transform(df);

        // Indexing via the backtick-escaped original name appears to have no effect
        StringIndexer indexer1 = new StringIndexer();
        indexer1.setInputCol("`a.b`");
        indexer1.setOutputCol("abIndex");
        df = indexer1.fit(df).transform(df);

        df.show();
    }
}
{code}
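As a possible workaround until setInputCol() resolves backtick-escaped names, dotted columns could be renamed up front so the indexer never sees a name that needs escaping. The sketch below is illustrative only (it is not part of the attached reproducer, and the {{sanitized}} variable name is made up); it relies on {{DataFrame.withColumnRenamed}}, which takes the existing name without backticks:

{code}
// Workaround sketch (assumption, not from the original report): rename any
// column whose name contains a dot so StringIndexer can reference it directly.
DataFrame sanitized = df;
for (String name : df.columns()) {
    if (name.contains(".")) {
        // withColumnRenamed takes the plain existing name, no backticks needed
        sanitized = sanitized.withColumnRenamed(name, name.replace('.', '_'));
    }
}

// The renamed column can now be indexed like any other column
StringIndexer indexer = new StringIndexer();
indexer.setInputCol("a_b");        // renamed form of "a.b"
indexer.setOutputCol("a_bIndex");
sanitized = indexer.fit(sanitized).transform(sanitized);
sanitized.show();
{code}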
was:
The setInputCol() method doesn't seem to resolve column names in the same way that other methods do. E.g., given a DataFrame df, {{df.col("`a.b`")}} will return a column. On a StringIndexer indexer, {{indexer.setInputCol("`a.b`")}} leads to an indexer where fitting and transforming seem to have no effect. Running the attached code produces:

{noformat}
+---+---+--------+
|a.b|a_b|a_bIndex|
+---+---+--------+
|foo|foo|     0.0|
|bar|bar|     1.0|
+---+---+--------+
{noformat}

but I think it should have another column, {{abIndex}}, with the same contents as a_bIndex.


> Indexer setInputCol() doesn't resolve column names like DataFrame.col()
> ------------------------------------------------------------------------
>
>                 Key: SPARK-12965
>                 URL: https://issues.apache.org/jira/browse/SPARK-12965
>             Project: Spark
>          Issue Type: Bug
>          Components: ML, Spark Core
>    Affects Versions: 1.6.0
>            Reporter: Joshua Taylor
>         Attachments: SparkMLDotColumn.java
>

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org