[
https://issues.apache.org/jira/browse/SPARK-12965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15847786#comment-15847786
]
Joseph K. Bradley edited comment on SPARK-12965 at 2/1/17 12:43 AM:
--------------------------------------------------------------------
I'd say this is both a SQL and MLlib issue, where the MLlib issue is blocked by
the SQL one.
* SQL: {{schema}} handles periods/quotes inconsistently relative to the rest of
the Dataset API
* ML: StringIndexer could avoid using schema.fieldNames and instead use an API
provided by StructType for checking whether a field exists. That said, that
API still needs to be added to StructType... (a minimal sketch of the current
mismatch follows below)
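For illustration, here is a minimal sketch of the mismatch (my own throwaway class, not Spark source): {{Dataset.col}} parses the backticks as quoting, but a plain lookup against {{schema.fieldNames}}, which is roughly what StringIndexer does today, compares against the raw stored name and never matches the quoted form.
{code}
import java.util.Arrays;
import java.util.List;

import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

public class FieldLookupSketch {
    public static void main(String[] args) {
        // A schema with a single column whose name contains a dot.
        StructType schema = new StructType(new StructField[] {
                DataTypes.createStructField("a.b", DataTypes.StringType, false)
        });

        // fieldNames() stores the raw name, with no backticks...
        List<String> names = Arrays.asList(schema.fieldNames());
        System.out.println(names);                    // [a.b]

        // ...so the backtick-quoted form accepted by df.col("`a.b`") never
        // matches a name-by-name check, which is how the indexer ends up
        // ignoring the input column instead of resolving it.
        System.out.println(names.contains("`a.b`"));  // false
        System.out.println(names.contains("a.b"));    // true
    }
}
{code}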
I'm going to update this issue to be for ML only to handle fixing StringIndexer
and link to a separate JIRA for the SQL issue.
was (Author: josephkb):
I'd say this is both a SQL and MLlib issue.
I'm going to update this issue to be for ML only to handle fixing StringIndexer
and link to a separate JIRA for the SQL issue.
> Indexer setInputCol() doesn't resolve column names like DataFrame.col()
> -----------------------------------------------------------------------
>
> Key: SPARK-12965
> URL: https://issues.apache.org/jira/browse/SPARK-12965
> Project: Spark
> Issue Type: Bug
> Components: ML
> Affects Versions: 1.6.3, 2.0.2, 2.1.0, 2.2.0
> Reporter: Joshua Taylor
> Attachments: SparkMLDotColumn.java
>
>
> The setInputCol() method doesn't seem to resolve column names in the same way
> that other methods do. E.g., given a DataFrame {{df}}, {{df.col("`a.b`")}} will
> return a column, but on a StringIndexer {{indexer}},
> {{indexer.setInputCol("`a.b`")}} leads to an indexer whose fitting
> and transforming seem to have no effect. Running the following code produces:
> {noformat}
> +---+---+--------+
> |a.b|a_b|a_bIndex|
> +---+---+--------+
> |foo|foo|     0.0|
> |bar|bar|     1.0|
> +---+---+--------+
> {noformat}
> but I think it should have another column, {{abIndex}}, with the same contents
> as {{a_bIndex}}.
> {code}
> import java.util.Arrays;
> import java.util.List;
>
> import org.apache.spark.SparkConf;
> import org.apache.spark.api.java.JavaRDD;
> import org.apache.spark.api.java.JavaSparkContext;
> import org.apache.spark.ml.feature.StringIndexer;
> import org.apache.spark.sql.DataFrame;
> import org.apache.spark.sql.Row;
> import org.apache.spark.sql.RowFactory;
> import org.apache.spark.sql.SQLContext;
> import org.apache.spark.sql.types.DataTypes;
> import org.apache.spark.sql.types.StructField;
> import org.apache.spark.sql.types.StructType;
>
> public class SparkMLDotColumn {
>     public static void main(String[] args) {
>         // Get the contexts
>         SparkConf conf = new SparkConf()
>                 .setMaster("local[*]")
>                 .setAppName("test")
>                 .set("spark.ui.enabled", "false");
>         JavaSparkContext sparkContext = new JavaSparkContext(conf);
>         SQLContext sqlContext = new SQLContext(sparkContext);
>
>         // Create a schema with a single string column named "a.b"
>         StructType schema = new StructType(new StructField[] {
>                 DataTypes.createStructField("a.b", DataTypes.StringType, false)
>         });
>
>         // Create a two-row RDD and DataFrame
>         List<Row> rows = Arrays.asList(RowFactory.create("foo"), RowFactory.create("bar"));
>         JavaRDD<Row> rdd = sparkContext.parallelize(rows);
>         DataFrame df = sqlContext.createDataFrame(rdd, schema);
>
>         // Copy the dotted column to a plain name; indexing this copy works.
>         df = df.withColumn("a_b", df.col("`a.b`"));
>
>         StringIndexer indexer0 = new StringIndexer();
>         indexer0.setInputCol("a_b");
>         indexer0.setOutputCol("a_bIndex");
>         df = indexer0.fit(df).transform(df);
>
>         // Indexing the backtick-quoted column silently adds no output column.
>         StringIndexer indexer1 = new StringIndexer();
>         indexer1.setInputCol("`a.b`");
>         indexer1.setOutputCol("abIndex");
>         df = indexer1.fit(df).transform(df);
>
>         df.show();
>     }
> }
> {code}
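> For comparison, the output the description expects (a hand-written sketch, not
> actual program output) would carry an extra {{abIndex}} column mirroring
> {{a_bIndex}}:
> {noformat}
> +---+---+--------+-------+
> |a.b|a_b|a_bIndex|abIndex|
> +---+---+--------+-------+
> |foo|foo|     0.0|    0.0|
> |bar|bar|     1.0|    1.0|
> +---+---+--------+-------+
> {noformat}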