[ https://issues.apache.org/jira/browse/SPARK-17237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16040716#comment-16040716 ]

Alberto Fernández commented on SPARK-17237:
-------------------------------------------

Hi there,

I think this change introduced a regression in the way the "withColumnRenamed" 
method works. I can reproduce it with the following example:

{code}
dataframe = sql("SELECT * FROM db.table")
another_dataframe = sql("SELECT * FROM db.another_table")

result = (dataframe
        .join(another_dataframe, on=[...])
        .groupBy(...)  # pivot is only defined on grouped data
        .pivot("column_name", values=[0, 1])
        .max("column1", "column2")
        .withColumnRenamed("0_max(another_table.`column1`)", "name1")
        .withColumnRenamed("0_max(another_table.`column2`)", "name2"))
{code}

With Spark 2.1.0, the behaviour is the expected one (buggy, but expected): the 
columns don't get renamed.

With Spark 2.1.1, if this issue had really been resolved, you wouldn't need to 
change anything for the renames to work. However, the columns don't get renamed 
at all, because now you need to use the following names:

{code}
dataframe = sql("SELECT * FROM db.table")
another_dataframe = sql("SELECT * FROM db.another_table")

result = (dataframe
        .join(another_dataframe, on=[...])
        .groupBy(...)  # pivot is only defined on grouped data
        .pivot("column_name", values=[0, 1])
        .max("column1", "column2")
        .withColumnRenamed("0_max(column1)", "name1")
        .withColumnRenamed("0_max(column2)", "name2"))
{code}

As you can see, it seems that this PR somehow managed to remove the table name 
from the join context and also removed the backticks, thus introducing a 
breaking change.
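
For reference, the generated names follow the "<pivot value>_<aggregate>(<column>)" 
pattern. This is easy to see with the small repro from the original report below, 
rewritten here in PySpark (column names taken from that report; no join involved):

{code}
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([(2, 3, 4), (3, 4, 5)], ["a", "b", "c"])
pivoted = df.groupBy("a").pivot("b", values=[3, 4]).agg(F.count("c"), F.avg("c"))

print(pivoted.columns)
# ['a', '3_count(c)', '3_avg(c)', '4_count(c)', '4_avg(c)']
{code}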

I should also note that the original issue didn't happen when using JSON as the 
output format. It only shows up because Parquet doesn't support the "(" and ")" 
characters in column names, while in JSON they work just fine. Here is an 
example of the error thrown by Parquet after upgrading to Spark 2.1.1 without 
modifying your code:

{code}
Attribute name "0_max(column1)" contains invalid character(s) among " ,;{}()\n\t=". Please use alias to rename it.
{code}
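
As the message itself suggests, renaming the generated columns before the write 
avoids the Parquet restriction. A minimal sketch, continuing from the "pivoted" 
DataFrame above (the output path is hypothetical):

{code}
import re

# Replace every character Parquet rejects (" ,;{}()\n\t=") with an underscore,
# so a name like "0_max(column1)" becomes "0_max_column1_".
safe = pivoted
for name in pivoted.columns:
    safe = safe.withColumnRenamed(name, re.sub(r"[ ,;{}()\n\t=]", "_", name))

safe.write.parquet("/tmp/pivoted_output")  # hypothetical path
{code}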

I think the original issue was that parseAttributeName cannot handle the 
"table.column" notation, and as I understand it this PR still doesn't fix that 
issue, right?
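
For context, the attribute parser treats an unquoted dot as a table/column 
separator, so a column whose name literally contains a dot has to be wrapped in 
backticks to be resolved as a single attribute. A small illustration 
(hypothetical column name):

{code}
df = spark.createDataFrame([(1,)], ["the.column"])

# df.select("the.column")          # fails: parsed as table "the", column "column"
df.select("`the.column`").show()   # backticks make it one attribute name
{code}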

As a workaround, you can change your column renames to accommodate the new 
format (as in the second snippet above).
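
Alternatively, if the same code has to run on both 2.1.0 and 2.1.1, a rename by 
position avoids depending on the generated names entirely. A sketch, continuing 
from the "result" DataFrame above and assuming it has exactly one grouping 
column followed by the four pivoted aggregates:

{code}
# toDF rewrites all column names positionally, so this works regardless of
# how the aggregate columns are spelled in a given Spark version.
renamed = result.toDF(result.columns[0], "name1", "name2", "name3", "name4")
{code}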

Any ideas? Am I missing something?

> DataFrame fill after pivot causing org.apache.spark.sql.AnalysisException
> -------------------------------------------------------------------------
>
>                 Key: SPARK-17237
>                 URL: https://issues.apache.org/jira/browse/SPARK-17237
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.0.0
>            Reporter: Jiang Qiqi
>            Assignee: Takeshi Yamamuro
>              Labels: newbie
>             Fix For: 2.0.3, 2.1.1, 2.2.0
>
>
> I am trying to run a pivot transformation which I ran on a Spark 1.6 cluster, namely
> scala> sc.parallelize(Seq((2,3,4), (3,4,5))).toDF("a", "b", "c")
> res1: org.apache.spark.sql.DataFrame = [a: int, b: int, c: int]
> scala> res1.groupBy("a").pivot("b").agg(count("c"), avg("c")).na.fill(0)
> res2: org.apache.spark.sql.DataFrame = [a: int, 3_count(c): bigint, 3_avg(c): double, 4_count(c): bigint, 4_avg(c): double]
> scala> res1.groupBy("a").pivot("b").agg(count("c"), avg("c")).na.fill(0).show
> +---+----------+--------+----------+--------+
> |  a|3_count(c)|3_avg(c)|4_count(c)|4_avg(c)|
> +---+----------+--------+----------+--------+
> |  2|         1|     4.0|         0|     0.0|
> |  3|         0|     0.0|         1|     5.0|
> +---+----------+--------+----------+--------+
> After upgrading the environment to Spark 2.0, I got an error while executing 
> the .na.fill method:
> scala> sc.parallelize(Seq((2,3,4), (3,4,5))).toDF("a", "b", "c")
> res3: org.apache.spark.sql.DataFrame = [a: int, b: int ... 1 more field]
> scala> res3.groupBy("a").pivot("b").agg(count("c"), avg("c")).na.fill(0)
> org.apache.spark.sql.AnalysisException: syntax error in attribute name: `3_count(`c`)`;
>   at org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute$.e$1(unresolved.scala:103)
>   at org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute$.parseAttributeName(unresolved.scala:113)
>   at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveQuoted(LogicalPlan.scala:168)
>   at org.apache.spark.sql.Dataset.resolve(Dataset.scala:218)
>   at org.apache.spark.sql.Dataset.col(Dataset.scala:921)
>   at org.apache.spark.sql.DataFrameNaFunctions.org$apache$spark$sql$DataFrameNaFunctions$$fillCol(DataFrameNaFunctions.scala:411)
>   at org.apache.spark.sql.DataFrameNaFunctions$$anonfun$2.apply(DataFrameNaFunctions.scala:162)
>   at org.apache.spark.sql.DataFrameNaFunctions$$anonfun$2.apply(DataFrameNaFunctions.scala:159)
>   at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>   at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186)
>   at org.apache.spark.sql.DataFrameNaFunctions.fill(DataFrameNaFunctions.scala:159)
>   at org.apache.spark.sql.DataFrameNaFunctions.fill(DataFrameNaFunctions.scala:149)
>   at org.apache.spark.sql.DataFrameNaFunctions.fill(DataFrameNaFunctions.scala:134)


