AngersZhuuuu opened a new pull request #33810:
URL: https://github.com/apache/spark/pull/33810


   ### What changes were proposed in this pull request?
   This PR fixes an issue where the field names of structs generated by the arrays_zip function could be unexpectedly rewritten by the analyzer/optimizer.
   Here is an example.
   ```
   val df = sc.parallelize(Seq((Array(1, 2), Array(3, 4)))).toDF("a1", "b1").selectExpr("arrays_zip(a1, b1) as zipped")
   df.printSchema
   root
    |-- zipped: array (nullable = true)
    |    |-- element: struct (containsNull = false)
    |    |    |-- a1: integer (nullable = true)    // OK. a1 is expected name
    |    |    |-- b1: integer (nullable = true)    // OK. b1 is expected name
   
   df.explain
   == Physical Plan ==
   *(1) Project [arrays_zip(_1#3, _2#4) AS zipped#12]    // Not OK. Field names are re-written as _1 and _2 respectively
   
   df.write.parquet("/tmp/test.parquet")
   val df2 = spark.read.parquet("/tmp/test.parquet")
   
   df2.printSchema
   root
    |-- zipped: array (nullable = true)
    |    |-- element: struct (containsNull = true)
    |    |    |-- _1: integer (nullable = true)    // Not OK. a1 is expected but got _1
    |    |    |-- _2: integer (nullable = true)    // Not OK. b1 is expected but got _2
   ```
   This issue occurs when aliases are eliminated by AliasHelper.replaceAliasButKeepName or AliasHelper.trimNonTopLevelAliases, which are called via the analyzer/optimizer at the following locations:
   
   spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala, line 883 in b89cd8d:

    upper.map(replaceAliasButKeepName(_, aliases))

   spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala, line 3759 in b89cd8d:

    val cleanedProjectList = projectList.map(trimNonTopLevelAliases)
   
   I looked for other functions that could be affected by this issue, but so far I have found only arrays_zip.
   To fix this issue, this PR changes the definition of ArraysZip so that it retains the field names, which prevents the analyzer/optimizer from rewriting them.
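
As a rough illustration of the mechanism described above, here is a toy model in plain Scala. It is not Spark's Catalyst code: the classes `Attr`, `Alias`, `ZipDerived`, `ZipNamed` and the helper `trimAliases` are hypothetical stand-ins, and the sketch assumes only that an expression which derives its field names from its children loses them once the aliases are stripped, whereas one that records the names up front does not.

```scala
// Toy model only: these classes are hypothetical and are not Spark's Catalyst classes.
sealed trait Expr
case class Attr(name: String) extends Expr
case class Alias(child: Expr, name: String) extends Expr
// Variant A: struct field names are derived from the children on demand.
case class ZipDerived(children: Seq[Expr]) extends Expr
// Variant B: struct field names are captured once, at construction time.
case class ZipNamed(children: Seq[Expr], names: Seq[String]) extends Expr

object AliasDemo {
  // Stand-in for an alias-eliminating rewrite: drops every Alias wrapper.
  def trimAliases(e: Expr): Expr = e match {
    case Alias(child, _)     => trimAliases(child)
    case ZipDerived(cs)      => ZipDerived(cs.map(trimAliases))
    case ZipNamed(cs, names) => ZipNamed(cs.map(trimAliases), names)
    case a: Attr             => a
  }

  // Field names of the struct produced by a zip expression.
  def fieldNames(e: Expr): Seq[String] = e match {
    case ZipNamed(_, names) => names
    case ZipDerived(cs) =>
      cs.zipWithIndex.map {
        case (Attr(n), _)     => n
        case (Alias(_, n), _) => n
        case (_, i)           => i.toString
      }
    case _ => Nil
  }

  // Mirrors toDF("a1", "b1"): tuple columns _1 and _2 aliased to a1 and b1.
  val zipArgs: Seq[Expr] = Seq(Alias(Attr("_1"), "a1"), Alias(Attr("_2"), "b1"))

  def main(args: Array[String]): Unit = {
    val derived = ZipDerived(zipArgs)
    val named   = ZipNamed(zipArgs, Seq("a1", "b1"))
    println(fieldNames(derived).mkString(", "))              // a1, b1
    println(fieldNames(trimAliases(derived)).mkString(", ")) // _1, _2
    println(fieldNames(trimAliases(named)).mkString(", "))   // a1, b1
  }
}
```

In this sketch the names survive the rewrite only in the variant that stores them explicitly, which is the general idea behind retaining field names in the expression definition.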
   
   ### Why are the changes needed?
   This is clearly a bug.
   
   ### Does this PR introduce any user-facing change?
   No. After this change, the field names are no longer rewritten, which is the behavior users should expect.
   
   ### How was this patch tested?
   New tests.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


