xuanyuanking opened a new pull request #25768: [SPARK-29063][SQL] Modify 
fillValue approach to support joined dataframe
URL: https://github.com/apache/spark/pull/25768
 
 
   ### What changes were proposed in this pull request?
   Modify the approach in `DataFrameNaFunctions.fillValue`: the new implementation uses 
`df.withColumns`, which only addresses the columns that need to be filled. After this 
change, ambiguous fields are no longer detected for a joined dataframe.
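
   The idea can be sketched with public APIs. This is an illustration only, not the code in this PR: `fillOnly` is a hypothetical helper, it uses the public `withColumn` rather than `df.withColumns`, and it assumes the target columns are string-typed and uniquely named.

   ```scala
   import org.apache.spark.sql.{DataFrame, SparkSession}
   import org.apache.spark.sql.functions.{coalesce, col, lit}

   // Hypothetical helper (not part of Spark or of this PR).
   // Replaces nulls only in the named string columns; every other column,
   // including duplicated names such as `f2`, is carried over untouched.
   def fillOnly(df: DataFrame, value: String, cols: Seq[String]): DataFrame =
     cols.foldLeft(df) { (acc, name) =>
       // Only `name` is resolved by name here, so it must be unambiguous.
       acc.withColumn(name, coalesce(col(name), lit(value)))
     }
   ```

   Because only the columns being filled are looked up by name, a call such as `fillOnly(df_join, "", Seq("f4"))` never has to resolve the duplicated `f2`.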
   
   ### Why are the changes needed?
   Before this change, when you have a joined table that has the same field 
name in both original tables, fillna fails even if you specify a subset 
that does not include the 'ambiguous' fields.
   ```
   scala> val df1 = Seq(("f1-1", "f2", null), ("f1-2", null, null), ("f1-3", "f2", "f3-1"), ("f1-4", "f2", "f3-1")).toDF("f1", "f2", "f3")
   scala> val df2 = Seq(("f1-1", null, null), ("f1-2", "f2", null), ("f1-3", "f2", "f4-1")).toDF("f1", "f2", "f4")
   scala> val df_join = df1.alias("df1").join(df2.alias("df2"), Seq("f1"), joinType="left_outer")
   scala> df_join.na.fill("", cols=Seq("f4"))
   
   org.apache.spark.sql.AnalysisException: Reference 'f2' is ambiguous, could be: df1.f2, df2.f2.;
   ```
   
   ### Does this PR introduce any user-facing change?
   Yes. The fillna operation on a joined table now succeeds and returns the correct result when the specified subset does not include the ambiguous fields.
   
   
   ### How was this patch tested?
   Tested locally and with a newly added unit test (UT).
   

