imback82 commented on a change in pull request #26593: [SPARK-29890][SQL] DataFrameNaFunctions.fill should handle duplicate columns
URL: https://github.com/apache/spark/pull/26593#discussion_r349209357
 
 

 ##########
 File path: sql/core/src/main/scala/org/apache/spark/sql/DataFrameNaFunctions.scala
 ##########
 @@ -468,12 +477,26 @@ final class DataFrameNaFunctions private[sql](df: DataFrame) {
       s"Unsupported value type ${v.getClass.getName} ($v).")
   }
 
+  private def toAttributes(cols: Seq[String]): Seq[Attribute] = {
+    def resolve(colName: String) : Attribute = {
+      df.col(colName).named.toAttribute match {
+        case a: Attribute => a
+        case _ => throw new IllegalArgumentException(s"'$colName' is not a top level column.")
 
 Review comment:
   (Let me merge different threads here)
   
   The previous implementation resolved the column names as follows:
   1) Check the column names against the schema (`df.schema.fields`).
   2) Pass only the column names that matched top-level fields to `df.col`.
   
   So, if we want to keep the previous behavior of `fill` (for handling `*`, etc.), we first need to check the column names against the schema, and then call `df.col(colName).named.toAttribute`.
   
   In that case, a nested field will not pass the schema check (since it checks only against the top-level fields), so `toAttribute` will always return an `Attribute`. I am therefore not sure how this branch can be tested. What do you think?
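
   To make the argument concrete, here is a minimal, Spark-free sketch of the two-step resolution. Everything in it is hypothetical: `TopLevel` and `NestedField` stand in for an `Attribute` and a non-`Attribute` result, the toy schema is made up, and `resolve` only imitates `df.col(colName).named.toAttribute`; it is not Spark's resolver.

```scala
// Hypothetical model; none of these types exist in Spark.
sealed trait Resolved
final case class TopLevel(name: String) extends Resolved    // stands in for Attribute
final case class NestedField(path: String) extends Resolved // stands in for a non-Attribute result

object FillResolutionSketch {
  // Toy schema: top-level column names, plus the nested field names under struct columns.
  val topLevel: Seq[String] = Seq("id", "person")
  val nested: Map[String, Seq[String]] = Map("person" -> Seq("name", "age"))

  // Stand-in for df.col(colName).named.toAttribute in this toy model.
  def resolve(colName: String): Resolved = colName.split('.').toList match {
    case top :: Nil if topLevel.contains(top) =>
      TopLevel(top)
    case top :: field :: Nil if nested.getOrElse(top, Nil).contains(field) =>
      NestedField(colName)
    case _ =>
      throw new IllegalArgumentException(s"Cannot resolve '$colName'")
  }

  // Step 1: keep only names that match a top-level field.
  // Step 2: resolve the surviving names.
  def toAttributes(cols: Seq[String]): Seq[Resolved] =
    cols.filter(topLevel.contains).map(resolve)

  def main(args: Array[String]): Unit = {
    // "person.name" and "missing" are dropped by the schema check in step 1,
    // so resolve never sees a nested reference here.
    println(toAttributes(Seq("id", "person.name", "missing")))
  }
}
```

   In this model `resolve("person.name")` on its own does yield a `NestedField`, but `toAttributes` never produces one, because the filter in step 1 removes the name before step 2 runs. That is the sense in which the `case _` branch looks unreachable once the schema check comes first.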
