[GitHub] [spark] SaurabhChawla100 edited a comment on pull request #32972: [SPARK-35756][SQL] unionByName supports struct having same col names but different sequence

GitBox Sun, 20 Jun 2021 05:10:09 -0700


SaurabhChawla100 edited a comment on pull request #32972:
URL: https://github.com/apache/spark/pull/32972#issuecomment-864544171



   > They're mostly different issues. This is more of a semantics thing. If you
   > have two nested structs with the same fields, but in a different order, you
   > have to set `allowMissingCol` to true in order for the structs to be sorted,
   > which isn't very intuitive. This is trying to make the `ByName` part apply to
   > nested structs as well, and leave `allowMissingCol` to just actually apply to
   > missing (possibly nested) columns.
   >
   > So I do think this idea makes sense, but I don't think the implementation
   > handles multiple levels of nested structs correctly. `addFields` assumes adding
   > missing columns, so I think you could end up with a case that adds null nested
   > columns even if `allowMissingCol` is false.
   >
   > I think the logic would have to be added to `addFields` to handle whether
   > or not it should add null missing columns.
   
   > So I do think this idea makes sense, but I don't think the implementation
   > handles multiple levels of nested structs correctly. `addFields` assumes adding
   > missing columns, so I think you could end up with a case that adds null nested
   > columns even if `allowMissingCol` is false.
   
   @Kimahriman - I don't understand how this PR could add null nested columns
   for missing fields when `allowMissingCol` is false.
   
   ```
   case (source: StructType, target: StructType)
       if !allowMissingCol && !source.sameType(target) &&
         target.toAttributes.map(attr => attr.name).sorted
           == source.toAttributes.map(x => x.name).sorted =>
     // Having an output with same name, but different struct type.
     // We will sort columns in the struct expression to make sure two sides of
     // union have consistent schema.
     aliased += foundAttr
     Alias(addFields(foundAttr, target), foundAttr.name)()
   ```
   In this PR we only call `addFields` when the source and target structs have
   the same field names, i.e. when
   `target.toAttributes.map(attr => attr.name).sorted == source.toAttributes.map(x => x.name).sorted`.
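
To make the guard concrete, here is a minimal sketch of the same name comparison. It uses a hypothetical, simplified model (`DType`, `Field`, `StructT`, `sameFieldNames`) rather than Spark's own `StructType` classes, so it only illustrates the check and is not the code in this PR.

```scala
// Hypothetical stand-ins for Spark's DataType / StructField / StructType.
object StructGuardSketch {
  sealed trait DType
  case object IntT extends DType
  case object StringT extends DType
  final case class Field(name: String, dataType: DType)
  final case class StructT(fields: Seq[Field]) extends DType

  // Same idea as the guard: compare the sorted field names of both sides.
  // Only the names at this level of the struct are compared.
  def sameFieldNames(source: StructT, target: StructT): Boolean =
    source.fields.map(_.name).sorted == target.fields.map(_.name).sorted

  val left: StructT = StructT(Seq(Field("a", IntT), Field("b", StringT)))
  val reordered: StructT = StructT(Seq(Field("b", StringT), Field("a", IntT)))
  val missingB: StructT = StructT(Seq(Field("a", IntT)))

  def main(args: Array[String]): Unit = {
    println(sameFieldNames(left, reordered)) // true: same names, different order
    println(sameFieldNames(left, missingB))  // false: "b" is missing on one side
  }
}
```

With this check, a pair like `left` and `missingB` would not reach the `addFields` call in the case above.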
   
    
   ```
   val missingFieldsOpt =
     StructType.findMissingFields(col.dataType.asInstanceOf[StructType], target, resolver)
   ```
   So `missingFieldsOpt` is always empty in this case, and when
   `allowMissingCol` is false we only sort the struct fields:
   ```
   if (missingFieldsOpt.isEmpty) {
     sortStructFields(col)
   }
   ```
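
As an illustration of what "just sorting" means for the resulting schema, the sketch below sorts struct fields by name and recurses into nested structs. The model and the `sortFields` helper are hypothetical; this is not Spark's `sortStructFields`, and whether that function behaves the same way at every nesting level is an assumption here.

```scala
// Hypothetical stand-ins for Spark's DataType / StructField / StructType.
object SortFieldsSketch {
  sealed trait DType
  case object IntT extends DType
  case object StringT extends DType
  final case class Field(name: String, dataType: DType)
  final case class StructT(fields: Seq[Field]) extends DType

  // Sort fields by name at this level, then apply the same rule to nested structs.
  def sortFields(dt: DType): DType = dt match {
    case StructT(fields) =>
      StructT(fields.sortBy(_.name).map(f => f.copy(dataType = sortFields(f.dataType))))
    case other => other
  }

  // Two structs with the same fields in a different order, one level nested.
  val left: StructT = StructT(Seq(
    Field("a", IntT),
    Field("s", StructT(Seq(Field("x", IntT), Field("y", StringT))))))
  val right: StructT = StructT(Seq(
    Field("s", StructT(Seq(Field("y", StringT), Field("x", IntT)))),
    Field("a", IntT)))

  def main(args: Array[String]): Unit = {
    // After sorting, both sides have an identical schema and no field is added.
    println(sortFields(left) == sortFields(right)) // true
  }
}
```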
   
   Please do let me know if my understanding is not correct here.
   


