ZiyaZa opened a new pull request, #52758:
URL: https://github.com/apache/spark/pull/52758

   ### What changes were proposed in this pull request?
   
   This PR fixes a bug in #52557, which reads an additional field when all 
the requested fields of a struct are missing from the Parquet file. That logic 
always picked the cheapest leaf column of the struct. However, if this leaf 
was inside a Map column, we would generate an invalid Map type like the 
following:
   
   ```
   optional group _1 (MAP) {
     repeated group key_value {
       required boolean key;
     }
   }
   ```
   
   Since there is no `value` field in this group, we would fail later when 
trying to convert this Parquet type to a Spark type. This PR changes the 
additional field selection logic to always select a field from both the key 
and the value of the map, which now gives us a type like the following:
   
   ```
   optional group _1 (MAP) {
     repeated group key_value {
       required boolean key;
       optional group value {
         optional int32 _2;
       }
     }
   }
   ```
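
The selection rule described above can be illustrated with a small, self-contained sketch. This is not Spark's implementation: the `Leaf`, `Struct`, and `MapType` classes, the integer `cost` values, the `prune` function, and the extra `_3` field are all hypothetical stand-ins chosen only to show the idea that a struct keeps its cheapest child while a map must keep something from both its key and its value.

```python
from dataclasses import dataclass
from typing import Dict, Union


@dataclass
class Leaf:
    cost: int  # hypothetical relative read cost (e.g. byte width)


@dataclass
class Struct:
    fields: Dict[str, "Node"]  # assumed non-empty in this sketch


@dataclass
class MapType:
    key: "Node"
    value: "Node"


Node = Union[Leaf, Struct, MapType]


def cost(node: Node) -> int:
    """Total cost of all leaves under a node."""
    if isinstance(node, Leaf):
        return node.cost
    if isinstance(node, MapType):
        return cost(node.key) + cost(node.value)
    return sum(cost(child) for child in node.fields.values())


def prune(node: Node) -> Node:
    """Return a cheap sub-schema that is still a structurally valid type."""
    if isinstance(node, Leaf):
        return node
    if isinstance(node, MapType):
        # A map without a value is invalid, so keep a field from both sides.
        return MapType(prune(node.key), prune(node.value))
    # Struct: prune every child first, then keep only the cheapest one.
    pruned = {name: prune(child) for name, child in node.fields.items()}
    cheapest = min(pruned, key=lambda name: cost(pruned[name]))
    return Struct({cheapest: pruned[cheapest]})


# Toy schema mirroring the example: a map with a boolean key (cost 1) and a
# struct value holding an int32 `_2` (cost 4) plus a hypothetical int64 `_3`.
schema = Struct({"_1": MapType(Leaf(1), Struct({"_2": Leaf(4), "_3": Leaf(8)}))})
pruned_schema = prune(schema)
```

In this sketch, picking only the globally cheapest leaf would keep the boolean key and drop the value entirely, which corresponds to the invalid type shown first. Recursing into both sides of the map keeps the key and the cheapest value field `_2`.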
   
   
   ### Why are the changes needed?
   
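   To fix a critical bug where we would throw an exception when reading a 
Parquet file if all the requested fields of a struct are missing and the 
cheapest leaf column of that struct is inside a Map column.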
   
   ### Does this PR introduce _any_ user-facing change?
   
   No.
   
   ### How was this patch tested?
   
   New unit tests.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   No.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

