voonhous commented on PR #17904:
URL: https://github.com/apache/hudi/pull/17904#issuecomment-3764202679

   Yep, parquetReader does null padding. This is why `InternalSchemaMerger` 
appends a suffix to a column's name when the same column name maps to 
different `fieldId`s in different schema versions.
   
   This occurs during an operation like this:
   
   schema_v1:
   ```
   col_a STRING -- id: 1
   col_b INT -- id: 2
   ```
   
   schema operation:
   ```
   DROP col_b;
   ADD col_b UUID;
   ```
   
   schema_v2:
   ```
   col_a STRING -- id: 1
   col_b UUID -- id: 3
   ```
   
   When reading the latest snapshot, Hudi passes the following schema to 
parquetReader for older file groups of that snapshot that were written 
before the schema evolution operation:
   
   ```
   col_a, col_bsuffix
   ```
   
   This causes `col_b` to return nulls for those files, i.e. the old INT data 
is not read out, because the latest snapshot calls for `col_b` with a UUID type.
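   
   A minimal Python sketch of the idea above, not Hudi's actual code: 
`Field`, `merge_read_schema`, `read_row` and `SUFFIX` are hypothetical names 
used only for illustration, and type changes and nested fields are ignored. 
Fields are matched by id; a query field whose name exists in the file under a 
different id gets the suffix, so the reader finds no such column and pads it 
with nulls.

```python
from dataclasses import dataclass

# Assumed marker, chosen to match the `col_bsuffix` name shown above.
SUFFIX = "suffix"


@dataclass(frozen=True)
class Field:
    field_id: int
    name: str
    type_name: str


def merge_read_schema(file_schema, query_schema):
    """Build the schema handed to the reader for one older file."""
    file_by_id = {f.field_id: f for f in file_schema}
    names_in_file = {f.name for f in file_schema}
    merged = []
    for q in query_schema:
        if q.field_id in file_by_id:
            # Same logical column: read it under the name stored in the file.
            merged.append(Field(q.field_id, file_by_id[q.field_id].name, q.type_name))
        elif q.name in names_in_file:
            # Same name but a different id: the column was dropped and re-added.
            # Rename it so the reader cannot pick up the stale column.
            merged.append(Field(q.field_id, q.name + SUFFIX, q.type_name))
        else:
            # Column added after the file was written: null-padded as-is.
            merged.append(q)
    return merged


def read_row(file_row, read_schema):
    """Mimic null padding: columns absent from the file come back as None."""
    return {f.name: file_row.get(f.name) for f in read_schema}


schema_v1 = [Field(1, "col_a", "STRING"), Field(2, "col_b", "INT")]
schema_v2 = [Field(1, "col_a", "STRING"), Field(3, "col_b", "UUID")]

read_schema = merge_read_schema(schema_v1, schema_v2)
old_row = {"col_a": "x", "col_b": 42}  # a row written under schema_v1
result = read_row(old_row, read_schema)
```

   Here `read_schema` carries the names `col_a` and `col_bsuffix`, and the old 
INT value `42` never surfaces under the new UUID-typed `col_b`.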
   
   The unsafe projection makes sense to me: it returns null literals typed 
with the column's data type in the latest snapshot.

