zhangyue19921010 commented on PR #13365:
URL: https://github.com/apache/hudi/pull/13365#issuecomment-2959575781

   > @zhangyue19921010 @danny0405 Please check that table_schema matches the 
parquet schema before doing a binary copy. Spark BulkInsert produces parquet 
files whose columns carry the `required` attribute, while those same columns in 
Hudi itself are `optional`. A binary copy would therefore corrupt the parquet 
file; see 
https://github.com/apache/spark/blob/1a3ae66c6c48bb319f0798826085e694fa7d0b58/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedColumnReader.java#L389.
 Suggestion: throw an exception directly when this mismatch is found.
   
   Thanks for the reminder. We are aware of this, which is why we use the 
`alignFieldsNullability` function to resolve nullability inconsistencies. 
However, for existing data files we still need some pre-verification, which I 
will add later.
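   To illustrate the kind of pre-verification meant here, a minimal sketch of a 
nullability check between the table schema and a parquet file's schema. The 
method and class names (`verifyNullabilityMatch`, `NullabilityCheck`) are 
hypothetical, not Hudi's actual API; schemas are modeled as simple 
name-to-nullable maps rather than parquet `MessageType` objects:

   ```java
   import java.util.LinkedHashMap;
   import java.util.Map;

   public class NullabilityCheck {
       // Nullability per column: true = parquet "optional", false = "required".
       // Throws when a column's nullability in the file disagrees with the table
       // schema, since a binary copy would then produce an unreadable file.
       static void verifyNullabilityMatch(Map<String, Boolean> tableSchema,
                                          Map<String, Boolean> fileSchema) {
           for (Map.Entry<String, Boolean> e : fileSchema.entrySet()) {
               Boolean tableNullable = tableSchema.get(e.getKey());
               if (tableNullable == null) {
                   continue; // column absent from table schema; out of scope here
               }
               if (!tableNullable.equals(e.getValue())) {
                   throw new IllegalArgumentException(
                       "Nullability mismatch for column '" + e.getKey()
                       + "': table=" + (tableNullable ? "optional" : "required")
                       + ", file=" + (e.getValue() ? "optional" : "required"));
               }
           }
       }

       public static void main(String[] args) {
           Map<String, Boolean> table = new LinkedHashMap<>();
           table.put("id", true);  // Hudi table schema: column is optional
           Map<String, Boolean> file = new LinkedHashMap<>();
           file.put("id", false);  // Spark BulkInsert wrote it as required
           try {
               verifyNullabilityMatch(table, file);
           } catch (IllegalArgumentException ex) {
               System.out.println(ex.getMessage());
           }
       }
   }
   ```

   A real implementation would read the footer of the existing data file and 
compare each field's repetition against the table schema before deciding 
whether a binary copy is safe.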


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
