zhangyue19921010 commented on PR #13365:
URL: https://github.com/apache/hudi/pull/13365#issuecomment-2975344236

   > @zhangyue19921010 @danny0405 Please check that the table schema matches the parquet schema before doing a binary copy. Spark BulkInsert produces parquet files whose columns have the `required` repetition, while those same columns in Hudi itself are `optional`. A binary copy would therefore corrupt the parquet file (see https://github.com/apache/spark/blob/1a3ae66c6c48bb319f0798826085e694fa7d0b58/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedColumnReader.java#L389). Suggestion: throw an exception directly when such a mismatch is found.
   
   Hi @xiarixiaoyao, I added all the necessary checks in `supportBinaryStreamCopy`, including:
   1. Check whether the schema of every input parquet file supports binary copy:
       1) two-level list structure check
       2) check for decimal types stored as INT32/INT64/INT96
   2. Check that the set of files contains only one `BloomFilterTypeCode` (including null)
   3. Check that the same column across these files has only one repetition type
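The third check can be sketched roughly as below. This is a minimal illustration of the idea, not Hudi's actual implementation: the class name, the `supportsBinaryCopy` method, and the simplified per-file column-to-repetition maps are all hypothetical stand-ins for the real parquet footer metadata.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of check 3 above: every column must carry the same
// repetition (e.g. REQUIRED vs OPTIONAL) in all input files, otherwise a
// raw binary copy could produce a file readers cannot decode correctly.
public class RepetitionCheck {
    enum Repetition { REQUIRED, OPTIONAL, REPEATED }

    /** Returns true iff each column has a single repetition type across all files. */
    static boolean supportsBinaryCopy(List<Map<String, Repetition>> fileSchemas) {
        Map<String, Repetition> seen = new HashMap<>();
        for (Map<String, Repetition> schema : fileSchemas) {
            for (Map.Entry<String, Repetition> e : schema.entrySet()) {
                Repetition prev = seen.putIfAbsent(e.getKey(), e.getValue());
                if (prev != null && prev != e.getValue()) {
                    return false; // mismatch: REQUIRED in one file, OPTIONAL in another
                }
            }
        }
        return true;
    }

    public static void main(String[] args) {
        Map<String, Repetition> sparkFile = Map.of("id", Repetition.REQUIRED);
        Map<String, Repetition> hudiFile  = Map.of("id", Repetition.OPTIONAL);
        System.out.println(supportsBinaryCopy(List.of(sparkFile, sparkFile))); // true
        System.out.println(supportsBinaryCopy(List.of(sparkFile, hudiFile)));  // false
    }
}
```

In the real check the repetition would come from each file's parquet footer schema rather than a prebuilt map, but the comparison logic is the same: bail out as soon as two files disagree on any column.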


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
