[I] [Bug]: Issue with Merging Parquet Files Without Field ID Leading to Misaligned Columns [amoro]

via GitHub Tue, 20 Aug 2024 01:44:07 -0700


wangmingjin163 opened a new issue, #3120:
URL: https://github.com/apache/amoro/issues/3120


   ### What happened?
   
   I encountered an issue when working with Parquet files in the Amoro project. 
The problem arises when Parquet files are written using an Arrow schema 
without field IDs, which later breaks file merging operations. Specifically, 
the columns in the merged files become misaligned, so reads project incorrect 
data.
   ![Screenshot 2024-08-20 at 16 41 
11](https://github.com/user-attachments/assets/07eea7f2-6316-4b43-9078-8e5fe6d799c1)
   ![Screenshot 2024-08-20 at 16 42 
46](https://github.com/user-attachments/assets/d281d539-9c93-47f8-89a4-d18f32a3d946)
   
   
   ### Affects Versions
   
   0.7.0
   
   ### What table formats are you seeing the problem on?
   
   Iceberg
   
   ### What engines are you seeing the problem on?
   
   Optimizer
   
   ### How to reproduce
   
   1. Create Parquet files using an Iceberg schema without including field IDs.
   2. Attempt to merge these Parquet files using Iceberg's rewriteDataFiles 
method.
   3. Observe that the columns in the merged files are misaligned.
   
   ### Relevant log output
   
   _No response_
   
   ### Anything else
   
   Proposed Solution:
   I added a check that applies a NameMapping when reading Parquet files. 
This maps fields by name to their corresponding IDs, preventing misalignment 
during merging.
   
   The key part of the solution is calling 
withNameMapping(NameMappingParser.fromJson(nameMapping)) on the 
Parquet.ReadBuilder when opening Parquet files, so that schema mapping is 
handled correctly even when the files have no field IDs.
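
To illustrate why a name mapping helps, here is a minimal, library-free sketch in plain Java. It is a toy model, not Iceberg or Amoro code: the class `NameMappingSketch`, its methods, and the columns `id`, `name`, `age` are hypothetical. It also assumes that a reader given a file with no field IDs and no name mapping falls back to assigning IDs by column position, which is my understanding of the fallback behaviour rather than something stated in this report.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model only: none of these names come from Iceberg or Amoro.
class NameMappingSketch {

    // Assumed fallback: the column at position i is given field ID i + 1.
    static Map<Integer, String> assignIdsByPosition(List<String> fileColumns) {
        Map<Integer, String> idToColumn = new LinkedHashMap<>();
        for (int i = 0; i < fileColumns.size(); i++) {
            idToColumn.put(i + 1, fileColumns.get(i));
        }
        return idToColumn;
    }

    // Name mapping: each file column gets the ID registered for its name.
    // Columns whose name is not in the mapping are skipped.
    static Map<Integer, String> assignIdsByName(
            List<String> fileColumns, Map<String, Integer> nameToId) {
        Map<Integer, String> idToColumn = new LinkedHashMap<>();
        for (String column : fileColumns) {
            Integer id = nameToId.get(column);
            if (id != null) {
                idToColumn.put(id, column);
            }
        }
        return idToColumn;
    }

    // For each table field ID, in table order, return the file column it
    // resolves to (null when no file column carries that ID).
    static List<String> resolve(
            List<Integer> tableFieldIds, Map<Integer, String> idToColumn) {
        List<String> resolved = new ArrayList<>();
        for (Integer id : tableFieldIds) {
            resolved.add(idToColumn.get(id));
        }
        return resolved;
    }

    public static void main(String[] args) {
        // Hypothetical table schema: 1 -> id, 2 -> name, 3 -> age.
        List<Integer> tableFieldIds = List.of(1, 2, 3);
        Map<String, Integer> nameToId = Map.of("id", 1, "name", 2, "age", 3);

        // A file written without field IDs, in a different column order.
        List<String> fileColumns = List.of("name", "id", "age");

        // Positional IDs: table field "id" reads the file's "name" column.
        System.out.println(resolve(tableFieldIds, assignIdsByPosition(fileColumns)));
        // prints [name, id, age]

        // Name-based IDs: every table field reads the matching column.
        System.out.println(resolve(tableFieldIds, assignIdsByName(fileColumns, nameToId)));
        // prints [id, name, age]
    }
}
```

In this model the positional result hands the table's `id` field the file's `name` column, which is the kind of misalignment described above. Resolving by name restores the intended pairing regardless of column order.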
   
   ### Are you willing to submit a PR?
   
   - [X] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [X] I agree to follow this project's Code of Conduct


