mikedias opened a new pull request, #742:
URL: https://github.com/apache/incubator-xtable/pull/742
Issue: https://github.com/apache/incubator-xtable/issues/275

Draft PR to add support for [Apache Paimon](https://paimon.apache.org/) as a source table. I'm sharing it early to collect feedback on whether the approach is right and on a few issues I've run into. 🙂

Things still missing:

- **Comprehensive unit tests:** I've relied on `ITConversionController` to validate the code so far. Once we are good with the approach, I'll cover the implementation with more tests.
- **Incremental sync:** I've only implemented the `ConversionSource` methods for the snapshot sync. Once we are good with the approach, I'll implement the incremental sync methods.
- **Target to Paimon:** I've focused only on implementing Paimon as a source. The target implementation will be out of scope for this contribution, if that's okay.

Things where I need help:

### **Hudi partitions and Paimon buckets**

Paimon adds buckets as another folder level within partitions (e.g. `partition=2025-09-01/bucket-0/data-file.parquet`), whereas Hudi treats the bucket directory as part of the partition value (e.g. `2025-09-01/bucket-0`). Looking at the code, the assumption that directory structure == partition values runs pretty deep, so I wonder whether there is a way to work around it, or whether we should call it out as a limitation between the two formats. (I sketch one possible direction at the end of this description.)

### **Iceberg Parquet conversion errors**

When running `ITConversionController#testVariousOperations` with source=paimon and target=iceberg, I'm hitting the following error:

```
class org.apache.iceberg.shaded.org.apache.arrow.vector.IntVector cannot be cast to class org.apache.iceberg.shaded.org.apache.arrow.vector.BaseVariableWidthVector
```

From debugging, it looks like the Parquet reader is trying to read the `id int` field with the `VarWidthReader` class instead of `IntegerReader`, causing the conversion failure:

<img width="512" height="173" alt="image" src="https://github.com/user-attachments/assets/d4f98940-e0f9-4ba2-9d57-2a1403457a02" />

Disabling vectorization doesn't help; it just moves the error up to the Spark level, which suggests there is something wrong with the schema conversion for Iceberg that I don't quite understand... 🤔 (A quick field-ID sanity check is sketched at the end of this description.)

Thank you so much in advance for your help!
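
---

To make the bucket question concrete, the direction I had in mind is to strip the `bucket-<n>` directory before deriving partition values, instead of letting it leak into the partition path. A purely illustrative sketch, assuming Paimon's `<partition dirs>/bucket-<n>/<data file>` layout; the class and names below are hypothetical and not from the codebase:

```java
import java.util.Arrays;
import java.util.List;

final class PaimonPathParser {

  static final class PartitionedFile {
    final List<String> partitionValues; // e.g. ["partition=2025-09-01"]
    final int bucket;                   // e.g. 0
    final String fileName;              // e.g. "data-file.parquet"

    PartitionedFile(List<String> partitionValues, int bucket, String fileName) {
      this.partitionValues = partitionValues;
      this.bucket = bucket;
      this.fileName = fileName;
    }
  }

  // Splits a table-relative data file path into partition values, bucket index,
  // and file name, so the bucket directory is never mistaken for a partition value.
  static PartitionedFile parse(String relativePath) {
    String[] segments = relativePath.split("/");
    int bucketIdx = segments.length - 2; // the bucket dir immediately precedes the file name
    if (bucketIdx < 0 || !segments[bucketIdx].startsWith("bucket-")) {
      throw new IllegalArgumentException("No bucket directory in path: " + relativePath);
    }
    int bucket = Integer.parseInt(segments[bucketIdx].substring("bucket-".length()));
    List<String> partitionValues = Arrays.asList(segments).subList(0, bucketIdx);
    return new PartitionedFile(partitionValues, bucket, segments[segments.length - 1]);
  }
}
```

If the directory-structure == partition-values assumption can't be relaxed on the Hudi side, then documenting the `2025-09-01/bucket-0` behavior as a known limitation may be the fallback.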
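
On the Iceberg error, one thing I still want to rule out: as far as I understand, Iceberg's vectorized Parquet reader picks column readers by field ID, so field IDs in the Paimon-written files that don't line up with the IDs in the converted Iceberg schema could explain the wrong reader being chosen. A hypothetical debugging helper to eyeball that (the class name is mine; this is not part of the PR):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.iceberg.Schema;
import org.apache.iceberg.types.Types;
import org.apache.parquet.format.converter.ParquetMetadataConverter;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.schema.MessageType;
import org.apache.parquet.schema.Type;

final class FieldIdCheck {

  // Prints each top-level Parquet field ID next to the ID of the same-named field
  // in the converted Iceberg schema; any mismatch (or missing ID) is suspect.
  static void compare(Schema icebergSchema, String parquetFile) throws Exception {
    MessageType fileSchema =
        ParquetFileReader.readFooter(
                new Configuration(), new Path(parquetFile), ParquetMetadataConverter.NO_FILTER)
            .getFileMetaData()
            .getSchema();
    for (Type field : fileSchema.getFields()) {
      String parquetId = field.getId() == null ? "none" : String.valueOf(field.getId().intValue());
      Types.NestedField tableField = icebergSchema.findField(field.getName());
      String icebergId = tableField == null ? "missing" : String.valueOf(tableField.fieldId());
      System.out.printf("%s: parquet id=%s, iceberg id=%s%n", field.getName(), parquetId, icebergId);
    }
  }
}
```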
