mikedias opened a new pull request, #742:
URL: https://github.com/apache/incubator-xtable/pull/742
Issue: https://github.com/apache/incubator-xtable/issues/275

Draft PR to add support for [Apache Paimon](https://paimon.apache.org/) as a source table. I'm sharing it early to collect feedback on whether the approach is right and on a few issues I've run into. 🙂

Things still missing:

- **Comprehensive unit tests:** I've relied on `ITConversionController` to validate the code so far. Once we are good with the approach, I'll cover the implementation with more tests.
- **Incremental sync:** I've only implemented the `ConversionSource` methods for the snapshot sync. Once we are good with the approach, I'll implement the incremental sync methods.
- **Target to Paimon:** I've focused only on implementing Paimon as a source. The target implementation will be out of scope for this contribution, if that's okay.

Things where I need help:

### **Hudi partitions and Paimon buckets**

Paimon adds buckets as another folder level within partitions (e.g. `partition=2025-09-01/bucket-0/data-file.parquet`), whereas Hudi treats the bucket directory as part of the partition value (e.g. `2025-09-01/bucket-0`). Looking at the code, the assumption that directory structure == partition values runs pretty deep, so I wonder whether there is a way to work around it, or whether we should call it out as a limitation between the two formats. (I sketch one possible direction at the end of this description.)

### **Iceberg Parquet conversion errors**

When running `ITConversionController#testVariousOperations` with source=paimon and target=iceberg, I'm hitting the following error:

```
class org.apache.iceberg.shaded.org.apache.arrow.vector.IntVector cannot be cast to class org.apache.iceberg.shaded.org.apache.arrow.vector.BaseVariableWidthVector
```

From debugging, it looks like the Parquet reader is trying to read the `id int` field with the `VarWidthReader` class instead of `IntegerReader`, causing the conversion failure:

<img width="512" height="173" alt="image" src="https://github.com/user-attachments/assets/d4f98940-e0f9-4ba2-9d57-2a1403457a02" />

Disabling vectorization doesn't help; it just moves the error up to the Spark level, which suggests there is something wrong with the schema conversion for Iceberg that I don't quite understand... 🤔 (A quick field-ID sanity check is sketched at the end of this description.)

Thank you so much in advance for your help!
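
---

To make the bucket question concrete, the direction I had in mind is to strip the `bucket-<n>` directory before deriving partition values, instead of letting it leak into the partition path. A purely illustrative sketch, assuming Paimon's `<partition dirs>/bucket-<n>/<data file>` layout; the class and names below are hypothetical and not from the codebase:

```java
import java.util.Arrays;
import java.util.List;

final class PaimonPathParser {

  static final class PartitionedFile {
    final List<String> partitionValues; // e.g. ["partition=2025-09-01"]
    final int bucket;                   // e.g. 0
    final String fileName;              // e.g. "data-file.parquet"

    PartitionedFile(List<String> partitionValues, int bucket, String fileName) {
      this.partitionValues = partitionValues;
      this.bucket = bucket;
      this.fileName = fileName;
    }
  }

  // Splits a table-relative data file path into partition values, bucket index,
  // and file name, so the bucket directory is never mistaken for a partition value.
  static PartitionedFile parse(String relativePath) {
    String[] segments = relativePath.split("/");
    int bucketIdx = segments.length - 2; // the bucket dir immediately precedes the file name
    if (bucketIdx < 0 || !segments[bucketIdx].startsWith("bucket-")) {
      throw new IllegalArgumentException("No bucket directory in path: " + relativePath);
    }
    int bucket = Integer.parseInt(segments[bucketIdx].substring("bucket-".length()));
    List<String> partitionValues = Arrays.asList(segments).subList(0, bucketIdx);
    return new PartitionedFile(partitionValues, bucket, segments[segments.length - 1]);
  }
}
```

If the directory-structure == partition-values assumption can't be relaxed on the Hudi side, then documenting the `2025-09-01/bucket-0` behavior as a known limitation may be the fallback.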
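
On the Iceberg error, one thing I still want to rule out: as far as I understand, Iceberg's vectorized Parquet reader picks column readers by field ID, so field IDs in the Paimon-written files that don't line up with the IDs in the converted Iceberg schema could explain the wrong reader being chosen. A hypothetical debugging helper to eyeball that (the class name is mine; this is not part of the PR):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.iceberg.Schema;
import org.apache.iceberg.types.Types;
import org.apache.parquet.format.converter.ParquetMetadataConverter;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.schema.MessageType;
import org.apache.parquet.schema.Type;

final class FieldIdCheck {

  // Prints each top-level Parquet field ID next to the ID of the same-named field
  // in the converted Iceberg schema; any mismatch (or missing ID) is suspect.
  static void compare(Schema icebergSchema, String parquetFile) throws Exception {
    MessageType fileSchema =
        ParquetFileReader.readFooter(
                new Configuration(), new Path(parquetFile), ParquetMetadataConverter.NO_FILTER)
            .getFileMetaData()
            .getSchema();
    for (Type field : fileSchema.getFields()) {
      String parquetId = field.getId() == null ? "none" : String.valueOf(field.getId().intValue());
      Types.NestedField tableField = icebergSchema.findField(field.getName());
      String icebergId = tableField == null ? "missing" : String.valueOf(tableField.fieldId());
      System.out.printf("%s: parquet id=%s, iceberg id=%s%n", field.getName(), parquetId, icebergId);
    }
  }
}
```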
