Hello Apache XTable (Incubating) Community,
There's a new feature request from the community
[https://github.com/apache/incubator-xtable/issues/550] that aims to extend
Apache XTable's (Incubating) support to converting Parquet files into modern
table formats (such as Hudi, Iceberg, and Delta) without rewriting data,
while continuously adding metadata as new files arrive. This feature has
significant potential, especially for systems that produce only Parquet
files and need a seamless path to incorporate them into modern table
formats.
*Proposed Design*
The development will involve creating a Parquet source class in XTable,
which would handle two main operations:
1. *Retrieve Snapshot*: List all Parquet files under the object storage or
   HDFS root path to capture a snapshot. This can be achieved through a
   simple list operation.
2. *Retrieve Change Log Since Last Sync*:
   - *Using List Files*: The class would retrieve Parquet files added since
     the last sync time by filtering on creationTime. While straightforward,
     this approach may be expensive, because each sync has to list the files
     under the root path again.
   - *Using Cloud Notifications Queue*: For object stores, setting up a
     cloud-based notifications queue is a more efficient solution. The queue
     can push file location and creationTime metadata to an events table,
     with the Parquet file location as the primary key to manage duplicates.
     This events table could be implemented in Hudi, Delta, or Iceberg, which
     allows the XTable Parquet source class to query the table incrementally
     and retrieve new files for metadata generation. For HDFS or object
     stores that don't support a queue-based system for file notifications,
     we would need to build or reuse an existing queue implementation for
     file notifications.
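
To make the two operations a bit more concrete, here is a minimal Python sketch. It is only an illustration, not XTable code: the names ParquetFile, list_parquet_files, files_since, and upsert_events are hypothetical, a local directory walk stands in for an object store or HDFS listing, the file modification time stands in for creationTime, and a plain dictionary stands in for the events table keyed by file location.

```python
import os
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class ParquetFile:
    # Hypothetical record: file location plus the timestamp used for filtering.
    path: str
    creation_time: float  # seconds since epoch; st_mtime is used as a stand-in


def list_parquet_files(root: str) -> List[ParquetFile]:
    """Retrieve Snapshot: list every .parquet file under the root path."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.endswith(".parquet"):
                full = os.path.join(dirpath, name)
                found.append(ParquetFile(full, os.stat(full).st_mtime))
    return sorted(found, key=lambda f: f.path)


def files_since(files: Iterable[ParquetFile], last_sync_time: float) -> List[ParquetFile]:
    """Change log via listing: keep only files newer than the last sync time."""
    return [f for f in files if f.creation_time > last_sync_time]


def upsert_events(events_table: Dict[str, ParquetFile],
                  events: Iterable[ParquetFile]) -> Dict[str, ParquetFile]:
    """Events table stand-in: the file location is the primary key,
    so a notification delivered twice overwrites the same row."""
    for event in events:
        events_table[event.path] = event
    return events_table
```

With the list-based approach, a sync would call list_parquet_files on the root path and pass the result to files_since together with the stored last sync time. With the queue-based approach, notifications would go through something like upsert_events, and the source would then read only the rows added since the last sync.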
This is the approach I have in mind. I invite anyone in the community who
is interested in collaborating on the design and implementation of this
feature to respond to this email or join the discussion directly on GitHub.
Your input, whether design suggestions or implementation support, would be
appreciated.
Thanks
Vinish