Hi Vinish, I think this is a great idea to aid in the migration of existing append-only systems to modern table formats.
Regarding the implementation, I don't think that there needs to be an additional events table when using a queue. You can simply consume from the queue and directly update the metadata in the target table formats.

-Tim

On Fri, Oct 25, 2024 at 6:06 PM Vinish Reddy <vin...@apache.org> wrote:

> Hello Apache XTable (Incubating) Community,
>
> There's a new feature request from the community
> [https://github.com/apache/incubator-xtable/issues/550] that aims to extend
> Apache XTable’s (Incubating) support for converting Parquet files to modern
> table formats (like Hudi, Iceberg, and Delta) without requiring data
> rewriting, enabling continuous metadata addition. This feature has
> significant potential, especially for systems that produce Parquet files
> exclusively and need a seamless path to incorporate them into modern table
> formats.
>
> *Proposed Design*
> The development will involve creating a Parquet source class in XTable,
> which would handle two main operations:
>
> 1. *Retrieve Snapshot*: List all Parquet files in ObjectStorage or HDFS
>    root path to capture a snapshot. This can be achieved through a simple
>    list operation.
> 2. *Retrieve Change Log Since Last Sync*:
>    - *Using List Files*: The class would retrieve Parquet files added since
>      the last sync time by filtering based on creationTime. While
>      straightforward, this approach may be expensive due to resource
>      demands.
>    - *Using Cloud Notifications Queue*: For object stores, setting up a
>      cloud-based notifications queue is a more efficient solution. The
>      queue can push file location and creationTime metadata to an events
>      table, with the Parquet file location as the primary key to manage
>      duplicates. This events table could be implemented in Hudi, Delta, or
>      Iceberg, which allows the XTable Parquet source class to query the
>      table incrementally and retrieve new files for metadata generation.
>      For HDFS or object stores which don't support a queue based system
>      for file notifications, we need to build/re-use existing queue
>      implementation for file notifications.
>
> I was thinking of the above approach, any inputs/feedback from the
> community who are interested in collaborating on the design and
> implementation of this feature to respond to this email or join the
> discussion directly on GitHub. Your input, whether in design suggestions or
> implementation support, would be appreciated.
>
> Thanks
> Vinish
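
To make the two change-detection options in the quoted proposal a bit more concrete, here is a rough Python sketch. It is only an illustration under assumptions, not XTable code: it walks a local directory as a stand-in for an object-store or HDFS listing, it uses file modification time as a stand-in for creationTime, and the names FileEvent, list_new_parquet_files, and dedupe_events are hypothetical.

```python
import os
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class FileEvent:
    """Hypothetical record for one newly observed Parquet file."""
    path: str
    creation_time: float  # assumption: mtime stands in for creationTime


def list_new_parquet_files(root: str, last_sync_time: float) -> List[FileEvent]:
    """List-based option: scan everything under root and keep the
    Parquet files whose timestamp is newer than the last sync."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.endswith(".parquet"):
                continue
            full_path = os.path.join(dirpath, name)
            mtime = os.stat(full_path).st_mtime
            if mtime > last_sync_time:
                found.append(FileEvent(full_path, mtime))
    # Sort so that files are processed in a stable, time-based order.
    return sorted(found, key=lambda e: (e.creation_time, e.path))


def dedupe_events(events: Iterable[FileEvent]) -> List[FileEvent]:
    """Queue-based option: notifications may be delivered more than once,
    so keep only the first event seen for each file path."""
    first_seen = {}
    for event in events:
        first_seen.setdefault(event.path, event)
    return list(first_seen.values())
```

The first function shows why the list-based option gets expensive: every sync touches every file under the root. The second shows the duplicate handling that the file-location key is meant to provide, whether the events go through an events table or are consumed straight from the queue.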