Hello Apache XTable (Incubating) Community,

There's a new feature request from the community
(https://github.com/apache/incubator-xtable/issues/550) that aims to extend
Apache XTable's (Incubating) support to converting plain Parquet files to
modern table formats (such as Hudi, Iceberg, and Delta) without rewriting
the data, so that metadata can be added continuously as new files arrive.
This feature has significant potential, especially for systems that
produce Parquet files exclusively and need a seamless path to incorporate
them into modern table formats.

*Proposed Design*
The development will involve creating a Parquet source class in XTable,
which would handle two main operations:


   1.

   *Retrieve Snapshot*: List all Parquet files in ObjectStorage or HDFS
   root path to capture a snapshot. This can be achieved through a simple list
   operation.
   2.

   *Retrieve Change Log Since Last Sync*:
   - *Using List Files*: The class would retrieve Parquet files added since
      the last sync time by filtering based on creationTime. While
      straightforward, this approach may be expensive due to resource demands.
      - *Using Cloud Notifications Queue*: For object stores, setting up a
      cloud-based notifications queue is a more efficient solution.
The queue can
      push file location and creationTime metadata to an events table, with
      the Parquet file location as the primary key to manage duplicates. This
      events table could be implemented in Hudi, Delta, or Iceberg,
which allows
      the XTable Parquet source class to query the table incrementally and
      retrieve new files for metadata generation. For HDFS or object
stores which
      don't support a queue based system for file notifications, we need to
      build/re-use existing queue implementation for file notifications.
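
To make the two operations concrete, here is a rough sketch in Python. It is
only an illustration under stated assumptions, not the proposed XTable
implementation (which would be a Java source class): the names ParquetFile,
retrieve_snapshot, retrieve_changes_since and EventsTable are hypothetical,
a local directory stands in for the object storage or HDFS root path, the
file modification time stands in for creationTime, and a dict stands in for
the Hudi/Delta/Iceberg events table.

```python
import os
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class ParquetFile:
    path: str
    creation_time: float  # assumption: file mtime, in seconds since epoch


def retrieve_snapshot(root: str) -> List[ParquetFile]:
    """Retrieve Snapshot: list every .parquet file under the root path."""
    files = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.endswith(".parquet"):
                full_path = os.path.join(dirpath, name)
                files.append(ParquetFile(full_path, os.path.getmtime(full_path)))
    return sorted(files, key=lambda f: f.path)


def retrieve_changes_since(root: str, last_sync_time: float) -> List[ParquetFile]:
    """Using List Files: re-list everything, keep files newer than the last sync."""
    return [f for f in retrieve_snapshot(root) if f.creation_time > last_sync_time]


class EventsTable:
    """In-memory stand-in for the events table fed by the notifications queue.

    The file location is the primary key, so a notification delivered twice
    overwrites the same row instead of creating a duplicate.
    """

    def __init__(self) -> None:
        self._rows: Dict[str, float] = {}

    def upsert(self, path: str, creation_time: float) -> None:
        self._rows[path] = creation_time

    def files_since(self, last_sync_time: float) -> List[ParquetFile]:
        """Incremental query: files whose creation time is after the last sync."""
        new_files = [
            ParquetFile(path, created)
            for path, created in self._rows.items()
            if created > last_sync_time
        ]
        return sorted(new_files, key=lambda f: f.path)
```

The difference between the two change-log options shows up in the cost of
each sync: retrieve_changes_since walks the whole root path every time,
while EventsTable.files_since only reads rows that the queue has already
written.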


This is the approach I have in mind. If you are interested in collaborating
on the design and implementation of this feature, please respond to this
email or join the discussion directly on GitHub. Your input, whether in
design suggestions or implementation support, would be appreciated.

Thanks
Vinish
