Re: [DISCUSS] Utility tool for syncing parquet files to all three formats delta, hudi, iceberg

Tim Brown Sun, 27 Oct 2024 13:33:17 -0700

Hi Vinish,

I think this is a great idea to aid in the migration of existing
append-only systems to modern table formats.


Regarding the implementation, I don't think that there needs to be an
additional events table when using a queue. You can simply consume from the
queue and directly update the metadata in the target table formats.
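
To make that concrete, here is a rough Python sketch of the flow I have in
mind. It is not XTable code: FileEvent, drain, and sync_batch are hypothetical
names, the in-process queue.Queue only stands in for whatever cloud
notification queue is used, and each "commit" callable is a placeholder for
the real metadata write to one target format.

```python
import queue
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class FileEvent:
    """Hypothetical notification payload: one newly created Parquet file."""
    path: str
    creation_time_ms: int


def drain(events: "queue.Queue[FileEvent]", max_batch: int = 1000) -> List[FileEvent]:
    """Pull up to max_batch distinct files without blocking.

    Duplicates are dropped by path within the batch, so no separate
    events table is needed just to absorb redelivered notifications.
    """
    seen: Dict[str, FileEvent] = {}
    while len(seen) < max_batch:
        try:
            event = events.get_nowait()
        except queue.Empty:
            break
        seen.setdefault(event.path, event)  # keep the first event per path
    return sorted(seen.values(), key=lambda e: e.creation_time_ms)


def sync_batch(
    batch: List[FileEvent],
    targets: Dict[str, Callable[[List[FileEvent]], None]],
) -> None:
    """Hand one batch of new files to every target format's commit callable."""
    if not batch:
        return
    for commit in targets.values():
        commit(batch)  # placeholder for the per-format metadata update


if __name__ == "__main__":
    q: "queue.Queue[FileEvent]" = queue.Queue()
    q.put(FileEvent("s3://bucket/table/a.parquet", 2))
    q.put(FileEvent("s3://bucket/table/b.parquet", 1))
    q.put(FileEvent("s3://bucket/table/a.parquet", 3))  # redelivery

    sync_batch(drain(q), {
        name: (lambda batch, n=name: print(n, [e.path for e in batch]))
        for name in ("hudi", "delta", "iceberg")
    })
```

This only de-duplicates within a single batch. Across batches, the sync would
presumably need to check the files already tracked in the target metadata
before adding a path again.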

-Tim


On Fri, Oct 25, 2024 at 6:06 PM Vinish Reddy <[email protected]> wrote:

> Hello Apache XTable (Incubating) Community,
>
> There's a new feature request from the community
> [https://github.com/apache/incubator-xtable/issues/550] that aims to extend
> Apache XTable’s (Incubating) support for converting Parquet files to modern
> table formats (like Hudi, Iceberg, and Delta) without requiring data
> rewriting, enabling continuous metadata addition. This feature has
> significant potential, especially for systems that produce Parquet files
> exclusively and need a seamless path to incorporate them into modern table
> formats.
>
> *Proposed Design*
> The development will involve creating a Parquet source class in XTable,
> which would handle two main operations:
>
>    1. *Retrieve Snapshot*: List all Parquet files under the object storage
>       or HDFS root path to capture a snapshot. This can be achieved through
>       a simple list operation.
>    2. *Retrieve Change Log Since Last Sync*:
>       - *Using List Files*: The class would retrieve Parquet files added
>         since the last sync time by filtering based on creationTime. While
>         straightforward, this approach may be expensive due to resource
>         demands.
>       - *Using Cloud Notifications Queue*: For object stores, setting up a
>         cloud-based notifications queue is a more efficient solution. The
>         queue can push file location and creationTime metadata to an events
>         table, with the Parquet file location as the primary key to manage
>         duplicates. This events table could be implemented in Hudi, Delta,
>         or Iceberg, which allows the XTable Parquet source class to query
>         the table incrementally and retrieve new files for metadata
>         generation. For HDFS or object stores which don't support a
>         queue-based system for file notifications, we need to build or
>         reuse an existing queue implementation for file notifications.
>
>
> This is the approach I have in mind. I invite anyone in the community who
> is interested in collaborating on the design and implementation of this
> feature to respond to this email or join the discussion directly on GitHub.
> Your input, whether in design suggestions or implementation support, would
> be appreciated.
>
> Thanks
> Vinish
>
