Thanks for the feedback, Tim. Yes, we can avoid maintaining the events table (which stores all file notifications from the queue) to keep the design simple for append-only systems. If we want to support Parquet files being deleted from storage, or versioned Parquet objects, then whether we introduce the events table now or later needs to be discussed. IMO, introducing the events table early would also help with re-processing and audit.
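To make that concrete, here is a rough sketch (in Java) of what a row in such an events table could carry. The class and field names are placeholders for illustration only, not a proposed final schema:

    /**
     * Illustrative events-table record. The Parquet file location acts as the
     * record key, so repeated notifications from the queue for the same file
     * collapse into a single row on upsert, and the retained rows double as an
     * audit trail for re-processing.
     */
    public class FileNotificationEvent {
      public String filePath;      // record key: absolute Parquet file location
      public long creationTime;    // creation timestamp reported by the notification
      public long fileSizeBytes;   // optional, useful for audit and validation
      public String eventType;     // e.g. "ADDED"; a "DELETED" type could cover the delete/versioning cases above
    }

Whether this table lives in Hudi, Delta, or Iceberg, the important part is the upsert-on-filePath semantics for de-duplication.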
-Vinish

On Sun, Oct 27, 2024 at 1:32 PM Tim Brown <tim.brown...@gmail.com> wrote:

> Hi Vinish,
>
> I think this is a great idea to aid in the migration of existing
> append-only systems to modern table formats.
>
> Regarding the implementation, I don't think that there needs to be an
> additional events table when using a queue. You can simply consume from
> the queue and directly update the metadata in the target table formats.
>
> -Tim
>
> On Fri, Oct 25, 2024 at 6:06 PM Vinish Reddy <vin...@apache.org> wrote:
>
> > Hello Apache XTable (Incubating) Community,
> >
> > There's a new feature request from the community [
> > https://github.com/apache/incubator-xtable/issues/550] that aims to
> > extend Apache XTable’s (Incubating) support for converting Parquet
> > files to modern table formats (like Hudi, Iceberg, and Delta) without
> > requiring data rewriting, enabling continuous metadata addition. This
> > feature has significant potential, especially for systems that produce
> > Parquet files exclusively and need a seamless path to incorporate them
> > into modern table formats.
> >
> > *Proposed Design*
> > The development will involve creating a Parquet source class in XTable,
> > which would handle two main operations:
> >
> > 1. *Retrieve Snapshot*: List all Parquet files under the object storage
> >    or HDFS root path to capture a snapshot. This can be achieved
> >    through a simple list operation.
> >
> > 2. *Retrieve Change Log Since Last Sync*:
> >    - *Using List Files*: The class would retrieve Parquet files added
> >      since the last sync time by filtering on creationTime. While
> >      straightforward, this approach may be expensive due to its
> >      resource demands.
> >    - *Using a Cloud Notifications Queue*: For object stores, setting up
> >      a cloud-based notifications queue is a more efficient solution.
> >      The queue can push file location and creationTime metadata to an
> >      events table, with the Parquet file location as the primary key to
> >      manage duplicates. This events table could be implemented in Hudi,
> >      Delta, or Iceberg, which would allow the XTable Parquet source
> >      class to query the table incrementally and retrieve new files for
> >      metadata generation. For HDFS or object stores that don't support
> >      a queue-based system for file notifications, we would need to
> >      build or re-use an existing queue implementation.
> >
> > I was thinking of the above approach. I'd welcome any inputs/feedback
> > from community members interested in collaborating on the design and
> > implementation of this feature; please respond to this email or join
> > the discussion directly on GitHub. Your input, whether in design
> > suggestions or implementation support, would be appreciated.
> >
> > Thanks
> > Vinish
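For the listing-based path in the quoted proposal, a minimal sketch of the two operations using the Hadoop FileSystem API might look like the following. The class and method names are illustrative only (not the actual XTable source interfaces), and note that Hadoop's FileStatus exposes a modification time rather than a creation time, so modification time is used as a proxy here:

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.LocatedFileStatus;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.fs.RemoteIterator;

    /** Sketch of the two listing operations for the proposed Parquet source. */
    public class ParquetFileLister {

      private final FileSystem fs;
      private final Path rootPath;

      public ParquetFileLister(Path rootPath, Configuration conf) throws IOException {
        this.rootPath = rootPath;
        this.fs = rootPath.getFileSystem(conf);
      }

      /** Operation 1: full snapshot via a recursive listing of the root path. */
      public List<LocatedFileStatus> listSnapshot() throws IOException {
        return listSince(Long.MIN_VALUE);
      }

      /**
       * Operation 2 (listing variant): Parquet files added since the last sync,
       * filtered by modification time as a stand-in for creation time.
       */
      public List<LocatedFileStatus> listSince(long lastSyncMillis) throws IOException {
        List<LocatedFileStatus> result = new ArrayList<>();
        RemoteIterator<LocatedFileStatus> it = fs.listFiles(rootPath, true);
        while (it.hasNext()) {
          LocatedFileStatus status = it.next();
          if (status.isFile()
              && status.getPath().getName().endsWith(".parquet")
              && status.getModificationTime() > lastSyncMillis) {
            result.add(status);
          }
        }
        return result;
      }
    }

The queue-based variant would replace listSince with an incremental read of the events table (or a consumer on the notification queue), which avoids the full recursive listing on every sync.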