Re: [DISCUSS] querying commit metadata from spark DataSource

Vinoth Chandar Mon, 01 Jun 2020 08:02:30 -0700

Also please take a look at https://issues.apache.org/jira/browse/HUDI-309.


This was an effort to make the timeline more generalized for querying (for
a different purpose), but it is good to revisit now.

On Sun, May 31, 2020 at 11:04 PM vbal...@apache.org <vbal...@apache.org>
wrote:

>
> I strongly recommend using a separate datasource relation (option 1) to
> query timeline. It is elegant and fits well with spark APIs.
> Thanks.
> Balaji.V
>
> On Saturday, May 30, 2020, 01:18:45 PM PDT, Vinoth Chandar
> <vin...@apache.org> wrote:
>
> Hi Satish,
>
> Are you looking for functionality similar to HoodieDataSourceHelpers?
>
> We have historically relied on the CLI to inspect the table, which does not
> lend itself well to programmatic access. Overall, I like option 1:
> allowing the timeline to be queryable with a standard schema does seem way
> nicer.
>
> I am wondering, though, whether we should introduce a new view at all.
> Instead, we could use a different data source name:
> spark.read.format("hudi-timeline").load(basepath). We could start by
> allowing queries on the active timeline only, and expand this to the
> archived timeline later?
>
> What do others think?
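
A rough sketch of the rows such a timeline source might return, written in plain Python rather than as a Spark relation. It assumes the active timeline is stored as files named `<instant>.<action>[.<state>]` under `<basepath>/.hoodie`, and that a bare `<instant>.inflight` file denotes an in-flight commit. The `Instant` type and `list_active_timeline` function are hypothetical names for illustration, not an existing Hudi API.

```python
import os
from typing import List, NamedTuple


class Instant(NamedTuple):
    """One timeline entry (hypothetical schema for illustration)."""
    timestamp: str
    action: str
    state: str


def list_active_timeline(base_path: str) -> List[Instant]:
    """List instants under <base_path>/.hoodie, oldest first.

    The file-name layout is an assumption stated in the lead-in.
    """
    meta_dir = os.path.join(base_path, ".hoodie")
    instants = []
    for name in os.listdir(meta_dir):
        parts = name.split(".")
        # Skip entries that do not start with a numeric instant time,
        # such as hoodie.properties or auxiliary folders.
        if len(parts) < 2 or not parts[0].isdigit():
            continue
        timestamp, action = parts[0], parts[1]
        # No state suffix is taken to mean the action completed.
        state = parts[2].upper() if len(parts) > 2 else "COMPLETED"
        if action == "inflight":
            # Assumed legacy naming for an in-flight commit.
            action, state = "commit", "INFLIGHT"
        instants.append(Instant(timestamp, action, state))
    return sorted(instants)
```

A real relation would expose the same columns as a DataFrame so that they can be filtered with ordinary Spark SQL.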
>
>
>
>
> On Fri, May 29, 2020 at 2:37 PM Satish Kotha <satishko...@uber.com.invalid>
> wrote:
>
> > Hello folks,
> >
> > We have a use case to incrementally generate data for a Hudi table (say,
> > 'table2') by transforming data from another Hudi table (say, 'table1'). We
> > want to atomically store the commit timestamps read from table1 in table2's
> > commit metadata.
> >
> > This is similar to how DeltaStreamer operates with Kafka offsets. However,
> > DeltaStreamer is Java code and can easily query the Kafka offsets processed
> > by creating a metaclient for the target table. We want to use PySpark, and
> > I don't see a good way to query the commit metadata of table1 from the
> > DataSource.
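
For illustration, a checkpoint lookup of this kind could be sketched in plain Python without going through the DataSource. It assumes completed commits are JSON files named `<instant>.commit` under `<basepath>/.hoodie` and that each carries an `extraMetadata` map. The function name and the key passed to it are hypothetical.

```python
import json
import os
from typing import Optional


def latest_extra_metadata_value(base_path: str, key: str) -> Optional[str]:
    """Return the value of `key` from the newest commit that carries it.

    Assumes <base_path>/.hoodie/<instant>.commit files contain JSON with
    an "extraMetadata" object (an assumption about the on-disk layout).
    """
    meta_dir = os.path.join(base_path, ".hoodie")
    # Instant times are fixed-width, so a lexicographic sort is chronological.
    commits = sorted(
        (n for n in os.listdir(meta_dir) if n.endswith(".commit")),
        reverse=True,
    )
    for name in commits:
        with open(os.path.join(meta_dir, name)) as f:
            metadata = json.load(f)
        value = (metadata.get("extraMetadata") or {}).get(key)
        if value is not None:
            return value
    return None
```

Reading files directly like this bypasses the timeline abstractions, which is part of why a proper relation is attractive.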
> >
> > I'm considering making one of the below changes to hoodie to make this
> > easier.
> >
> > Option1: Add new relation in hudi-spark to query commit metadata. This
> > relation would present a 'metadata view' to query and filter metadata.
> >
> > Option2: Add other DataSource options on top of incremental querying to
> > allow fetching from source table. For example, users can specify
> > 'hoodie.consume.metadata.table: table2BasePath'  and issue incremental
> > query on table1. Then, IncrementalRelation would go read table2 metadata
> > first to identify 'consume.start.timestamp' and start incremental read on
> > table1 with that timestamp.
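
As a point of comparison, the flow in Option 2 can already be approximated on the caller's side once the start timestamp is known. The sketch below assumes the incremental read options `hoodie.datasource.query.type` and `hoodie.datasource.read.begin.instanttime` (older releases may use `hoodie.datasource.view.type` instead) and the `hudi` format name. Both function names are hypothetical, and how the start timestamp is obtained from table2 is left to the caller.

```python
from typing import Dict, Optional


def incremental_read_options(begin_instant: Optional[str]) -> Dict[str, str]:
    """Build DataSource options for an incremental read.

    Option names are assumptions; check them against the Hudi release in use.
    With no checkpoint yet, fall back to "000" so every commit is read.
    """
    return {
        "hoodie.datasource.query.type": "incremental",
        "hoodie.datasource.read.begin.instanttime": begin_instant or "000",
    }


def read_new_rows(spark, table1_path: str, begin_instant: Optional[str]):
    """Read rows committed to table1 after `begin_instant`.

    `spark` is an existing SparkSession with the Hudi bundle on its classpath.
    """
    options = incremental_read_options(begin_instant)
    return spark.read.format("hudi").options(**options).load(table1_path)
```

Keeping the checkpoint lookup outside the relation avoids the coupling between table1 reads and table2 metadata.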
> >
> > Option 2 looks simpler to implement, but it seems a bit hacky because we
> > are reading metadata from table2 when the data source is table1.
> >
> > Option 1 is a bit more complex, but it is cleaner and not tightly coupled
> > to incremental reads. For example, use cases other than incremental reads
> > can leverage the same relation to query metadata.
> >
> > What do you guys think? Let me know if there are other simpler solutions.
> > Appreciate any feedback.
> >
> > Thanks
> > Satish
> >
