I strongly recommend using a separate datasource relation (option 1) to query
the timeline. It is elegant and fits well with the Spark APIs.
Thanks,
Balaji.V

On Saturday, May 30, 2020, 01:18:45 PM PDT, Vinoth Chandar
<vin...@apache.org> wrote:
 
 Hi Satish,

Are you looking for similar functionality as HoodieDatasourceHelpers?

We have historically relied on the CLI to inspect the table, which does not
lend itself well to programmatic access. Overall, I like option 1 -
allowing the timeline to be queryable with a standard schema does seem much
nicer.

I am wondering, though, if we should introduce a new view. Instead, we could
use a different data source name -
spark.read.format("hudi-timeline").load(basePath). We could start by just
allowing querying of the active timeline and expand this to the archived
timeline later?
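To make the idea concrete: on disk, the active timeline is just the instant
files under <basePath>/.hoodie, named <instantTime>.<action> (e.g.
20200529143700.commit for a completed commit). A plain-Python sketch of
listing them, with no Spark involved, shows the kind of rows such a
"hudi-timeline" relation could expose. The format name is only a proposal and
the file-naming details here are simplified for illustration:

```python
import os
import tempfile

def list_active_timeline(base_path):
    """Return completed commit instants from a table's .hoodie directory.

    Active-timeline instants are stored as files named <instantTime>.<action>;
    a bare ".commit" suffix marks a completed commit. This layout is
    simplified for illustration.
    """
    meta_dir = os.path.join(base_path, ".hoodie")
    instants = []
    for name in sorted(os.listdir(meta_dir)):
        if name.endswith(".commit"):
            instants.append(name[: -len(".commit")])
    return instants

# Demo against a fake table layout: two completed commits, one inflight.
base = tempfile.mkdtemp()
os.makedirs(os.path.join(base, ".hoodie"))
for f in ("20200529143700.commit", "20200530091500.commit",
          "20200530120000.inflight"):
    open(os.path.join(base, ".hoodie", f), "w").close()

print(list_active_timeline(base))
# → ['20200529143700', '20200530091500'] (the inflight instant is skipped)
```

A datasource relation would surface roughly this information as a DataFrame
with a fixed schema (instant time, action, state), which is what makes it
queryable and filterable from PySpark.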

What do others think?




On Fri, May 29, 2020 at 2:37 PM Satish Kotha <satishko...@uber.com.invalid>
wrote:

> Hello folks,
>
> We have a use case to incrementally generate data for a Hudi table (say,
> 'table2') by transforming data from another Hudi table (say, 'table1'). We
> want to atomically store the commit timestamps read from table1 into
> table2's commit metadata.
>
> This is similar to how DeltaStreamer operates with Kafka offsets. However,
> DeltaStreamer is Java code and can easily query the Kafka offsets processed
> by creating a metaclient for the target table. We want to use PySpark, and
> I don't see a good way to query the commit metadata of table1 from the
> DataSource.
>
> I'm considering making one of the changes below to hoodie to make this
> easier.
>
> Option 1: Add a new relation in hudi-spark to query commit metadata. This
> relation would present a 'metadata view' to query and filter metadata.
>
> Option 2: Add other DataSource options on top of incremental querying to
> allow fetching from the source table. For example, users could specify
> 'hoodie.consume.metadata.table: table2BasePath' and issue an incremental
> query on table1. IncrementalRelation would then read table2's metadata
> first to identify 'consume.start.timestamp' and start the incremental read
> on table1 from that timestamp.
>
> Option 2 looks simpler to implement, but seems a bit hacky because we are
> reading metadata from table2 when the data source is table1.
>
> Option 1 is a bit more complex, but it is cleaner and not tightly coupled
> to incremental reads. For example, use cases other than incremental reads
> could leverage the same relation to query metadata.
>
> What do you guys think? Let me know if there are other simpler solutions.
> Appreciate any feedback.
>
> Thanks
> Satish
>  
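To illustrate the Option 2 flow described above: the checkpoint would live in
table2's latest commit metadata, and the incremental read on table1 would
start from it. Hudi commit metadata is JSON carrying an "extraMetadata" map
(DeltaStreamer uses this mechanism for its Kafka checkpoints). The key name
and simplified file layout below are assumptions for illustration, not the
proposed implementation:

```python
import json
import os
import tempfile

# Illustrative key name; DeltaStreamer stores its checkpoint under a
# similar key in the commit's extraMetadata map.
CHECKPOINT_KEY = "consume.start.timestamp"

def read_last_checkpoint(base_path):
    """Return the checkpoint stored in the latest completed commit.

    Reads the newest <instantTime>.commit JSON file under <basePath>/.hoodie
    and looks up the checkpoint in its "extraMetadata" map. Layout is
    simplified for illustration.
    """
    meta_dir = os.path.join(base_path, ".hoodie")
    commits = sorted(f for f in os.listdir(meta_dir) if f.endswith(".commit"))
    if not commits:
        return None
    with open(os.path.join(meta_dir, commits[-1])) as fh:
        metadata = json.load(fh)
    return metadata.get("extraMetadata", {}).get(CHECKPOINT_KEY)

# Demo: table2 recorded that it last consumed table1 up to 20200529143700.
table2 = tempfile.mkdtemp()
os.makedirs(os.path.join(table2, ".hoodie"))
with open(os.path.join(table2, ".hoodie", "20200530010000.commit"), "w") as fh:
    json.dump({"extraMetadata": {CHECKPOINT_KEY: "20200529143700"}}, fh)

start = read_last_checkpoint(table2)
print(start)  # → 20200529143700
```

With that instant in hand, the incremental query on table1 is the standard
Hudi read with its begin-instant option set to the recovered value; Option 2
would fold this lookup into IncrementalRelation itself, while Option 1 would
expose the same metadata through a queryable relation.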
