Hello folks,

We have a use case to incrementally generate data for a hudi table (say
'table2') by transforming data from another hudi table (say, 'table1'). We
want to atomically store the commit timestamps read from table1 into
table2's commit metadata.
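
For context on the write side: if I understand the DataSource write path
correctly, options prefixed with the commitmeta key prefix get copied into
the commit metadata, so stashing the table1 checkpoint during the table2
write could look roughly like the pyspark sketch below (assuming
'transformed_df' is the DataFrame derived from table1; the option names and
fields are from memory and purely illustrative, so treat them as
assumptions):

    # Rough sketch: write transformed table1 data to table2 and stash the
    # consumed table1 commit timestamp in table2's commit metadata.
    # Assumption: options matching the commitmeta key prefix are copied
    # into the commit metadata; record key / precombine fields below are
    # placeholders.
    checkpoint = "20200412182000"  # last table1 commit timestamp consumed

    (transformed_df.write
        .format("hudi")
        .option("hoodie.table.name", "table2")
        .option("hoodie.datasource.write.recordkey.field", "key")
        .option("hoodie.datasource.write.precombine.field", "ts")
        .option("hoodie.datasource.write.commitmeta.key.prefix", "checkpoint.")
        .option("checkpoint.table1.commit.ts", checkpoint)
        .mode("append")
        .save("/path/to/table2"))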

This is similar to how DeltaStreamer operates with kafka offsets. However,
DeltaStreamer is java code and can easily query the last processed kafka
offsets by creating a metaclient for the target table. We want to use
pyspark, and I don't see a good way to query a table's commit metadata
from the DataSource.
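
To make the gap concrete, the best workaround I see from pyspark today is
to read the timeline files by hand, along the lines of the sketch below
(this assumes completed commits show up as <ts>.commit JSON files under
.hoodie with an 'extraMetadata' map; apologies if the layout details are
slightly off):

    # Workaround sketch: recover the last consumed table1 timestamp from
    # table2's commit metadata by reading .hoodie files directly.
    import json

    # NOTE: the glob errors out if table2 has no completed commits yet.
    commits = spark.sparkContext.wholeTextFiles(
        "/path/to/table2/.hoodie/*.commit").collect()

    start_ts = "000"  # consume from the beginning if nothing is recorded
    if commits:
        # instant file names are timestamps, so lexicographic max is latest
        latest_path, content = max(commits, key=lambda kv: kv[0])
        extra = json.loads(content).get("extraMetadata") or {}
        start_ts = extra.get("checkpoint.table1.commit.ts", start_ts)

This works, but it hardcodes timeline layout details that really belong
inside hudi, which is what motivates the options below.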

I'm considering making one of the changes below to hoodie to make this
easier.

Option 1: Add a new relation in hudi-spark to query commit metadata. This
relation would present a 'metadata view' that can be queried and filtered.
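
From pyspark, usage of such a relation could look roughly like this
(everything in the sketch is illustrative; the query type value and column
names are placeholders, not a concrete proposal):

    # Illustrative only: querying a hypothetical 'metadata view' of table2.
    meta_df = (spark.read
        .format("hudi")
        .option("hoodie.datasource.query.type", "commit_metadata")  # placeholder
        .load("/path/to/table2"))

    # One row per commit, with extra metadata exposed so the checkpoint can
    # be pulled out with plain DataFrame operations.
    latest = (meta_df
        .orderBy(meta_df["commit_time"].desc())  # placeholder column name
        .select("extra_metadata")                # placeholder column name
        .first())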

Option 2: Add new DataSource options on top of incremental querying to
drive the fetch from the source table. For example, users could specify
'hoodie.consume.metadata.table: table2BasePath' and issue an incremental
query on table1. IncrementalRelation would then read table2's commit
metadata first to identify the 'consume.start.timestamp' and start the
incremental read on table1 from that timestamp.
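
In pyspark terms, Option 2 would look roughly like this (again just to
show the shape; I'm writing the incremental query option names from
memory):

    # Illustrative only: incremental read of table1 where the begin instant
    # is resolved from table2's commit metadata by IncrementalRelation.
    incr_df = (spark.read
        .format("hudi")
        .option("hoodie.datasource.query.type", "incremental")
        .option("hoodie.consume.metadata.table", "/path/to/table2")  # proposed
        .load("/path/to/table1"))
    # Note there is no 'hoodie.datasource.read.begin.instanttime' here:
    # IncrementalRelation would derive it from table2's metadata first.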

Option 2 looks simpler to implement, but it seems a bit hacky because we
are reading metadata from table2 when the data source is table1.

Option 1 is a bit more complex, but it is cleaner and not tightly coupled
to incremental reads. For example, use cases other than incremental reads
could leverage the same relation to query metadata.

What do you guys think? Let me know if there are other simpler solutions.
Appreciate any feedback.

Thanks
Satish
