Also please take a look at https://issues.apache.org/jira/browse/HUDI-309.
This was an effort to make the timeline more generalized for querying (for a
different purpose), but good to revisit now.

On Sun, May 31, 2020 at 11:04 PM vbal...@apache.org <vbal...@apache.org> wrote:

> I strongly recommend using a separate datasource relation (option 1) to
> query the timeline. It is elegant and fits well with Spark APIs.
>
> Thanks.
> Balaji.V
>
> On Saturday, May 30, 2020, 01:18:45 PM PDT, Vinoth Chandar
> <vin...@apache.org> wrote:
>
> Hi satish,
>
> Are you looking for similar functionality as HoodieDatasourceHelpers?
>
> We have historically relied on the cli to inspect the table, which does not
> lend itself well to programmatic access. Overall I like option 1 -
> allowing the timeline to be queryable with a standard schema does seem way
> nicer.
>
> I am wondering though if we should introduce a new view. Instead we can use
> a different data source name -
> spark.read.format("hudi-timeline").load(basepath). We can start by just
> allowing querying of the active timeline and expand this to the archive
> timeline?
>
> What do others think?
>
> On Fri, May 29, 2020 at 2:37 PM Satish Kotha <satishko...@uber.com.invalid>
> wrote:
>
> > Hello folks,
> >
> > We have a use case to incrementally generate data for a hudi table (say
> > 'table2') by transforming data from another hudi table (say, table1). We
> > want to atomically store commit timestamps read from table1 into table2
> > commit metadata.
> >
> > This is similar to how DeltaStreamer operates with kafka offsets.
> > However, DeltaStreamer is java code and can easily query the kafka
> > offset processed by creating a metaclient for the target table. We want
> > to use pyspark and I don't see a good way to query commit metadata of
> > table1 from DataSource.
> >
> > I'm considering making one of the below changes to hoodie to make this
> > easier.
> >
> > Option1: Add a new relation in hudi-spark to query commit metadata. This
> > relation would present a 'metadata view' to query and filter metadata.
> >
> > Option2: Add other DataSource options on top of incremental querying to
> > allow fetching from the source table. For example, users can specify
> > 'hoodie.consume.metadata.table: table2BasePath' and issue an incremental
> > query on table1. Then, IncrementalRelation would go read table2 metadata
> > first to identify 'consume.start.timestamp' and start the incremental
> > read on table1 with that timestamp.
> >
> > Option 2 looks simpler to implement. But it seems a bit hacky because we
> > are reading metadata from table2 when the data source is table1.
> >
> > Option1 is a bit more complex. But it is cleaner and not tightly coupled
> > to incremental reads. For example, use cases other than incremental
> > reads can leverage the same relation to query metadata.
> >
> > What do you guys think? Let me know if there are other simpler
> > solutions. Appreciate any feedback.
> >
> > Thanks
> > Satish
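
For reference, the lookup that Option 2 describes (find the last table1 timestamp recorded in table2, then start an incremental read on table1 from there) can be approximated on the client side today. The following is only a rough sketch under several assumptions that are not from the thread: table2 is a copy-on-write table on a local filesystem, completed commits are JSON files named <instant_time>.commit under <base_path>/.hoodie with an "extraMetadata" map, and the job stored its checkpoint there under the hypothetical key 'consume.start.timestamp'. The query-type option key differs between Hudi releases, so check it against the version in use.

```python
import json
import os

# Hypothetical key: the name is borrowed from Option 2 in the thread, and it is
# assumed the job wrote it into table2's commit metadata on each run.
CHECKPOINT_KEY = "consume.start.timestamp"


def latest_checkpoint(base_path, key=CHECKPOINT_KEY):
    """Return the checkpoint stored in the newest completed commit, or None.

    Assumes a local filesystem and that completed commits are JSON files
    named <instant_time>.commit under <base_path>/.hoodie.
    """
    timeline_dir = os.path.join(base_path, ".hoodie")
    # Instant times are fixed-width timestamps, so reverse string order
    # visits the newest commit first. Inflight/requested files are skipped
    # because they do not end with ".commit".
    commits = sorted(
        (f for f in os.listdir(timeline_dir) if f.endswith(".commit")),
        reverse=True,
    )
    for name in commits:
        with open(os.path.join(timeline_dir, name)) as fh:
            metadata = json.load(fh)
        value = (metadata.get("extraMetadata") or {}).get(key)
        if value is not None:
            return value
    return None


def incremental_read_options(begin_instant):
    """Build DataSource options for an incremental read after begin_instant."""
    return {
        # Assumption: older releases name the query-type key differently.
        "hoodie.datasource.query.type": "incremental",
        "hoodie.datasource.read.begin.instanttime": begin_instant,
    }
```

From pyspark the result would be used roughly as spark.read.format("org.apache.hudi").options(**incremental_read_options(ts)).load(table1_path), where ts comes from latest_checkpoint(table2_path). Reading .hoodie files directly bypasses the timeline APIs, which is the kind of coupling a queryable timeline relation (Option 1) would avoid.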