> Is it to separate data and metadata access?

Correct. We already have modes for querying data using format("hudi"). I
feel it will get very confusing to mix data and metadata in the same
source; e.g. a lot of the options we support for data may not even make
sense for the TimelineRelation.

> This class seems like a list of static methods, I'm not seeing where
> these are accessed from

That's the public API for obtaining this information from Scala/Java Spark.
If you have a way of calling this from Python through some bridge without
painful workarounds (e.g. Jython), that might be a tactical solution that
can meet your needs.

On Mon, Jun 1, 2020 at 5:07 PM Satish Kotha <[email protected]> wrote:

> Thanks for the feedback.
>
> What is the advantage of doing
> spark.read.format("hudi-timeline").load(basepath) as opposed to adding a
> new relation? Is it to separate data and metadata access?
>
> > Are you looking for similar functionality as HoodieDatasourceHelpers?
>
> This class seems like a list of static methods, I'm not seeing where
> these are accessed from. But I need a way to query metadata details
> easily in pyspark.
>
> On Mon, Jun 1, 2020 at 8:02 AM Vinoth Chandar <[email protected]> wrote:
>
> > Also please take a look at
> > https://issues.apache.org/jira/browse/HUDI-309.
> >
> > This was an effort to make the timeline more generalized for querying
> > (for a different purpose).. but good to revisit now..
> >
> > On Sun, May 31, 2020 at 11:04 PM [email protected] <[email protected]>
> > wrote:
> >
> > > I strongly recommend using a separate datasource relation (option 1)
> > > to query the timeline. It is elegant and fits well with Spark APIs.
> > >
> > > Thanks,
> > > Balaji.V
> > >
> > > On Saturday, May 30, 2020, 01:18:45 PM PDT, Vinoth Chandar
> > > <[email protected]> wrote:
> > >
> > > Hi Satish,
> > >
> > > Are you looking for similar functionality as HoodieDatasourceHelpers?
> > >
> > > We have historically relied on the CLI to inspect the table, which
> > > does not lend itself well to programmatic access.
> > > Overall, I like option 1 - allowing the timeline to be queryable
> > > with a standard schema does seem way nicer.
> > >
> > > I am wondering, though, if we should introduce a new view. Instead,
> > > we can use a different data source name -
> > > spark.read.format("hudi-timeline").load(basepath). We can start by
> > > just allowing querying of the active timeline and expand this to the
> > > archive timeline?
> > >
> > > What do others think?
> > >
> > > On Fri, May 29, 2020 at 2:37 PM Satish Kotha
> > > <[email protected]> wrote:
> > >
> > > > Hello folks,
> > > >
> > > > We have a use case to incrementally generate data for a hudi table
> > > > (say 'table2') by transforming data from another hudi table (say,
> > > > 'table1'). We want to atomically store commit timestamps read from
> > > > table1 into table2's commit metadata.
> > > >
> > > > This is similar to how DeltaStreamer operates with Kafka offsets.
> > > > However, DeltaStreamer is Java code and can easily query the Kafka
> > > > offset processed by creating a metaclient for the target table. We
> > > > want to use pyspark, and I don't see a good way to query the commit
> > > > metadata of table1 from the DataSource.
> > > >
> > > > I'm considering making one of the below changes to hoodie to make
> > > > this easier.
> > > >
> > > > Option 1: Add a new relation in hudi-spark to query commit
> > > > metadata. This relation would present a 'metadata view' to query
> > > > and filter metadata.
> > > >
> > > > Option 2: Add other DataSource options on top of incremental
> > > > querying to allow fetching from the source table. For example,
> > > > users can specify 'hoodie.consume.metadata.table: table2BasePath'
> > > > and issue an incremental query on table1. Then, IncrementalRelation
> > > > would first read table2's metadata to identify
> > > > 'consume.start.timestamp' and start an incremental read on table1
> > > > with that timestamp.
> > > >
> > > > Option 2 looks simpler to implement, but seems a bit hacky because
> > > > we are reading metadata from table2 when the data source is table1.
> > > >
> > > > Option 1 is a bit more complex, but it is cleaner and not tightly
> > > > coupled to incremental reads. For example, use cases other than
> > > > incremental reads can leverage the same relation to query metadata.
> > > >
> > > > What do you guys think? Let me know if there are other simpler
> > > > solutions. Appreciate any feedback.
> > > >
> > > > Thanks,
> > > > Satish
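As a stopgap from pyspark before either option exists: the active timeline is plain files under `<basePath>/.hoodie`, so the completed commit instants can be listed without a metaclient. A minimal sketch, assuming the standard layout where each completed commit is a file named `<instant>.commit` (MOR deltacommits and other actions are ignored for brevity); this is an illustration of the layout, not a Hudi API:

```python
import os

def completed_commits(base_path: str):
    """List completed commit instants by scanning the active timeline.

    Assumes the standard Hudi layout where each completed commit is a
    file named '<instant>.commit' under '<base_path>/.hoodie'.
    """
    timeline_dir = os.path.join(base_path, ".hoodie")
    instants = [
        name[: -len(".commit")]
        for name in os.listdir(timeline_dir)
        if name.endswith(".commit")
    ]
    return sorted(instants)

def latest_commit(base_path: str):
    """Return the most recent completed commit instant, or None."""
    instants = completed_commits(base_path)
    return instants[-1] if instants else None
```

For cloud or HDFS storage you would swap `os.listdir` for the corresponding filesystem listing; it also only sees the active timeline, which is one reason a proper queryable relation (option 1) is attractive.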

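To make the DeltaStreamer-style checkpointing discussed above concrete in plain Python: a completed commit file is JSON, and its `extraMetadata` string map is where a writer can stash an application-level checkpoint. The key name below is a hypothetical placeholder (DeltaStreamer uses its own key), so treat this as a sketch of the flow, not Hudi's implementation:

```python
import json

# Hypothetical key under which table2's writer would store the last
# table1 instant it consumed; modeled on DeltaStreamer's checkpointing.
CHECKPOINT_KEY = "table1.consumed.instant"

def next_start_instant(commit_meta_json: str, default: str = "000") -> str:
    """Extract the resume point for the next incremental read of table1
    from table2's latest commit metadata (the JSON body of a .commit file).

    Assumes the metadata carries an 'extraMetadata' string map; '000' is
    used here as an 'earliest' sentinel when no checkpoint exists yet.
    """
    meta = json.loads(commit_meta_json)
    extra = meta.get("extraMetadata") or {}
    return extra.get(CHECKPOINT_KEY, default)
```

This is essentially what option 2's IncrementalRelation change would automate, and what option 1 would let any pyspark job do through a standard DataFrame query instead.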