Yes, Vinoth, it does go a bit too far to give first-class support to this
data.
A global error table can do the job easily. As we discussed yesterday,
parallel local error tables with an `_errors` suffix could also be
beneficial in some scenarios, such as when different product teams manage
their own tables, or in a B2B case where customers manage their own data.
These scenarios would prefer good segregation of errors and other related
data. Let me note down these points in RFC-20 for further discussion.
Thanks for the feedback!
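
For illustration, a rough sketch of the two layouts (the paths, the
`_errors` suffix, and the column name are placeholders pending RFC-20;
both would be regular hudi tables, as Vinoth suggested):

    # one global error table shared by all pipelines (path hypothetical)
    global_errors = spark.read.format("hudi").load("/warehouse/_hudi_errors")
    # vs. a parallel per-table error table keyed off the data table's path
    local_errors = spark.read.format("hudi").load(base_path + "_errors")
    # e.g. count errors per commit for one team's table
    local_errors.groupBy("commit_time").count().show()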

On Wed, Jun 3, 2020 at 9:31 PM Vinoth Chandar <vin...@apache.org> wrote:

> Hi Raymond,
>
> I am not sure generalizing this to all metadata - like errors and metrics -
> would be a good idea. We can certainly implement logging errors to a common
> errors hudi table, with a certain schema. But these can be just regular
> "hudi" format tables.
>
> Unlike the timeline metadata, these are really external data, not related
> to a given table's core functioning.. we don't necessarily want to keep one
> error table per hudi table..
>
> Thoughts?
>
> On Tue, Jun 2, 2020 at 5:34 PM Shiyan Xu <xu.shiyan.raym...@gmail.com>
> wrote:
>
> > I also encountered use cases where I'd like to programmatically query
> > metadata.
> > +1 on the idea of format("hudi-timeline")
> >
> > I also feel that the metadata can be extended further to include more
> > info like errors, metrics/write statistics, etc. Like the newly proposed
> > error handling, we could also store all metrics or write stats there
> > too, and relate them to the timeline actions.
> >
> > A potential use case could be: with all this info encapsulated within
> > metadata, we may be able to derive some insightful results (by checking
> > against some benchmarks) and answer questions like: does table A need
> > more tuning? does table B exceed its error budget?
> >
> > Programmatic queries of this metadata can help with diagnosing and
> > inspecting many tables. We may need different read formats like
> > format("hudi-errors") or format("hudi-metrics").
> >
> > Sorry, this sidetracked from the original question.. These are really
> > rough, high-level thoughts and may show signs of over-engineering. Would
> > like to hear some feedback. Thanks.
> >
> >
> > On Mon, Jun 1, 2020 at 9:28 PM Satish Kotha
> > <satishko...@uber.com.invalid> wrote:
> >
> > > Got it. I'll look into implementation choices for creating a new data
> > > source. Appreciate all the feedback.
> > >
> > > On Mon, Jun 1, 2020 at 7:53 PM Vinoth Chandar <vin...@apache.org>
> > > wrote:
> > >
> > > > >Is it to separate data and metadata access?
> > > > Correct. We already have modes for querying data using
> > > > format("hudi"). I feel it will get very confusing to mix data and
> > > > metadata in the same source.. e.g., a lot of options we support for
> > > > data may not even make sense for the TimelineRelation.
> > > >
> > > > >This class seems like a list of static methods, I'm not seeing
> > > > >where these are accessed from
> > > > That's the public API for obtaining this information for Scala/Java
> > > > Spark. If you have a way of calling this from Python without painful
> > > > bridges (e.g. Jython), that might be a tactical solution that can
> > > > meet your needs.
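> > > >
> > > > For instance, a rough pyspark sketch going through the built-in
> > > > py4j bridge (assuming the helpers live at
> > > > org.apache.hudi.HoodieDataSourceHelpers and expose a static
> > > > latestCommit(fs, basePath), as in the Scala/Java API):
> > > >
> > > >     # hop into the JVM that backs the pyspark session
> > > >     jvm = spark._jvm
> > > >     fs = jvm.org.apache.hadoop.fs.FileSystem.get(
> > > >         spark._jsc.hadoopConfiguration())
> > > >     latest = jvm.org.apache.hudi.HoodieDataSourceHelpers.latestCommit(
> > > >         fs, base_path)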
> > > >
> > > > On Mon, Jun 1, 2020 at 5:07 PM Satish Kotha
> > > > <satishko...@uber.com.invalid> wrote:
> > > >
> > > > > Thanks for the feedback.
> > > > >
> > > > > What is the advantage of doing
> > > > > spark.read.format("hudi-timeline").load(basepath) as opposed to
> > > > > adding a new relation? Is it to separate data and metadata access?
> > > > >
> > > > > Are you looking for similar functionality as
> > > > > HoodieDatasourceHelpers?
> > > > >
> > > > > This class seems like a list of static methods, I'm not seeing
> > > > > where these are accessed from. But I need a way to query metadata
> > > > > details easily in pyspark.
> > > > >
> > > > >
> > > > > On Mon, Jun 1, 2020 at 8:02 AM Vinoth Chandar <vin...@apache.org>
> > > > > wrote:
> > > > >
> > > > > > Also please take a look at
> > > > > > https://issues.apache.org/jira/browse/HUDI-309.
> > > > > >
> > > > > > This was an effort to make the timeline more generalized for
> > > > > > querying (for a different purpose).. but good to revisit now..
> > > > > >
> > > > > > On Sun, May 31, 2020 at 11:04 PM vbal...@apache.org
> > > > > > <vbal...@apache.org> wrote:
> > > > > >
> > > > > > >
> > > > > > > I strongly recommend using a separate datasource relation
> > > > > > > (option 1) to query timeline. It is elegant and fits well with
> > > > > > > Spark APIs.
> > > > > > > Thanks,
> > > > > > > Balaji V
> > > > > > >
> > > > > > > On Saturday, May 30, 2020, 01:18:45 PM PDT, Vinoth Chandar
> > > > > > > <vin...@apache.org> wrote:
> > > > > > >
> > > > > > > Hi Satish,
> > > > > > >
> > > > > > > Are you looking for similar functionality as
> > > > > > > HoodieDatasourceHelpers?
> > > > > > >
> > > > > > > We have historically relied on the CLI to inspect the table,
> > > > > > > which does not lend itself well to programmatic access..
> > > > > > > overall I like option 1 - allowing the timeline to be
> > > > > > > queryable with a standard schema does seem way nicer.
> > > > > > >
> > > > > > > I am wondering though if we should introduce a new view.
> > > > > > > Instead, we can use a different data source name -
> > > > > > > spark.read.format("hudi-timeline").load(basepath). We can
> > > > > > > start by just allowing querying of the active timeline and
> > > > > > > expand this to the archived timeline?
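> > > > > > >
> > > > > > > Roughly, usage could look like this (the "hudi-timeline"
> > > > > > > source and its column names are the proposal here, not an
> > > > > > > existing API):
> > > > > > >
> > > > > > >     # query the active timeline of a table as a DataFrame
> > > > > > >     timeline = spark.read.format("hudi-timeline").load(base_path)
> > > > > > >     # e.g. latest completed instant (column names assumed)
> > > > > > >     timeline.filter("state = 'COMPLETED'") \
> > > > > > >             .orderBy("timestamp", ascending=False).show(1)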
> > > > > > >
> > > > > > > What do others think?
> > > > > > >
> > > > > > >
> > > > > > > On Fri, May 29, 2020 at 2:37 PM Satish Kotha
> > > > > > > <satishko...@uber.com.invalid> wrote:
> > > > > > >
> > > > > > > > Hello folks,
> > > > > > > >
> > > > > > > > We have a use case to incrementally generate data for a
> > > > > > > > hudi table (say 'table2') by transforming data from another
> > > > > > > > hudi table (say 'table1'). We want to atomically store the
> > > > > > > > commit timestamps read from table1 into table2's commit
> > > > > > > > metadata.
> > > > > > > >
> > > > > > > > This is similar to how DeltaStreamer operates with kafka
> > > > > > > > offsets. However, DeltaStreamer is Java code and can easily
> > > > > > > > query the kafka offsets processed by creating a metaclient
> > > > > > > > for the target table. We want to use pyspark, and I don't
> > > > > > > > see a good way to query the commit metadata of table1 from
> > > > > > > > the DataSource.
> > > > > > > >
> > > > > > > > I'm considering making one of the below changes to hoodie
> > > > > > > > to make this easier.
> > > > > > > >
> > > > > > > > Option 1: Add a new relation in hudi-spark to query commit
> > > > > > > > metadata. This relation would present a 'metadata view' to
> > > > > > > > query and filter metadata.
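> > > > > > > >
> > > > > > > > To illustrate the intended pyspark usage (the format name
> > > > > > > > and column names are placeholders for whatever schema the
> > > > > > > > relation ends up exposing):
> > > > > > > >
> > > > > > > >     # recover the last table1 timestamp stored in table2's
> > > > > > > >     # commit metadata (all names here are assumptions)
> > > > > > > >     meta = spark.read.format("hudi-timeline").load(table2_path)
> > > > > > > >     row = meta.orderBy("timestamp", ascending=False).first()
> > > > > > > >     last_read_ts = row["extraMetadata"]["table1.read.ts"]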
> > > > > > > >
> > > > > > > > Option 2: Add more DataSource options on top of incremental
> > > > > > > > querying to allow fetching from the source table. For
> > > > > > > > example, users can specify
> > > > > > > > 'hoodie.consume.metadata.table: table2BasePath' and issue an
> > > > > > > > incremental query on table1. Then, IncrementalRelation would
> > > > > > > > go read table2's metadata first to identify
> > > > > > > > 'consume.start.timestamp' and start the incremental read on
> > > > > > > > table1 with that timestamp.
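> > > > > > > >
> > > > > > > > Roughly, the read could look like this (the
> > > > > > > > 'hoodie.consume.metadata.table' option is the proposal here,
> > > > > > > > not an existing config; the incremental options follow the
> > > > > > > > usual pattern):
> > > > > > > >
> > > > > > > >     df = (spark.read.format("hudi")
> > > > > > > >           .option("hoodie.datasource.query.type", "incremental")
> > > > > > > >           # proposed: derive the begin instant from table2
> > > > > > > >           .option("hoodie.consume.metadata.table", table2_path)
> > > > > > > >           .load(table1_path))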
> > > > > > > >
> > > > > > > > Option 2 looks simpler to implement, but it seems a bit
> > > > > > > > hacky because we are reading metadata from table2 when the
> > > > > > > > data source is table1.
> > > > > > > >
> > > > > > > > Option 1 is a bit more complex, but it is cleaner and not
> > > > > > > > tightly coupled to incremental reads. For example, use cases
> > > > > > > > other than incremental reads can leverage the same relation
> > > > > > > > to query metadata.
> > > > > > > >
> > > > > > > > What do you guys think? Let me know if there are other
> > > > > > > > simpler solutions. Appreciate any feedback.
> > > > > > > >
> > > > > > > > Thanks
> > > > > > > > Satish
> > > > > > > >