Re: [DISCUSS] querying commit metadata from spark DataSource

2020-06-01 Thread Satish Kotha
Got it. I'll look into implementation choices for creating a new data
source. Appreciate all the feedback.
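
To make the idea concrete, here is roughly what I would hope to write from
pyspark once such a source exists. This is purely hypothetical: the
"hudi-timeline" format name comes from this thread, and the column names are
placeholders, not a shipped API.

```python
# Hypothetical usage of the proposed timeline data source from PySpark.
# Nothing below exists yet; the schema shown is illustrative only.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("timeline-demo").getOrCreate()

# Load table1's timeline (commits, cleans, compactions, ...) as a DataFrame
timeline = (spark.read
            .format("hudi-timeline")  # proposed source name from this thread
            .load("s3://bucket/path/to/table1"))

# e.g., find commits completed after a known checkpoint
(timeline
 .filter("action = 'commit' AND state = 'COMPLETED'")
 .filter("commit_time > '20200529143000'")
 .select("commit_time", "extra_metadata")
 .show(truncate=False))
```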

On Mon, Jun 1, 2020 at 7:53 PM Vinoth Chandar  wrote:

> >Is it to separate data and metadata access?
> Correct. We already have modes for querying data using format("hudi"). I
> feel it will get very confusing to mix data and metadata in the same
> source.. e.g., a lot of options we support for data may not even make
> sense for the TimelineRelation.
>
> >This class seems like a list of static methods, I'm not seeing where these
> are accessed from
> That's the public API for obtaining this information for Scala/Java Spark.
> If you have a way of calling this from python without painful bridges
> (e.g., Jython), that might be a tactical solution that can meet your
> needs.
>
> On Mon, Jun 1, 2020 at 5:07 PM Satish Kotha 
> wrote:
>
> > Thanks for the feedback.
> >
> > What is the advantage of doing
> > spark.read.format(“hudi-timeline”).load(basepath) as opposed to adding a new
> > relation? Is it to separate data and metadata access?
> >
> > Are you looking for similar functionality as HoodieDatasourceHelpers?
> > >
> > This class seems like a list of static methods; I'm not seeing where
> > these are accessed from. But I need a way to query metadata details
> > easily in pyspark.
> >
> >
> > On Mon, Jun 1, 2020 at 8:02 AM Vinoth Chandar  wrote:
> >
> > > Also please take a look at
> > > https://issues.apache.org/jira/browse/HUDI-309.
> > >
> > > This was an effort to make the timeline more generalized for querying
> > > (for a different purpose).. but good to revisit now..
> > >
> > > On Sun, May 31, 2020 at 11:04 PM vbal...@apache.org wrote:
> > >
> > > >
> > > > I strongly recommend using a separate datasource relation (option 1)
> > > > to query the timeline. It is elegant and fits well with spark APIs.
> > > > Thanks,
> > > > Balaji.V
> > > >
> > > > On Saturday, May 30, 2020, 01:18:45 PM PDT, Vinoth Chandar wrote:
> > > >
> > > >  Hi Satish,
> > > >
> > > > Are you looking for similar functionality as HoodieDatasourceHelpers?
> > > >
> > > > We have historically relied on the CLI to inspect the table, which
> > > > does not lend itself well to programmatic access.. overall I like
> > > > option 1 - allowing the timeline to be queryable with a standard
> > > > schema does seem way nicer.
> > > >
> > > > I am wondering though if we should introduce a new view. Instead we
> > > > can use a different data source name -
> > > > spark.read.format(“hudi-timeline”).load(basepath). We can start by
> > > > just allowing querying of the active timeline and expand this to the
> > > > archive timeline?
> > > >
> > > > What do others think?
> > > >
> > > >
> > > >
> > > >
> > > > On Fri, May 29, 2020 at 2:37 PM Satish Kotha wrote:
> > > >
> > > > > Hello folks,
> > > > >
> > > > > We have a use case to incrementally generate data for a hudi table
> > > > > (say 'table2') by transforming data from another hudi table (say,
> > > > > table1). We want to atomically store commit timestamps read from
> > > > > table1 into table2 commit metadata.
> > > > >
> > > > > This is similar to how DeltaStreamer operates with kafka offsets.
> > > > > However, DeltaStreamer is java code and can easily query the kafka
> > > > > offsets processed by creating a metaclient for the target table. We
> > > > > want to use pyspark and I don't see a good way to query commit
> > > > > metadata of table1 from the DataSource.
> > > > >
> > > > > I'm considering making one of the below changes to hoodie to make
> > > > > this easier.
> > > > >
> > > > > Option 1: Add a new relation in hudi-spark to query commit
> > > > > metadata. This relation would present a 'metadata view' to query
> > > > > and filter metadata.
> > > > >
> > > > > Option 2: Add other DataSource options on top of incremental
> > > > > querying to allow fetching from the source table. For example,
> > > > > users can specify 'hoodie.consume.metadata.table: table2BasePath'
> > > > > and issue an incremental query on table1. Then, IncrementalRelation
> > > > > would read table2 metadata first to identify
> > > > > 'consume.start.timestamp' and start the incremental read on table1
> > > > > with that timestamp.
> > > > >
> > > > > Option 2 looks simpler to implement. But it seems a bit hacky
> > > > > because we are reading metadata from table2 when the data source is
> > > > > table1.
> > > > >
> > > > > Option 1 is a bit more complex. But it is cleaner and not tightly
> > > > > coupled to incremental reads. For example, use cases other than
> > > > > incremental reads can leverage the same relation to query metadata.
> > > > >
> > > > > What do you guys think? Let me know if there are other simpler
> > > > > solutions. Appreciate any feedback.
> > > > >
> > > > > Thanks
> > > > > Satish

Re: [DISCUSS] querying commit metadata from spark DataSource

2020-06-01 Thread Vinoth Chandar
>Is it to separate data and metadata access?
Correct. We already have modes for querying data using format("hudi"). I
feel it will get very confusing to mix data and metadata in the same
source.. e.g., a lot of options we support for data may not even make
sense for the TimelineRelation.
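
For contrast, here is roughly what the existing data modes look like in
pyspark (a sketch of an incremental query; the exact option keys may differ
by release, so verify against DataSourceReadOptions):

```python
# Existing format("hudi") usage returns data rows, not timeline metadata.
# Assumes an active SparkSession `spark`; option keys below are from memory,
# so check DataSourceReadOptions for your version before relying on them.
incr = (spark.read.format("hudi")
        .option("hoodie.datasource.query.type", "incremental")
        .option("hoodie.datasource.read.begin.instanttime", "20200529143000")
        .load("s3://bucket/path/to/table1"))

# _hoodie_commit_time is a standard hudi meta column; "uuid" is an example field
incr.select("_hoodie_commit_time", "uuid").show()
```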

>This class seems like a list of static methods, I'm not seeing where these
are accessed from
That's the public API for obtaining this information for Scala/Java Spark.
If you have a way of calling this from python without painful bridges
(e.g., Jython), that might be a tactical solution that can meet your
needs.
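
As a concrete example of such a bridge, something like this should work from
pyspark via py4j today, assuming the hudi-spark bundle is on the classpath.
The class is spelled HoodieDataSourceHelpers in the codebase, and the method
names below are from memory, so verify them before relying on this.

```python
# Tactical workaround: call the Java helper class through PySpark's py4j
# gateway instead of waiting for a first-class Python API.
# Assumes an active SparkSession `spark`.
jvm = spark._jvm
fs = jvm.org.apache.hadoop.fs.FileSystem.get(spark._jsc.hadoopConfiguration())

helpers = jvm.org.apache.hudi.HoodieDataSourceHelpers
base_path = "s3://bucket/path/to/table1"

print("latest commit:", helpers.latestCommit(fs, base_path))
print("commits since checkpoint:",
      helpers.listCommitsSince(fs, base_path, "20200529143000"))
```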

On Mon, Jun 1, 2020 at 5:07 PM Satish Kotha 
wrote:

> Thanks for the feedback.
>
> What is the advantage of doing
> spark.read.format(“hudi-timeline”).load(basepath) as opposed to adding a new
> relation? Is it to separate data and metadata access?
>
> Are you looking for similar functionality as HoodieDatasourceHelpers?
> >
> This class seems like a list of static methods; I'm not seeing where these
> are accessed from. But I need a way to query metadata details easily
> in pyspark.
>
>
> On Mon, Jun 1, 2020 at 8:02 AM Vinoth Chandar  wrote:
>
> > Also please take a look at
> > https://issues.apache.org/jira/browse/HUDI-309.
> >
> > This was an effort to make the timeline more generalized for querying
> > (for a different purpose).. but good to revisit now..
> >
> > On Sun, May 31, 2020 at 11:04 PM vbal...@apache.org 
> > wrote:
> >
> > >
> > > I strongly recommend using a separate datasource relation (option 1) to
> > > query the timeline. It is elegant and fits well with spark APIs.
> > > Thanks,
> > > Balaji.V
> > >
> > > On Saturday, May 30, 2020, 01:18:45 PM PDT, Vinoth Chandar wrote:
> > >
> > >  Hi Satish,
> > >
> > > Are you looking for similar functionality as HoodieDatasourceHelpers?
> > >
> > > We have historically relied on the CLI to inspect the table, which does
> > > not lend itself well to programmatic access.. overall I like option 1 -
> > > allowing the timeline to be queryable with a standard schema does seem
> > > way nicer.
> > >
> > > I am wondering though if we should introduce a new view. Instead we can
> > > use a different data source name -
> > > spark.read.format(“hudi-timeline”).load(basepath). We can start by just
> > > allowing querying of the active timeline and expand this to the archive
> > > timeline?
> > >
> > > What do others think?
> > >
> > >
> > >
> > >
> > > On Fri, May 29, 2020 at 2:37 PM Satish Kotha wrote:
> > >
> > > > Hello folks,
> > > >
> > > > We have a use case to incrementally generate data for a hudi table
> > > > (say 'table2') by transforming data from another hudi table (say,
> > > > table1). We want to atomically store commit timestamps read from
> > > > table1 into table2 commit metadata.
> > > >
> > > > This is similar to how DeltaStreamer operates with kafka offsets.
> > > > However, DeltaStreamer is java code and can easily query the kafka
> > > > offsets processed by creating a metaclient for the target table. We
> > > > want to use pyspark and I don't see a good way to query commit
> > > > metadata of table1 from the DataSource.
> > > >
> > > > I'm considering making one of the below changes to hoodie to make
> > > > this easier.
> > > >
> > > > Option 1: Add a new relation in hudi-spark to query commit metadata.
> > > > This relation would present a 'metadata view' to query and filter
> > > > metadata.
> > > >
> > > > Option 2: Add other DataSource options on top of incremental querying
> > > > to allow fetching from the source table. For example, users can
> > > > specify 'hoodie.consume.metadata.table: table2BasePath' and issue an
> > > > incremental query on table1. Then, IncrementalRelation would read
> > > > table2 metadata first to identify 'consume.start.timestamp' and start
> > > > the incremental read on table1 with that timestamp.
> > > >
> > > > Option 2 looks simpler to implement. But it seems a bit hacky because
> > > > we are reading metadata from table2 when the data source is table1.
> > > >
> > > > Option 1 is a bit more complex. But it is cleaner and not tightly
> > > > coupled to incremental reads. For example, use cases other than
> > > > incremental reads can leverage the same relation to query metadata.
> > > >
> > > > What do you guys think? Let me know if there are other simpler
> > > > solutions.
> > > > Appreciate any feedback.
> > > >
> > > > Thanks
> > > > Satish
> > > >
> >
>


Re: [DISCUSS] querying commit metadata from spark DataSource

2020-06-01 Thread Satish Kotha
Thanks for the feedback.

What is the advantage of doing
spark.read.format(“hudi-timeline”).load(basepath) as opposed to adding a new
relation? Is it to separate data and metadata access?

Are you looking for similar functionality as HoodieDatasourceHelpers?
>
This class seems like a list of static methods; I'm not seeing where these
are accessed from. But I need a way to query metadata details easily
in pyspark.
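
The closest pyspark-only workaround I have found is reading the commit files
under the .hoodie folder directly as JSON. That leans on internal file layout
which could change, so treat this as a stopgap sketch, not a supported API:

```python
# Stopgap: active-timeline commit metadata is stored as pretty-printed JSON
# files named <instant>.commit under basePath/.hoodie (internal layout; may
# change between releases). Assumes an active SparkSession `spark`.
commits = (spark.read
           .option("multiLine", "true")
           .json("s3://bucket/path/to/table1/.hoodie/*.commit"))

# extraMetadata is where writers can stash checkpoint-style key/values
commits.select("extraMetadata").show(truncate=False)
```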


On Mon, Jun 1, 2020 at 8:02 AM Vinoth Chandar  wrote:

> Also please take a look at
> https://issues.apache.org/jira/browse/HUDI-309.
>
> This was an effort to make the timeline more generalized for querying (for
> a different purpose).. but good to revisit now..
>
> On Sun, May 31, 2020 at 11:04 PM vbal...@apache.org 
> wrote:
>
> >
> > I strongly recommend using a separate datasource relation (option 1) to
> > query the timeline. It is elegant and fits well with spark APIs.
> > Thanks,
> > Balaji.V
> >
> > On Saturday, May 30, 2020, 01:18:45 PM PDT, Vinoth Chandar wrote:
> >
> >  Hi Satish,
> >
> > Are you looking for similar functionality as HoodieDatasourceHelpers?
> >
> > We have historically relied on the CLI to inspect the table, which does
> > not lend itself well to programmatic access.. overall I like option 1 -
> > allowing the timeline to be queryable with a standard schema does seem
> > way nicer.
> >
> > I am wondering though if we should introduce a new view. Instead we can
> > use a different data source name -
> > spark.read.format(“hudi-timeline”).load(basepath). We can start by just
> > allowing querying of the active timeline and expand this to the archive
> > timeline?
> >
> > What do others think?
> >
> >
> >
> >
> > On Fri, May 29, 2020 at 2:37 PM Satish Kotha wrote:
> >
> > > Hello folks,
> > >
> > > We have a use case to incrementally generate data for a hudi table (say
> > > 'table2') by transforming data from another hudi table (say, table1).
> > > We want to atomically store commit timestamps read from table1 into
> > > table2 commit metadata.
> > >
> > > This is similar to how DeltaStreamer operates with kafka offsets.
> > > However, DeltaStreamer is java code and can easily query the kafka
> > > offsets processed by creating a metaclient for the target table. We
> > > want to use pyspark and I don't see a good way to query commit metadata
> > > of table1 from the DataSource.
> > >
> > > I'm considering making one of the below changes to hoodie to make this
> > > easier.
> > >
> > > Option 1: Add a new relation in hudi-spark to query commit metadata. This
> > > relation would present a 'metadata view' to query and filter metadata.
> > >
> > > Option 2: Add other DataSource options on top of incremental querying to
> > > allow fetching from the source table. For example, users can specify
> > > 'hoodie.consume.metadata.table: table2BasePath' and issue an incremental
> > > query on table1. Then, IncrementalRelation would read table2 metadata
> > > first to identify 'consume.start.timestamp' and start the incremental
> > > read on table1 with that timestamp.
> > >
> > > Option 2 looks simpler to implement. But it seems a bit hacky because we
> > > are reading metadata from table2 when the data source is table1.
> > >
> > > Option 1 is a bit more complex. But it is cleaner and not tightly
> > > coupled to incremental reads. For example, use cases other than
> > > incremental reads can leverage the same relation to query metadata.
> > >
> > > What do you guys think? Let me know if there are other simpler
> > > solutions.
> > > Appreciate any feedback.
> > >
> > > Thanks
> > > Satish
> > >
>


Re: [DISCUSS] Trigger a Travis-CI rebuild without pushing a commit

2020-06-01 Thread Vinoth Chandar
Great! I left some comments on the PR around licensing and maintenance
overhead.

On Sun, May 31, 2020 at 11:51 PM Lamber Ken  wrote:

> Hi folks,
>
> I learned from the Travis and GitHub Actions API docs these days and used
> my project as a demo[1]. The demo pull request will always fail; comment
> "rerun tests" and it will rerun the tests automatically.
>
> If you are interested, try it.
>
> Best,
> Lamber-Ken
>
> [1] https://github.com/lamber-ken/hdocs/pull/36
>
>
> On 2020/05/27 06:08:05, Lamber Ken  wrote:
> > Dear community,
> >
> > Use case: A build fails due to an externality. The source is actually
> correct. It would build OK and pass if simply re-run. Is there some way to
> nudge Travis-CI to do another build, other than pushing a "dummy" commit?
> >
> > The way I often use is `git commit --allow-empty -m 'trigger rebuild'`:
> > push a dummy commit and Travis will rebuild. I also noticed some Apache
> > projects support this feature.
> >
> > For example:
> > 1. Carbondata uses "retest this please"
> > https://github.com/apache/carbondata/pull/3387
> >
> > 2. Bookkeeper uses "run pr validation"
> > https://github.com/apache/bookkeeper/pull/2158
> >
> > But I can't find an effective solution in the GitHub and Travis
> > documentation[1]. Any thoughts or opinions?
> >
> > Best,
> > Lamber-Ken
> >
> > [1] https://docs.travis-ci.com, https://support.github.com
> >
>


Re: [DISCUSS] querying commit metadata from spark DataSource

2020-06-01 Thread Vinoth Chandar
Also please take a look at https://issues.apache.org/jira/browse/HUDI-309.

This was an effort to make the timeline more generalized for querying (for
a different purpose).. but good to revisit now..

On Sun, May 31, 2020 at 11:04 PM vbal...@apache.org 
wrote:

>
> I strongly recommend using a separate datasource relation (option 1) to
> query the timeline. It is elegant and fits well with spark APIs.
> Thanks,
> Balaji.V
>
> On Saturday, May 30, 2020, 01:18:45 PM PDT, Vinoth Chandar wrote:
>
>  Hi Satish,
>
> Are you looking for similar functionality as HoodieDatasourceHelpers?
>
> We have historically relied on the CLI to inspect the table, which does not
> lend itself well to programmatic access.. overall I like option 1 -
> allowing the timeline to be queryable with a standard schema does seem way
> nicer.
>
> I am wondering though if we should introduce a new view. Instead we can use
> a different data source name -
> spark.read.format(“hudi-timeline”).load(basepath). We can start by just
> allowing querying of the active timeline and expand this to the archive timeline?
>
> What do others think?
>
>
>
>
> On Fri, May 29, 2020 at 2:37 PM Satish Kotha wrote:
>
> > Hello folks,
> >
> > We have a use case to incrementally generate data for a hudi table (say
> > 'table2') by transforming data from another hudi table (say, table1). We
> > want to atomically store commit timestamps read from table1 into table2
> > commit metadata.
> >
> > This is similar to how DeltaStreamer operates with kafka offsets. However,
> > DeltaStreamer is java code and can easily query the kafka offsets
> > processed by creating a metaclient for the target table. We want to use
> > pyspark and I don't see a good way to query commit metadata of table1
> > from the DataSource.
> >
> > I'm considering making one of the below changes to hoodie to make this
> > easier.
> >
> > Option 1: Add a new relation in hudi-spark to query commit metadata. This
> > relation would present a 'metadata view' to query and filter metadata.
> >
> > Option 2: Add other DataSource options on top of incremental querying to
> > allow fetching from the source table. For example, users can specify
> > 'hoodie.consume.metadata.table: table2BasePath' and issue an incremental
> > query on table1. Then, IncrementalRelation would read table2 metadata
> > first to identify 'consume.start.timestamp' and start the incremental
> > read on table1 with that timestamp.
> >
> > Option 2 looks simpler to implement. But it seems a bit hacky because we
> > are reading metadata from table2 when the data source is table1.
> >
> > Option 1 is a bit more complex. But it is cleaner and not tightly coupled
> > to incremental reads. For example, use cases other than incremental reads
> > can leverage the same relation to query metadata.
> >
> > What do you guys think? Let me know if there are other simpler solutions.
> > Appreciate any feedback.
> >
> > Thanks
> > Satish
> >


Re: How to extend the timeline server schema to accommodate business metadata

2020-06-01 Thread Vinoth Chandar
Hi Mario,

Thanks for the detailed explanation. Hudi already allows extra metadata to
be written atomically with each commit, i.e., write operation. In fact, that
is how we track checkpoints for our DeltaStreamer tool.. It may not solve
the need for querying the data together with this information, but it gives
you the ability to do some basic tagging.. if that's useful.
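
A minimal sketch of that tagging from pyspark follows; the commitmeta
key-prefix option name is from memory and the '_checkpoint' keys are made up
for illustration, so verify against DataSourceWriteOptions:

```python
# Write options under the commit-metadata key prefix are stored atomically
# in the commit's extraMetadata along with the data (assumed behavior).
# Assumes `df` is an existing DataFrame with the fields used below.
(df.write.format("hudi")
   .option("hoodie.table.name", "table2")
   .option("hoodie.datasource.write.recordkey.field", "uuid")  # example field
   .option("hoodie.datasource.write.precombine.field", "ts")   # example field
   .option("hoodie.datasource.write.commitmeta.key.prefix", "_checkpoint")
   .option("_checkpoint.table1.commit.time", "20200601182030") # our tag
   .mode("append")
   .save("s3://bucket/path/to/table2"))
```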

>>If we enable the timeline service metadata model to be extended we could
use the service instance itself to support specialised queries that involve
business qualifiers in order to return a proper set of metadata pointing to
the related commits

This is a good idea actually.. There is another active discuss thread on
making the metadata queryable.. there is also
https://issues.apache.org/jira/browse/HUDI-309 which we paused for now..
But that's more in line with what you are thinking IIUC


Thanks
vinoth

On Mon, Jun 1, 2020 at 4:41 AM Mario de Sá Vera  wrote:

> Hi Balaji,
>
> business metadata are all types of info related to the business where the
> Hudi solution is being used... from a COB (i.e., close of business date)
> related to that commit to any qualifier related to that commit that might
> be useful to be associated with that commit id. If we enable the timeline
> service metadata model to be extended we could use the service instance
> itself to support specialised queries that involve business qualifiers in
> order to return a proper set of metadata pointing to the related commits
> that answer a business query.
>
> If we do not have that flexibility, we might end up creating an external
> transaction log, and then comes the hard task of keeping that service in
> sync with the timeline service.
>
> let me know if that makes sense to you,
>
> Mario.
>
> On Mon, Jun 1, 2020 at 06:55, Balaji Varadarajan wrote:
>
> >  Hi Mario,
> > The Timeline Server was designed to serve hudi metadata for Hudi writers
> > and readers. It may not be suitable for serving arbitrary data. But it is
> > an interesting thought. Can you elaborate more on what kind of business
> > metadata you are looking for? Is this something you are planning to store
> > in commit files?
> > Balaji.V
> >
> > On Sunday, May 31, 2020, 04:22:27 PM PDT, Mario de Sá Vera <
> > desav...@gmail.com> wrote:
> >
> >  I see a need for extending the current timeline server schema so that a
> > flexible model could be achieved in order to accommodate business
> metadata.
> >
> > let me know if that makes sense to anyone here...
> >
> > Regards,
> >
> > Mario.
> >
>


Re: How to extend the timeline server schema to accommodate business metadata

2020-06-01 Thread Mario de Sá Vera
Hi Balaji,

business metadata are all types of info related to the business where the
Hudi solution is being used... from a COB (i.e., close of business date)
related to that commit to any qualifier related to that commit that might
be useful to be associated with that commit id. If we enable the timeline
service metadata model to be extended we could use the service instance
itself to support specialised queries that involve business qualifiers in
order to return a proper set of metadata pointing to the related commits
that answer a business query.

If we do not have that flexibility, we might end up creating an external
transaction log, and then comes the hard task of keeping that service in
sync with the timeline service.

let me know if that makes sense to you,

Mario.

On Mon, Jun 1, 2020 at 06:55, Balaji Varadarajan wrote:

>  Hi Mario,
> The Timeline Server was designed to serve hudi metadata for Hudi writers
> and readers. It may not be suitable for serving arbitrary data. But it is
> an interesting thought. Can you elaborate more on what kind of business
> metadata you are looking for? Is this something you are planning to store
> in commit files?
> Balaji.V
>
> On Sunday, May 31, 2020, 04:22:27 PM PDT, Mario de Sá Vera <
> desav...@gmail.com> wrote:
>
>  I see a need for extending the current timeline server schema so that a
> flexible model could be achieved in order to accommodate business metadata.
>
> let me know if that makes sense to anyone here...
>
> Regards,
>
> Mario.
>


Re: [DISCUSS] Trigger a Travis-CI rebuild without pushing a commit

2020-06-01 Thread Lamber Ken
Hi folks,

I learned from the Travis and GitHub Actions API docs these days and used my
project as a demo[1]. The demo pull request will always fail; comment
"rerun tests" and it will rerun the tests automatically.

If you are interested, try it.

Best,
Lamber-Ken

[1] https://github.com/lamber-ken/hdocs/pull/36
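
For reference, the API call behind such a trigger is roughly the following
(Travis API v3 from memory, with placeholder token and build id, so
double-check the endpoint docs):

```python
# Restart a Travis build via API v3 instead of pushing a dummy commit.
import requests

TRAVIS_TOKEN = "..."   # placeholder: your Travis API token
BUILD_ID = "12345678"  # placeholder: numeric id from the build page URL

resp = requests.post(
    "https://api.travis-ci.com/build/%s/restart" % BUILD_ID,
    headers={"Travis-API-Version": "3",
             "Authorization": "token %s" % TRAVIS_TOKEN})
print(resp.status_code, resp.text)
```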


On 2020/05/27 06:08:05, Lamber Ken  wrote: 
> Dear community,
> 
> Use case: A build fails due to an externality. The source is actually 
> correct. It would build OK and pass if simply re-run. Is there some way to 
> nudge Travis-CI to do another build, other than pushing a "dummy" commit?
> 
> The way I often use is `git commit --allow-empty -m 'trigger rebuild'`:
> push a dummy commit and Travis will rebuild. I also noticed some Apache
> projects support this feature.
> 
> For example:
> 1. Carbondata uses "retest this please"
> https://github.com/apache/carbondata/pull/3387
> 
> 2. Bookkeeper uses "run pr validation"
> https://github.com/apache/bookkeeper/pull/2158
> 
> But I can't find an effective solution in the GitHub and Travis
> documentation[1]. Any thoughts or opinions?
> 
> Best,
> Lamber-Ken
> 
> [1] https://docs.travis-ci.com, https://support.github.com
> 


Re: [DISCUSS] querying commit metadata from spark DataSource

2020-06-01 Thread vbal...@apache.org
 
I strongly recommend using a separate datasource relation (option 1) to query
the timeline. It is elegant and fits well with spark APIs.

Thanks,
Balaji.V

On Saturday, May 30, 2020, 01:18:45 PM PDT, Vinoth Chandar wrote:
 
Hi Satish,

Are you looking for similar functionality as HoodieDatasourceHelpers?

We have historically relied on the CLI to inspect the table, which does not
lend itself well to programmatic access.. overall I like option 1 -
allowing the timeline to be queryable with a standard schema does seem way
nicer.

I am wondering though if we should introduce a new view. Instead we can use
a different data source name -
spark.read.format(“hudi-timeline”).load(basepath). We can start by just
allowing querying of the active timeline and expand this to the archive timeline?

What do others think?




On Fri, May 29, 2020 at 2:37 PM Satish Kotha 
wrote:

> Hello folks,
>
> We have a use case to incrementally generate data for a hudi table (say
> 'table2') by transforming data from another hudi table (say, table1). We
> want to atomically store commit timestamps read from table1 into table2
> commit metadata.
>
> This is similar to how DeltaStreamer operates with kafka offsets. However,
> DeltaStreamer is java code and can easily query the kafka offsets processed
> by creating a metaclient for the target table. We want to use pyspark and I
> don't see a good way to query commit metadata of table1 from the DataSource.
>
> I'm considering making one of the below changes to hoodie to make this
> easier.
>
> Option 1: Add a new relation in hudi-spark to query commit metadata. This
> relation would present a 'metadata view' to query and filter metadata.
>
> Option 2: Add other DataSource options on top of incremental querying to
> allow fetching from the source table. For example, users can specify
> 'hoodie.consume.metadata.table: table2BasePath' and issue an incremental
> query on table1. Then, IncrementalRelation would read table2 metadata
> first to identify 'consume.start.timestamp' and start the incremental
> read on table1 with that timestamp.
>
> Option 2 looks simpler to implement. But it seems a bit hacky because we are
> reading metadata from table2 when the data source is table1.
>
> Option 1 is a bit more complex. But it is cleaner and not tightly coupled
> to incremental reads. For example, use cases other than incremental reads
> can leverage the same relation to query metadata.
>
> What do you guys think? Let me know if there are other simpler solutions.
> Appreciate any feedback.
>
> Thanks
> Satish
>