Re: feature request/proposal: leverage bloom indexes for reading

2021-11-03 Thread Vinoth Chandar
Hi,

You are right about the datasource API. This is one of the mismatches that
prevents us from exposing this more nicely.

We are definitely going the route of having a select query take hints and
use the index for faster lookups. We could try this in 0.11, once the new
multi-modal indexing lands.

For now, you can actually use the HoodieReadClient; it will work fine. I
think we used it internally at Uber.
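(Editor's note: the mechanism Vinoth's suggestion relies on can be sketched in plain Python. Each data file carries a bloom filter over its record keys, so a point lookup only has to open the files whose filter reports a possible match. This is a toy model for illustration only; file names are made up, and Hudi's actual BloomFilter and HoodieReadClient.filterExists APIs differ.)

```python
import hashlib

class BloomFilter:
    """Toy bloom filter; NOT Hudi's implementation, just the idea."""

    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.k = num_hashes
        self.bits = 0

    def _positions(self, key):
        # Derive k bit positions from the key (simplified hashing scheme).
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, key):
        for p in self._positions(key):
            self.bits |= 1 << p

    def might_contain(self, key):
        # No false negatives; false positives possible but rare.
        return all((self.bits >> p) & 1 for p in self._positions(key))

# One bloom filter per data file, conceptually like the filters Hudi
# stores in parquet file footers (hypothetical file names and keys).
files = {
    "file_a.parquet": ["user_1", "user_2"],
    "file_b.parquet": ["user_3", "user_4"],
}
blooms = {}
for name, keys in files.items():
    bf = BloomFilter()
    for key in keys:
        bf.add(key)
    blooms[name] = bf

# A key lookup only needs to scan the candidate files, not all of them.
# "file_b.parquet" is guaranteed to appear (blooms have no false negatives).
candidates = [n for n, bf in blooms.items() if bf.might_contain("user_3")]
print(candidates)
```

This is why a bloom-assisted select avoids the full scan: files whose filter rules the key out are never read.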


On Thu, Oct 28, 2021 at 10:22 AM Nicolas Paris 
wrote:

> I tested the HoodieReadClient. It's a great start indeed. It looks like
> this client is meant for testing purposes and needs some enhancement. I
> will try to produce some general-purpose code around this and, who knows,
> contribute it.
>
> I guess the datasource API is not the best candidate, since Hudi keys
> cannot be passed as options, only as an RDD or DataFrame:
>
> spark.read.format('hudi').option('hudi.filter.keys',
> 'a,flat,list,of,keys,not,really,cool').load(...)
>
> There is also the option to introduce a new Hudi operation such as
> "select". But again, that path is not supposed to return a dataframe but
> to write to the Hudi table:
>
> df_hudi_keys.options(**hudi_options).save(...)
>
> So a full-featured, documented Hoodie read client is maybe the best option.
>
>
> Thoughts?
>
>
> On Thu Oct 28, 2021 at 2:34 PM CEST, Vinoth Chandar wrote:
> > Sounds great!
> >
> > On Tue, Oct 26, 2021 at 7:26 AM Nicolas Paris 
> > wrote:
> >
> > > Hi Vinoth,
> > >
> > > Thanks for the starter. Definitely, once the new way to manage indexes
> > > lands and we have migrated our datalake to Hudi, I'd be glad to give
> > > this a shot.
> > >
> > >
> > > Regards, Nicolas
> > >
> > > On Fri Oct 22, 2021 at 4:33 PM CEST, Vinoth Chandar wrote:
> > > > Hi Nicolas,
> > > >
> > > > Thanks for raising this! I think it's a very valid ask.
> > > > https://issues.apache.org/jira/browse/HUDI-2601 has been raised.
> > > >
> > > > As a proof of concept, would you be able to give filterExists() a
> > > > shot and see if the filtering time improves?
> > > >
> > > > https://github.com/apache/hudi/blob/master/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/HoodieReadClient.java#L172
> > > >
> > > > In the upcoming 0.10.0 release, we are planning to move the bloom
> > > > filters out to a partition on the metadata table, to even speed this
> > > > up for very large tables.
> > > > https://issues.apache.org/jira/browse/HUDI-1295
> > > >
> > > > Please let us know if you are interested in testing that when the PR
> > > > is up.
> > > >
> > > > Thanks
> > > > Vinoth
> > > >
> > > > On Tue, Oct 19, 2021 at 4:38 AM Nicolas Paris <nicolas.pa...@riseup.net>
> > > > wrote:
> > > >
> > > > > hi !
> > > > >
> > > > > In my use case, for GDPR I have to export all information for a
> > > > > given user from several HUGE Hudi tables. Filtering the table
> > > > > results in a full scan of around 10 hours, and this will get worse
> > > > > year after year.
> > > > >
> > > > > Since the filter criterion is based on the bloom key (user_id), it
> > > > > would be handy to exploit the bloom filters and produce a temporary
> > > > > table (in the metastore, for example) with the resulting rows.
> > > > >
> > > > > So far the bloom indexing is used for update/delete operations on a
> > > > > Hudi table.
> > > > >
> > > > > 1. There is an opportunity to exploit the bloom filters for select
> > > > > operations. The Hudi options would be:
> > > > > operation: select
> > > > > result-table: 
> > > > > result-path: 
> > > > > result-schema:  (optional; when empty, no sync with the HMS, only
> > > > > the raw path)
> > > > >
> > > > >
> > > > > 2. It could be implemented as predicate pushdown in the Spark
> > > > > datasource API, when filtering with an IN statement.
> > > > >
> > > > >
> > > > > Thoughts?
> > > > >
> > >
> > >
>
>
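(Editor's note: the predicate-pushdown variant in point 2 of the quoted proposal generalizes the single-key probe. For a `WHERE user_id IN (...)` filter, a planner can union the candidate files across all keys in the list before scanning anything. A self-contained toy sketch follows; the function names, file names, and hashing scheme are illustrative, not Hudi code.)

```python
import hashlib

def bloom_positions(key, size=1024, k=3):
    """Hash a key to k bit positions (toy scheme, not Hudi's)."""
    for i in range(k):
        digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
        yield int.from_bytes(digest[:8], "big") % size

def build_bloom(keys, size=1024, k=3):
    """Build a bloom filter (as an int bitset) over a file's record keys."""
    bits = 0
    for key in keys:
        for p in bloom_positions(key, size, k):
            bits |= 1 << p
    return bits

def might_contain(bits, key, size=1024, k=3):
    """True if the key may be in the file; never a false negative."""
    return all((bits >> p) & 1 for p in bloom_positions(key, size, k))

# Per-file bloom filters, as a reader could load them from file footers
# (or, after HUDI-1295, from a metadata-table partition). Hypothetical data.
file_blooms = {
    "f1.parquet": build_bloom(["user_1", "user_2"]),
    "f2.parquet": build_bloom(["user_3"]),
    "f3.parquet": build_bloom(["user_4", "user_5"]),
}

def prune_for_in_list(file_blooms, in_keys):
    """Files worth scanning for a WHERE user_id IN (in_keys) filter."""
    return sorted(
        name for name, bits in file_blooms.items()
        if any(might_contain(bits, key) for key in in_keys)
    )

# f1 and f3 are guaranteed to be listed; f2 only on a (rare) false positive.
print(prune_for_in_list(file_blooms, ["user_2", "user_4"]))
```

The remaining work on the read side is then just scanning the surviving files and filtering out bloom false positives, which is exactly where a full scan's 10 hours would be saved.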

