Re: feature request/proposal: leverage bloom indexes for reading
Hi,

You are right about the datasource API; this is one of the mismatches that prevents us from exposing this more nicely. We are definitely going the route of having a select query that takes hints and uses the index for faster lookups. In 0.11 we could try this, once the new multi-modal indexing lands. For now, you can actually use the ReadClient; it will work fine. I think we used it internally at Uber.

On Thu, Oct 28, 2021 at 10:22 AM Nicolas Paris wrote:
Re: feature request/proposal: leverage bloom indexes for reading
I tested the HoodieReadClient. It's a great start indeed. It looks like this client is meant for testing purposes and needs some enhancement. I will try to produce general-purpose code around this and, who knows, contribute.

I guess the datasource API is not the best candidate, since Hudi keys cannot be passed as options but only as an RDD or DataFrame:

spark.read.format('hudi').option('hudi.filter.keys',
    'a,flat,list,of,keys,not,really,cool').load(...)

There is also the option to introduce a new Hudi operation such as "select". But again, that path is not supposed to return a dataframe but to write to the Hudi table:

df_hudi_keys.options(**hudi_options).save(...)

Then a full-featured / documented Hoodie client is maybe the best option.

Thoughts?

On Thu Oct 28, 2021 at 2:34 PM CEST, Vinoth Chandar wrote:
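As a side note on why a flat comma-separated option string is awkward: record keys that themselves contain commas would need CSV-style quoting to round-trip through a single option value. A minimal sketch in plain Python (the option name `hudi.filter.keys` in the snippet above is hypothetical, not a real Hudi config):

```python
import csv
import io

def encode_keys(keys):
    """Pack a list of record keys into one option string, CSV-quoted so
    keys containing commas or quotes survive the round trip."""
    buf = io.StringIO()
    csv.writer(buf).writerow(keys)
    return buf.getvalue().strip("\r\n")

def decode_keys(option_value):
    """Unpack the option string back into the original key list."""
    return next(csv.reader(io.StringIO(option_value)))

keys = ["user_001", "last,first", 'with"quote']
packed = encode_keys(keys)
assert decode_keys(packed) == keys   # CSV quoting round-trips safely
assert packed.split(",") != keys     # a naive split would mangle the keys
```

This is only an illustration of the ergonomics problem; passing keys as a DataFrame, as discussed above, sidesteps it entirely.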
Re: feature request/proposal: leverage bloom indexes for reading
Sounds great!

On Tue, Oct 26, 2021 at 7:26 AM Nicolas Paris wrote:
Re: feature request/proposal: leverage bloom indexes for reading
Hi Vinoth,

Thanks for the starter. Definitely, once the new way to manage indexes lands and we are migrated to Hudi on our datalake, I'd be glad to give this a shot.

Regards, Nicolas

On Fri Oct 22, 2021 at 4:33 PM CEST, Vinoth Chandar wrote:
Re: feature request/proposal: leverage bloom indexes for reading
Hi Nicolas,

Thanks for raising this! I think it's a very valid ask. https://issues.apache.org/jira/browse/HUDI-2601 has been raised.

As a proof of concept, would you be able to give filterExists() a shot and see if the filtering time improves?
https://github.com/apache/hudi/blob/master/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/HoodieReadClient.java#L172

In the upcoming 0.10.0 release, we are planning to move the bloom filters out to a partition on the metadata table, to speed this up even for very large tables.
https://issues.apache.org/jira/browse/HUDI-1295

Please let us know if you are interested in testing that when the PR is up.

Thanks
Vinoth

On Tue, Oct 19, 2021 at 4:38 AM Nicolas Paris wrote:
feature request/proposal: leverage bloom indexes for reading
hi!

In my use case, for GDPR, I have to export all information for a given user from several HUGE Hudi tables. Filtering the tables results in a full scan of around 10 hours, and this will get worse year after year.

Since the filter criterion is the bloom key (user_id), it would be handy to exploit the bloom index and produce a temporary table (in the metastore, e.g.) with the resulting rows.

So far the bloom indexing is only used for update/delete operations on a Hudi table.

1. There is an opportunity to exploit the bloom index for select operations. The Hudi options would be:

   operation: select
   result-table:
   result-path:
   result-schema: (optional; when empty, no sync with the HMS, only the raw path)

2. It could be implemented as predicate pushdown in the Spark datasource API, when filtering with an IN statement.

Thoughts?
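To make the idea concrete, here is a toy sketch in plain Python (not Hudi's actual implementation) of the two-phase lookup a bloom index enables: each data file carries a bloom filter over its record keys, so a point or IN query only scans the files whose filter matches, instead of all of them. The `BloomFilter` here is a minimal hand-rolled one; Hudi's real filters live in the parquet file footers (and, per HUDI-1295, would move to the metadata table).

```python
import hashlib

class BloomFilter:
    """Tiny bloom filter: k hash probes into an m-slot bit array.
    May return false positives, never false negatives."""
    def __init__(self, m=1024, k=3):
        self.m, self.k, self.bits = m, k, bytearray(m)

    def _probes(self, key):
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, key):
        for p in self._probes(key):
            self.bits[p] = 1

    def might_contain(self, key):
        return all(self.bits[p] for p in self._probes(key))

def candidate_files(files, keys):
    """Phase 1: prune to files whose bloom filter matches any queried key.
    Phase 2 (not shown) scans only these files to drop false positives."""
    return [name for name, (bf, _) in files.items()
            if any(bf.might_contain(k) for k in keys)]

# Build a toy "table": file name -> (bloom filter over its keys, rows).
files = {}
for name, rows in [("f1.parquet", ["u1", "u2"]),
                   ("f2.parquet", ["u3", "u4"]),
                   ("f3.parquet", ["u5", "u6"])]:
    bf = BloomFilter()
    for k in rows:
        bf.add(k)
    files[name] = (bf, rows)

# A GDPR-style lookup for one user touches (almost) only the relevant file,
# instead of full-scanning every file in the table.
print(candidate_files(files, ["u3"]))
```

The same pruning works for an IN list of keys, which is exactly the predicate-pushdown case in point 2 above; the remaining scan cost is proportional to the matching files, not the table size.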