I tested the HoodieReadClient. It's a great start indeed. It looks like
this client is meant for testing purposes and needs some enhancement. I
will try to produce some general purpose code around this and, who
knows, contribute it.

I guess the datasource API is not the best candidate, since Hudi keys
cannot really be passed as options, only as an RDD or a DataFrame:

spark.read.format('hudi').option('hudi.filter.keys',
'a,flat,list,of,keys,not,really,cool').load(...)

There is also the option to introduce a new Hudi operation such as
"select". But again, the save path is not supposed to return a
dataframe, only to write to the Hudi table:

df_hudi_keys.write.format('hudi').options(**hudi_options).save(...)

Then a full-featured, documented Hoodie client is maybe the best option.
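For context on why a bloom-backed "select" should beat a full scan, here
is a self-contained toy sketch of the pruning idea. This is illustrative
only: the class, sizes and file names are made up and are not Hudi's
actual bloom index implementation.

```python
import hashlib

class ToyBloomFilter:
    """Minimal Bloom filter: k hash positions over an m-bit integer."""

    def __init__(self, m=1024, k=3):
        self.m, self.k, self.bits = m, k, 0

    def _positions(self, key):
        # Derive k deterministic bit positions from the key.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{key}".encode()).hexdigest()
            yield int(digest, 16) % self.m

    def add(self, key):
        for p in self._positions(key):
            self.bits |= 1 << p

    def might_contain(self, key):
        # No false negatives are possible; false positives are rare.
        return all((self.bits >> p) & 1 for p in self._positions(key))

# One filter per data file (Hudi keeps per-file bloom filters): prune
# the files that cannot contain the wanted user_id before scanning.
files = {
    "file_a.parquet": ["u001", "u002", "u003"],
    "file_b.parquet": ["u004", "u005"],
}
blooms = {}
for name, keys in files.items():
    bf = ToyBloomFilter()
    for key in keys:
        bf.add(key)
    blooms[name] = bf

target = "u004"
candidates = [n for n, bf in blooms.items() if bf.might_contain(target)]
# Only the candidate files need an actual scan for the GDPR export.
```

The same pruning is what makes a bloom-driven "select" attractive: the
scan cost scales with the candidate files, not with the whole table.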


Thoughts?


On Thu Oct 28, 2021 at 2:34 PM CEST, Vinoth Chandar wrote:
> Sounds great!
>
> On Tue, Oct 26, 2021 at 7:26 AM Nicolas Paris <nicolas.pa...@riseup.net>
> wrote:
>
> > Hi Vinoth,
> >
> > Thanks for the starter. Definitely, once the new way to manage indexes
> > lands and we get migrated to Hudi on our datalake, I'd be glad to give
> > this a shot.
> >
> >
> > Regards, Nicolas
> >
> > On Fri Oct 22, 2021 at 4:33 PM CEST, Vinoth Chandar wrote:
> > > Hi Nicolas,
> > >
> > > Thanks for raising this! I think it's a very valid ask.
> > > https://issues.apache.org/jira/browse/HUDI-2601 has been raised.
> > >
> > > As a proof of concept, would you be able to give filterExists() a shot
> > > and
> > > see if the filtering time improves?
> > >
> > > https://github.com/apache/hudi/blob/master/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/HoodieReadClient.java#L172
> > >
> > > In the upcoming 0.10.0 release, we are planning to move the bloom
> > > filters
> > > out to a partition on the metadata table, to even speed this up for very
> > > large tables.
> > > https://issues.apache.org/jira/browse/HUDI-1295
> > >
> > > Please let us know if you are interested in testing that when the PR is
> > > up.
> > >
> > > Thanks
> > > Vinoth
> > >
> > > On Tue, Oct 19, 2021 at 4:38 AM Nicolas Paris <nicolas.pa...@riseup.net>
> > > wrote:
> > >
> > > > hi !
> > > >
> > > > In my use case, for GDPR I have to export all information for a given
> > > > user from several HUGE Hudi tables. Filtering the tables results in a
> > > > full scan of around 10 hours, and this will get worse year after year.
> > > >
> > > > Since the filter criterion is based on the bloom key (user_id), it
> > > > would be handy to exploit the bloom filter and produce a temporary
> > > > table (in the metastore for example) with the resulting rows.
> > > >
> > > > So far the bloom index is used for update/delete operations on a Hudi
> > > > table.
> > > >
> > > > 1. There is an opportunity to exploit the bloom filter for select
> > > > operations. The Hudi options would be:
> > > > operation: select
> > > > result-table: <table name>
> > > > result-path: <s3 path|hdfs path>
> > > > result-schema: <table schema in metastore> (optional; when empty, no
> > > > sync with the HMS, only the raw path)
> > > >
> > > >
> > > > 2. It could be implemented as predicate pushdown in the Spark
> > > > datasource API, when filtering with an IN statement.
> > > >
> > > >
> > > > Thoughts?
> > > >
> >
> >