Hi!

In my use case, for GDPR I have to export all information about a given
user from several huge Hudi tables. Filtering the tables results in a
full scan of around 10 hours, and this will get worse year after year.
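
For context, the export today looks roughly like the following (table
path, output location, and user id are made up; spark is an existing
SparkSession):

    import org.apache.spark.sql.functions.col

    // Read the Hudi table through the Spark datasource and filter on the
    // bloom key; today this predicate still triggers a full table scan.
    val userRows = spark.read
      .format("hudi")
      .load("s3://bucket/warehouse/huge_table")
      .filter(col("user_id") === "some-user-id")

    // Export the matching rows for the GDPR request.
    userRows.write.parquet("s3://bucket/gdpr-export/some-user-id")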

Since the filter criterion is the bloom key (user_id), it would be
handy to exploit the bloom index and produce a temporary table (in the
metastore, for example) with the resulting rows.

So far, bloom indexing is only used for update/delete operations on a
Hudi table.

1. There is an opportunity to exploit the bloom index for select
operations. The Hudi options could be:

    operation: select
    result-table: <table name>
    result-path: <s3 path|hdfs path>
    result-schema: <table schema in metastore> (optional; when empty, no
    sync with the HMS, only the raw path)
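
Purely as an illustration, this is how those options might look from
Spark. None of these option keys exist in Hudi today; they are just the
proposal above spelled out, and all names and paths are made up:

    import org.apache.spark.sql.functions.col

    // Hypothetical "select" operation: answer the filter with the bloom
    // index and materialize the matching rows as a temporary table.
    spark.read
      .format("hudi")
      .option("operation", "select")                             // proposed
      .option("result-table", "gdpr_user_export")                // proposed
      .option("result-path", "s3://bucket/tmp/gdpr_user_export") // proposed
      .option("result-schema", "tmp_db")                         // proposed, optional
      .load("s3://bucket/warehouse/huge_table")
      .filter(col("user_id") === "some-user-id")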


2. It could be implemented as predicate pushdown in the Spark
datasource API, when filtering with an IN statement.
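
On the user-facing side nothing new would be needed: Spark already
pushes an IN predicate down to the datasource as an
org.apache.spark.sql.sources.In filter, and the Hudi relation could
match it against the per-file bloom filters to prune files before
scanning. A minimal sketch (path and ids are made up):

    import org.apache.spark.sql.functions.col

    // Spark pushes the isin() predicate down as In("user_id", ...), which
    // could be answered with bloom-filter file pruning instead of a scan.
    val userIds = Seq("user-1", "user-2", "user-3")

    val rows = spark.read
      .format("hudi")
      .load("s3://bucket/warehouse/huge_table")
      .filter(col("user_id").isin(userIds: _*))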


Thoughts?
