Hi! In my use case, for GDPR I have to export all information about a given user from several huge Hudi tables. Filtering a table results in a full scan of around 10 hours, and this will only get worse year after year.
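For context, here is a minimal sketch of the current approach (the table path, export path, and user_id value are illustrative). The filter on user_id cannot use the bloom index today, so every file group gets scanned:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("gdpr-export")
  .getOrCreate()

// Read the Hudi table through the Spark datasource and filter on the
// record key. This triggers a full scan: the bloom index is not consulted
// for plain select queries.
val userRecords = spark.read
  .format("hudi")
  .load("s3://bucket/path/to/hudi_table")   // illustrative table path
  .filter("user_id = 'some-user-id'")        // full scan today

userRecords.write.parquet("s3://bucket/exports/some-user-id")
```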
Since the filter criterion is based on the bloom key (user_id), it would be handy to exploit the bloom index and produce a temporary table (in the metastore, for example) with the resulting rows. So far, bloom indexing is only used for update/delete operations on a Hudi table.

1. There is an opportunity to exploit the bloom index for select operations. The Hudi options would be (see the first sketch below):
   - operation: select
   - result-table: <table name>
   - result-path: <s3 path|hdfs path>
   - result-schema: <table schema in metastore> (optional; when empty, no sync with the HMS, only the raw path)
2. It could be implemented as predicate pushdown in the Spark datasource API when filtering with an IN statement (see the second sketch below).
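To illustrate point 1, here is a hypothetical sketch of how the proposed options could be passed through the existing datasource API. None of these option keys exist in Hudi today; they simply mirror the proposal above:

```scala
// Hypothetical: "operation", "result-table", "result-path" and
// "result-schema" are the proposed options, not real Hudi config keys.
val exported = spark.read
  .format("hudi")
  .option("operation", "select")                        // proposed: bloom-assisted select
  .option("result-table", "gdpr_user_export")           // proposed: name of the result table
  .option("result-path", "s3://bucket/exports/gdpr")    // proposed: where result rows land
  .option("result-schema", "default.gdpr_user_export")  // proposed, optional: HMS sync target
  .load("s3://bucket/path/to/hudi_table")
  .filter("user_id = 'some-user-id'")
```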
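And for point 2, this is the query shape the pushdown would target. Today the IN filter is evaluated after a full scan; with bloom-based pushdown, the datasource could prune every file group whose bloom filter rules out all of the requested keys:

```scala
import org.apache.spark.sql.functions.col

// Illustrative set of user ids to export in one pass.
val userIds = Seq("user-1", "user-2", "user-3")

// Candidate for bloom-based predicate pushdown: an IN filter on the
// record key column.
val matches = spark.read
  .format("hudi")
  .load("s3://bucket/path/to/hudi_table")
  .filter(col("user_id").isin(userIds: _*))
```

Thoughts?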