bk-mz opened a new issue, #10511:
URL: https://github.com/apache/hudi/issues/10511
**Describe the problem you faced**
We encountered an issue with a MOR table that uses metadata bloom filters and
Parquet bloom filters and has column statistics enabled. When querying the
data, the system does not appear to use these bloom filters effectively:
every request results in a full partition scan, regardless of the applied
filters.
**To Reproduce**
Steps to reproduce the behavior:
1. Create a MOR table and write data using both Parquet bloom filters and
metadata bloom filters.
2. Attempt to query the data by applying a filter to one of the columns that
participate in bloom filtering. Ensure that the filter narrows down the dataset
size, making the bloom filters more likely to be effective.
3. Observe that the Spark SQL User Interface (UI) displays a full partition
scan.
4. Compare the query latency for the table with bloom filters (BF) against
the table without them (non-BF).
**Expected behavior**
The expected behavior is that querying the column with bloom filters (BF)
should be significantly more efficient than querying the column without bloom
filters (non-BF).
**Environment Description**
* Hudi version : 0.14.0
* Spark version : 3.5.0 AWS EMR 7.0.0
* Storage (HDFS/S3/GCS..) : S3
* Running on Docker? (yes/no) : no
**Additional context**
Table write hudi params:
```properties
hoodie.bloom.index.filter.type=DYNAMIC_V0
hoodie.bloom.index.prune.by.ranges=false
hoodie.bloom.index.use.metadata=true
hoodie.clean.async=true
hoodie.cleaner.policy.failed.writes=LAZY
hoodie.compact.inline.max.delta.commits=5
hoodie.datasource.hive_sync.database=db_name
hoodie.datasource.hive_sync.enable=true
hoodie.datasource.hive_sync.mode=hms
hoodie.datasource.hive_sync.partition_fields=year,month,day,hour
hoodie.datasource.hive_sync.table=table_name
hoodie.datasource.hive_sync.use_jdbc=false
hoodie.datasource.write.hive_style_partitioning=true
hoodie.datasource.write.partitionpath.field=year,month,day,hour
hoodie.datasource.write.path=s3://s3_path/table
hoodie.datasource.write.precombine.field=date_updated_epoch
hoodie.datasource.write.recordkey.field=id
hoodie.datasource.write.streaming.checkpoint.identifier=main_writer
hoodie.datasource.write.table.type=MERGE_ON_READ
hoodie.enable.data.skipping=true
hoodie.index.type=BLOOM
hoodie.metadata.enable=true
hoodie.metadata.index.async=true
hoodie.metadata.index.bloom.filter.column.list=id,account_id
hoodie.metadata.index.bloom.filter.enable=true
hoodie.metadata.index.column.stats.column.list=id,account_id
hoodie.metadata.index.column.stats.enable=true
hoodie.metricscompaction.log.blocks.on=true
hoodie.table.name=table_name
hoodie.write.concurrency.mode=optimistic_concurrency_control
hoodie.write.lock.dynamodb.partition_key=table_name
hoodie.write.lock.dynamodb.region=us-east-1
hoodie.write.lock.dynamodb.table=hudi-lock
hoodie.write.lock.num_retries=30
hoodie.write.lock.provider=org.apache.hudi.aws.transaction.lock.DynamoDBBasedLockProvider
hoodie.write.lock.wait_time_ms=30000
hoodie.write.lock.wait_time_ms_between_retry=10000
```
Hadoop parquet properties:
```properties
parquet.avro.write-old-list-structure=false
parquet.bloom.filter.enabled#account_id=true
parquet.bloom.filter.enabled#id=true
```
If I download one of the files from S3 and inspect it with the parquet CLI,
it shows that the bloom filter on the column is actually present and working:
```
parquet bloom-filter fe97585b-8a07-4a74-8445-16b898d1bb2b-0_191-4119-834504_20240116135428462.parquet -c account_id -v account_id1
Row group 0:
--------------------------------------------------------------------------------
value account_id1 NOT exists.

parquet bloom-filter fe97585b-8a07-4a74-8445-16b898d1bb2b-0_191-4119-834504_20240116135428462.parquet -c account_id -v account_id2
Row group 0:
--------------------------------------------------------------------------------
value account_id2 maybe exists.
```
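The two answers above are exactly the bloom-filter contract: a negative lookup is definitive ("NOT exists"), while a positive one is only probabilistic ("maybe exists"). A toy pure-Python sketch of that one-sided behavior (an illustration of the data structure, not the Parquet implementation):

```python
import hashlib

class TinyBloom:
    """Minimal bloom filter: k hash positions set in an m-bit field."""

    def __init__(self, m=4096, k=3):
        self.m, self.k, self.bits = m, k, 0

    def _positions(self, value: str):
        # Derive k positions from a keyed hash of the value.
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{value}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, value: str):
        for p in self._positions(value):
            self.bits |= 1 << p

    def might_contain(self, value: str) -> bool:
        # False means "definitely NOT exists"; True means "maybe exists".
        return all(self.bits >> p & 1 for p in self._positions(value))

bf = TinyBloom()
bf.add("account_id2")
print(bf.might_contain("account_id2"))  # True -> "maybe exists"
print(bf.might_contain("account_id1"))  # almost certainly False -> "NOT exists"
```

This is why an engine that consults bloom filters can skip files or row groups outright on a definitive negative, which is the behavior the report expected and did not observe.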
Read part:
```
$ spark-sql \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
  --conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog \
  --conf spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension \
  --jars=/usr/lib/hudi/hudi-spark-bundle.jar,/usr/lib/hudi/hudi-aws-bundle.jar \
  --conf spark.executor.cores=8 \
  --conf spark.executor.memory=27G \
  --conf spark.driver.cores=8 \
  --conf spark.driver.memory=27G
spark-sql (default)> select count(1) as cnt from table_with_bfs where year = 2024 and month = 1 and day = 5 and account_id = 'id1';
82
Time taken: 34.962 seconds, Fetched 1 row(s)
spark-sql (default)> select count(1) as cnt from table_no_bfs where year = 2024 and month = 1 and day = 5 and account_id = 'id2';
82
Time taken: 26.463 seconds, Fetched 1 row(s)
```
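One detail that may matter here (an assumption worth verifying against the Hudi 0.14.0 docs): data skipping is a query-side feature, so `hoodie.metadata.enable` and `hoodie.enable.data.skipping` generally need to be supplied on the read path as well, and the session above does not set them. A hedged sketch of a DataFrame read with those options:

```python
# Sketch: read-side options that data skipping typically requires at query
# time. Whether this is what causes the full scans reported above is
# unverified; the spark.read call is commented out (needs a live session).
read_opts = {
    "hoodie.metadata.enable": "true",
    "hoodie.enable.data.skipping": "true",
}
# spark.read.format("hudi").options(**read_opts).load("s3://s3_path/table")
```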
In this particular case `table_no_bfs` does not have any bloom filters for
this day, yet the query against the BF table for some reason takes more time
than the one against the table without BFs (34.962 s vs. 26.463 s, roughly
32% slower).
Number of rows in the table for this partition:
```
spark-sql (default)> select count(1) as cnt from table_with_bfs where year = 2024 and month = 1 and day = 5;
406188272
Time taken: 22.944 seconds, Fetched 1 row(s)
```
Spark SQL UI for BF table:

Spark SQL UI for Non-BF table:

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]