Reimus opened a new issue, #5808:
URL: https://github.com/apache/hudi/issues/5808
After writing a Hudi table with the following Spark command:
```
ds.write
  .format("hudi")
  .mode(SaveMode.Append)
  .option(DataSourceWriteOptions.PRECOMBINE_FIELD.key, "ts")
  .option(DataSourceWriteOptions.RECORDKEY_FIELD.key, "id")
  .option(DataSourceWriteOptions.PARTITIONPATH_FIELD.key, "ym")
  .option(DataSourceWriteOptions.OPERATION.key, DataSourceWriteOptions.UPSERT_OPERATION_OPT_VAL)
  .option(HoodieWriteConfig.TBL_NAME.key, tableName)
  .option(DataSourceWriteOptions.RECONCILE_SCHEMA.key, "true")
  .option(DataSourceWriteOptions.TABLE_TYPE.key, DataSourceWriteOptions.MOR_TABLE_TYPE_OPT_VAL)
  .option(HoodieTableConfig.TIMELINE_TIMEZONE.key, HoodieTimelineTimeZone.UTC.name)
  .option(HoodieWriteConfig.SCHEMA_EVOLUTION_ENABLE.key, "true")
  .option(HoodieWriteConfig.AVRO_SCHEMA_VALIDATE_ENABLE.key, "true")
  .option(HoodieWriteConfig.WRITE_CONCURRENCY_MODE.key, WriteConcurrencyMode.OPTIMISTIC_CONCURRENCY_CONTROL.name())
  .option(HoodieIndexConfig.BLOOM_FILTER_TYPE.key, BloomFilterTypeCode.DYNAMIC_V0.name)
  .option(HoodieIndexConfig.BLOOM_FILTER_NUM_ENTRIES_VALUE.key, String.valueOf(100000))
  .option(HoodieIndexConfig.BLOOM_INDEX_USE_METADATA.key, "true")
  .option(HoodieIndexConfig.BLOOM_INDEX_FILTER_DYNAMIC_MAX_ENTRIES.key, String.valueOf(1000000))
  .option(HoodieLockConfig.HIVE_DATABASE_NAME.key, databaseName)
  .option(HoodieLockConfig.HIVE_TABLE_NAME.key, tableName)
  .option(HoodieLockConfig.HIVE_METASTORE_URI.key, env.spark.hiveMetastore)
  .option(HoodieLockConfig.LOCK_PROVIDER_CLASS_NAME.key, classOf[org.apache.hudi.hive.HiveMetastoreBasedLockProvider].getName)
  .option(HoodieStorageConfig.PARQUET_MAX_FILE_SIZE.key, String.valueOf(256 * 1024 * 1024))
  .option(HoodieStorageConfig.PARQUET_BLOCK_SIZE.key, String.valueOf(256 * 1024 * 1024))
  .option(HoodieCompactionConfig.AUTO_CLEAN.key, "true")
  .option(HoodieCompactionConfig.FAILED_WRITES_CLEANER_POLICY.key, HoodieFailedWritesCleaningPolicy.LAZY.name)
  .option(HoodieCompactionConfig.CLEANER_POLICY.key, HoodieCleaningPolicy.KEEP_LATEST_BY_HOURS.name())
  .option(HoodieCompactionConfig.CLEANER_HOURS_RETAINED.key, String.valueOf(24))
  .option(HoodieCompactionConfig.PARQUET_SMALL_FILE_LIMIT.key, String.valueOf(104857600))
  .option(HoodieMetadataConfig.COLUMN_STATS_INDEX_FOR_COLUMNS.key, "ym,ymd,date,ts,lvl1.ymd,lvl1.lvl2.date")
  .option(HoodieMetadataConfig.BLOOM_FILTER_INDEX_FOR_COLUMNS.key, "id,col1,col2")
  .option(HoodieMetadataConfig.POPULATE_META_FIELDS.key, "true")
  .option(HoodieMetadataConfig.ENABLE_METADATA_INDEX_COLUMN_STATS.key, "true")
  .option(HoodieMetadataConfig.ENABLE_METADATA_INDEX_BLOOM_FILTER.key, "true")
  .option(HoodieMetadataConfig.ENABLE.key, "true")
  .save("/tmp/hudi")
```
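For reference, one quick way to check that the index partitions were actually built is to list the folders under the metadata table path. This is only a sketch, assuming the default metadata table location `<basePath>/.hoodie/metadata`; the expected partition names are illustrative.
```
// Sketch: list the metadata table partitions to confirm the indexes exist.
// Assumes the default metadata table location under the base path.
import org.apache.hadoop.fs.Path

val metadataPath = new Path("/tmp/hudi/.hoodie/metadata")
val fs = metadataPath.getFileSystem(spark.sparkContext.hadoopConfiguration)

fs.listStatus(metadataPath)
  .filter(_.isDirectory)
  .map(_.getPath.getName)
  .foreach(println) // expect partitions such as "files", "column_stats", "bloom_filters"
```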
And reading said table using:
```
val s = spark.read.format("hudi")
  .option("hoodie.datasource.query.type", "read_optimized")
  .option("hoodie.file.index.enable", "true")
  .option("hoodie.enable.data.skipping", "true")
  .option("hoodie.metadata.enable", "true")
  .option("hoodie.metadata.index.column.stats.enable", "true")
  .option("", "true")
  .option("hoodie.datasource.read.extract.partition.values.from.path", "true")
  .load("/tmp/hudi")

s.where('col1 === "values").show
s.where('col3 === "values").show
```
Here col1 is included in BLOOM_FILTER_INDEX_FOR_COLUMNS while col3 is not, and "values" is not expected to be found in the table.
For both queries, the same number of files is scanned.
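To make the comparison concrete, this is roughly how the number of scanned files can be checked programmatically. It is only a sketch: the `filesRead` helper is hypothetical, it relies on Spark internals (`FileSourceScanExec` and its `numFiles` metric), and it assumes the Hudi read path plans a standard file scan; the Spark SQL UI shows the same "number of files read" metric.
```
// Hypothetical helper (not from the issue): run the query, then read the
// "number of files read" metric from any file scan node in the executed plan.
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.execution.FileSourceScanExec

def filesRead(df: DataFrame): Seq[Long] = {
  df.collect() // execute so the scan metrics are populated
  df.queryExecution.executedPlan.collect {
    case scan: FileSourceScanExec => scan.metrics("numFiles").value
  }
}

println(filesRead(s.where('col1 === "values"))) // expected: only bloom-filter false positives
println(filesRead(s.where('col3 === "values"))) // expected: all files, since col3 is not indexed
```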
**Expected behavior**
Since the value is not expected to be found, only a small number of files (bloom filter false positives) should be scanned for the first query.
A full table scan is expected for the second query.
**Environment Description**
* Hudi version : 0.11
* Spark version : 3.1.2
* Hadoop version : 3.0.X
* Storage (HDFS/S3/GCS..) : HDFS
* Running on Docker? (yes/no) : no