Zhangshunyu opened a new issue, #7032:
URL: https://github.com/apache/hudi/issues/7032
We enabled the metadata table with "id, t" as the column stats columns and
turned on data skipping. When we take id values that exist in the table and use
them as filters to query details, some ids return rows but others return an
empty result. The queries look like the following:
select * from table_a where id in ('id001');
select * from table_a where id in ('id002');
Both 'id001' and 'id002' exist in the table, but the query for 'id001' returns
rows while the query for 'id002' returns an empty result.
We also found that the list of candidate files after the index filter is
applied is empty for 'id002'. It seems the MIN/MAX values stored in the
metadata table may be wrong?
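To make the suspicion above concrete, here is a minimal sketch of how min/max
data skipping picks candidate files. It is not Hudi's implementation:
`FileStats`, `candidate_files`, and every file name and range below are
hypothetical and only illustrate how a wrong recorded range would drop a file
that actually contains the value.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileStats:
    """Hypothetical per-file column statistics (not Hudi's actual record type)."""
    file_name: str
    min_value: str
    max_value: str


def candidate_files(stats, wanted):
    """Keep files whose [min, max] range could contain any wanted value."""
    return [
        s.file_name
        for s in stats
        if any(s.min_value <= v <= s.max_value for v in wanted)
    ]


# Made-up statistics for the `id` column, for illustration only.
correct = [
    FileStats("file_a.parquet", "id001", "id001"),
    FileStats("file_b.parquet", "id002", "id009"),
]

# Same files, but the range recorded for file_b is wrong (too narrow).
corrupted = [
    FileStats("file_a.parquet", "id001", "id001"),
    FileStats("file_b.parquet", "id005", "id009"),
]

print(candidate_files(correct, ["id002"]))    # ['file_b.parquet']
print(candidate_files(corrupted, ["id002"]))  # []
print(candidate_files(corrupted, ["id001"]))  # ['file_a.parquet']
```

With the narrowed range, 'id002' matches no file and the query returns nothing,
while 'id001' still finds its file. That is the same pattern we observe.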
Our config is as follows:
Hudi 0.11
Spark 3.1.1
DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY -> "years,months,days",
"hoodie.sql.insert.mode" -> "non-strict",
"hoodie.bulkinsert.sort.mode" -> "GLOBAL_SORT",
"hoodie.metadata.enable" -> "true",
"hoodie.bulkinsert.shuffle.parallelism" -> "300",
"hoodie.parquet.max.file.size" -> "134217728",
"hoodie.parquet.compression.codec" -> "snappy",
"hoodie.parquet.dictionary.enabled" -> "false",
"hoodie.metadata.index.column.stats.enable" -> "true",
"hoodie.enable.data.skipping" -> "true",
"hoodie.cleaner.policy.failed.writes" -> "LAZY",
"hoodie.clean.automatic" -> "false",
"hoodie.metadata.index.column.stats.column.list" -> "id, t",
"hoodie.metadata.index.column.stats.file.group.count" -> "10",
"hoodie.metadata.clean.async" -> "true",
"hoodie.metadata.compact.max.delta.commits" -> "4")
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.