sivabalan narayanan created HUDI-8588:
-----------------------------------------
Summary: Col stats pruning does not leverage stats from log files
Key: HUDI-8588
URL: https://issues.apache.org/jira/browse/HUDI-8588
Project: Apache Hudi
Issue Type: Bug
Components: metadata
Reporter: sivabalan narayanan
[https://github.com/apache/hudi/blob/5c28762c80c678e66df662c0f4e4c2855766840f/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/ColumnStatsIndexSupport.scala#L88]
Here, when we call SparkBaseIndexSupport.getPrunedPartitionsAndFileNames(), we
do not set a value for includeLogFiles, so its default value of false takes
effect. As a result, we are not pruning effectively: any file slice containing
log files will never be pruned out due to this bug.
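
To illustrate how an omitted argument silently falls back to the default, here is a minimal toy model. It is not Hudi code: FileSlice, collect_file_names and the file names are hypothetical stand-ins, and the only behavior taken from this issue is that the flag defaults to false when the caller does not set it.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class FileSlice:
    """Hypothetical stand-in for a file slice: one base file plus log files."""
    base_file: str
    log_files: List[str] = field(default_factory=list)


def collect_file_names(slices: List[FileSlice],
                       include_log_files: bool = False) -> Set[str]:
    """Collect file names from the slices.

    Log file names are added only when include_log_files is True. The
    default of False mirrors the default described in this issue.
    """
    names: Set[str] = set()
    for file_slice in slices:
        names.add(file_slice.base_file)
        if include_log_files:
            names.update(file_slice.log_files)
    return names


slices = [
    FileSlice("f1.parquet", [".f1.log.1", ".f1.log.2"]),
    FileSlice("f2.parquet"),
]

# Caller forgets the flag: the default applies and log files are dropped.
default_names = collect_file_names(slices)

# Caller sets the flag explicitly: log files are part of the result.
explicit_names = collect_file_names(slices, include_log_files=True)

print(sorted(default_names))
print(sorted(explicit_names))
```

In this sketch the first call never sees the log files of f1, which is the kind of omission described above. A test that always passes the flag explicitly would not notice it.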
Proposed fix:
[https://github.com/apache/hudi/pull/12331/files/f5319839b1bc9147e89582b5fe3172340387c11f..5ce49a2d9257962f6bc31e1dfeabe95ed4960c58#r1859424562]
I checked our existing tests. Most of them test ColumnStatsIndexSupport
directly and set "includeLogFiles" to true, so their expectations were met.
However, it looks like we are missing end-to-end functional tests that verify
the pruning behavior, especially when none of the log files match a
predicate.