sivabalan narayanan created HUDI-8588:
-----------------------------------------

             Summary: Col stats pruning is ignoring to leverage stats from log 
files
                 Key: HUDI-8588
                 URL: https://issues.apache.org/jira/browse/HUDI-8588
             Project: Apache Hudi
          Issue Type: Bug
          Components: metadata
            Reporter: sivabalan narayanan


[https://github.com/apache/hudi/blob/5c28762c80c678e66df662c0f4e4c2855766840f/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/ColumnStatsIndexSupport.scala#L88]
 

 

Here, when we call SparkBaseIndexSupport.getPrunedPartitionsAndFileNames(), we
do not set a value for includeLogFiles, so the default value of false takes
effect. As a result, we are not pruning effectively: any file slice containing
log files will never be pruned out.
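
To make the effect concrete, here is a simplified toy model in Python. It is
not Hudi's actual code: FileSlice, file_names_for_stats, and prune are
hypothetical names, and the rule "a file whose stats were not looked up is
kept" is an assumption made for illustration. It only shows how leaving a
defaulted flag at false can keep every file slice that has log files.

```python
from dataclasses import dataclass, field


@dataclass
class FileSlice:
    # Hypothetical stand-in for a file slice: one base file plus log files.
    base_file: str
    log_files: list = field(default_factory=list)


def file_names_for_stats(slices, include_log_files=False):
    """Collect the file names whose column stats will be looked up."""
    names = set()
    for s in slices:
        names.add(s.base_file)
        if include_log_files:
            names.update(s.log_files)
    return names


def prune(slices, stats, lo, hi, include_log_files=False):
    """Keep slices that may hold a value in [lo, hi].

    Assumption: a file whose stats were not looked up is treated
    conservatively as a possible match, so its slice is kept.
    """
    looked_up = file_names_for_stats(slices, include_log_files)

    def may_match(name):
        if name not in looked_up or name not in stats:
            return True
        fmin, fmax = stats[name]
        return fmin <= hi and fmax >= lo

    return [s for s in slices
            if any(may_match(f) for f in [s.base_file, *s.log_files])]


# (min, max) of one column per file; no file overlaps the range [100, 200].
stats = {"a.parquet": (0, 10), "a.log": (11, 20), "b.parquet": (0, 10)}
slices = [FileSlice("a.parquet", ["a.log"]), FileSlice("b.parquet")]

kept_default = prune(slices, stats, 100, 200)
kept_with_logs = prune(slices, stats, 100, 200, include_log_files=True)
```

In this model, kept_default still contains the slice with a.log even though no
file matches the predicate, while kept_with_logs is empty.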

Proposed fix: 

[https://github.com/apache/hudi/pull/12331/files/f5319839b1bc9147e89582b5fe3172340387c11f..5ce49a2d9257962f6bc31e1dfeabe95ed4960c58#r1859424562]
 

 

I checked our existing tests. Most of them test ColumnStatsIndexSupport
directly and set "includeLogFiles" to true, so their expectations were met.
However, it looks like we are missing end-to-end functional tests that verify
the pruning behavior, especially when none of the log files match a predicate.

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
