[ 
https://issues.apache.org/jira/browse/HUDI-8588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-8588:
--------------------------------------
    Fix Version/s: 1.0.0

> Col stats pruning is ignoring to leverage stats from log files
> --------------------------------------------------------------
>
>                 Key: HUDI-8588
>                 URL: https://issues.apache.org/jira/browse/HUDI-8588
>             Project: Apache Hudi
>          Issue Type: Bug
>          Components: metadata
>            Reporter: sivabalan narayanan
>            Priority: Major
>             Fix For: 1.0.0
>
>
> [https://github.com/apache/hudi/blob/5c28762c80c678e66df662c0f4e4c2855766840f/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/ColumnStatsIndexSupport.scala#L88]
>  
>  
> Here, when we call 
> SparkBaseIndexSupport.getPrunedPartitionsAndFileNames(), we do not set 
> any value for includeLogFiles, so the default value of false takes effect. 
> As a result, pruning is not effective: any file slice containing log files 
> will never be pruned out. 
> Proposed fix: 
> [https://github.com/apache/hudi/pull/12331/files/f5319839b1bc9147e89582b5fe3172340387c11f..5ce49a2d9257962f6bc31e1dfeabe95ed4960c58#r1859424562]
>  
>  
> I checked our existing tests. Most of them test ColumnStatsIndexSupport 
> directly and set "includeLogFiles" to true, so their expectations were met. 
> However, it looks like we are missing end-to-end functional tests that 
> verify the pruning behavior, especially when none of the log files match a 
> predicate. 
>  
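
The effect of the omitted argument described above can be illustrated with a small, self-contained model. This is only a sketch under stated assumptions: FileSlice, may_match, prune, and the sample statistics are hypothetical names and data invented for illustration, not Hudi APIs, and the logic is a deliberate simplification of how column stats pruning treats file slices whose log files are not consulted.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

# Hypothetical, simplified model of the reported behavior; not Hudi code.
Range = Tuple[int, int]  # (min, max) of one column within one file


@dataclass
class FileSlice:
    base_file: str
    log_files: List[str] = field(default_factory=list)


def may_match(stats: Range, value: int) -> bool:
    """Return True if a file with these min/max stats could contain `value`."""
    lo, hi = stats
    return lo <= value <= hi


def prune(slices: List[FileSlice], col_stats: Dict[str, Range], value: int,
          include_log_files: bool = False) -> List[str]:
    """Return base file names of the slices that survive pruning for `col == value`."""
    kept = []
    for s in slices:
        if s.log_files and not include_log_files:
            # Assumption: log file stats are not consulted, so the slice
            # has to be kept conservatively, whatever its stats say.
            kept.append(s.base_file)
            continue
        files = [s.base_file] + s.log_files
        if any(may_match(col_stats[f], value) for f in files):
            kept.append(s.base_file)
    return kept


slices = [FileSlice("f1.parquet"), FileSlice("f2.parquet", ["f2.log.1"])]
col_stats = {"f1.parquet": (0, 10), "f2.parquet": (20, 30), "f2.log.1": (31, 40)}

# Caller forgets the flag: the default (False) applies and f2 is never pruned.
print(prune(slices, col_stats, 100))
# Caller passes the flag: stats from the log file are used and f2 is pruned.
print(prune(slices, col_stats, 100, include_log_files=True))
```

In this model, a test that always passes include_log_files=True would never observe the first call's behavior, which is the kind of gap an end-to-end test through the real call path would close.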



--
This message was sent by Atlassian Jira
(v8.20.10#820010)