[ 
https://issues.apache.org/jira/browse/HUDI-8389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan closed HUDI-8389.
-------------------------------------
    Resolution: Won't Fix

> Optimize re-adding missing files to col stats pruning 
> ------------------------------------------------------
>
>                 Key: HUDI-8389
>                 URL: https://issues.apache.org/jira/browse/HUDI-8389
>             Project: Apache Hudi
>          Issue Type: Sub-task
>          Components: metadata
>            Reporter: sivabalan narayanan
>            Assignee: sivabalan narayanan
>            Priority: Blocker
>             Fix For: 1.0.1
>
>   Original Estimate: 2h
>          Time Spent: 1h
>  Remaining Estimate: 1h
>
> Here is our logic to do col stats based pruning.
>  
> h3. Pruning Design:
>  * step1 : Fetch latest file slices for pruned partitions (from MDT)
>  * step2.a : Fetch stats from the col stats index, which returns entries in 
> the format \{{File1, col1 ➝ stat1}, \{File2, col1 ➝ stat2},...} i.e. one 
> entry per (file, column) combination. We read these using 
> HoodieTableMetadata.{*}getRecordsByKeyPrefixes(){*}, passing in only the 
> {*}columns{*} as the key prefixes.
>  ** step2.b: Apply a filter function to prune entries from step 2.a based on 
> the list from step 1. Each col stats value contains the file name, and we 
> filter on that. The output of this step is the entries for the latest files, 
> as looked up from the col stats partition in MDT.
>  ** step2.c: Construct a matrix of the format File1 ➝ \{col1_valuecount, 
> col1_minvalue, col1_maxvalue, col2_valuecount, .... } i.e. one entry per file.
>  ** step2.d: Get the list of files indexed by col stats.
>  ** step2.e: Apply the query predicate over the matrix from step 2.c and get 
> the list of pruned file names.
>  ** step3: If any files are missing from the col stats index (step 1 
> output - step 2.d output), add them back to the output of step 2.e to get 
> the final list of pruned files. In other words, pruned files + 
> missingToIndexFiles are the final set of candidate files we return from 
> this step.
>  *** Let's name the output from step3 *candidate files.*
>  ** step4: For every file slice from step1 => if every file in this file 
> slice is missing from the candidate files, we can ignore the file slice (in 
> other words, no file in this file slice matched the predicate from col 
> stats, so we are safe to ignore the entire file slice). If even one file is 
> present in the candidate files, we need to include the file slice in its 
> entirety.
>  
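
The pruning steps listed above can be sketched in a few lines of Python. This is only an illustrative model, not Hudi's implementation: the function name `prune_file_slices`, the dict-based inputs, and the use of (min, max) tuples as stats are all assumptions made for the example.

```python
from typing import Callable, Dict, List, Tuple

Stats = Tuple[int, int]  # hypothetical (min, max) stats for one column


def prune_file_slices(
    file_slices: Dict[str, List[str]],        # latest file slices: slice id -> file names
    col_stats: Dict[Tuple[str, str], Stats],  # col stats entries: (file, column) -> stats
    predicate: Callable[[Dict[str, Stats]], bool],
) -> List[str]:
    latest_files = {f for files in file_slices.values() for f in files}

    # Keep only entries for the latest files and pivot to one row per file.
    per_file: Dict[str, Dict[str, Stats]] = {}
    for (file_name, column), stats in col_stats.items():
        if file_name in latest_files:
            per_file.setdefault(file_name, {})[column] = stats

    indexed = set(per_file)                                        # files with stats
    matched = {f for f, row in per_file.items() if predicate(row)} # files passing the predicate
    missing = latest_files - indexed                               # files with no stats at all
    candidates = matched | missing                                 # candidate files

    # A file slice survives if at least one of its files is a candidate.
    return [sid for sid, files in file_slices.items()
            if any(f in candidates for f in files)]


file_slices = {
    "fs1": ["base1", "log1"],
    "fs2": ["base2", "log2_rollback"],  # the rollback log file has no stats
    "fs3": ["base3"],
}
col_stats = {
    ("base1", "col1"): (0, 10),
    ("log1", "col1"): (40, 60),
    ("base2", "col1"): (0, 10),
    ("base3", "col1"): (100, 200),
    ("old_file", "col1"): (45, 55),     # not in any latest file slice
}
# Query predicate: col1 = 50, expressed as a min/max range check.
result = prune_file_slices(
    file_slices, col_stats, lambda row: row["col1"][0] <= 50 <= row["col1"][1]
)
```

In this example `fs2` is retained only because its unindexed rollback file is added back as a candidate, even though `base2` does not match the predicate.
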
> Why do we need to re-add the files missing from the col stats index (step 
> 3)? We know of 2 cases in which this can legitimately happen: for example, 
> log files from a failed commit, and rollback blocks. We can ignore these 
> files and prune based only on the rest of the files in the file slice. For 
> example, if a file slice has a base file and 5 log files (one of which is a 
> rollback block), and the base file and the other 4 log files do not match 
> the predicate, we should skip the file slice. But as of now, we can't skip 
> it. In summary, as per the current logic, if a file slice has any of these 
> (rollback block, delete block, or data blocks from a failed commit), it can 
> never be filtered out w/ col stats based pruning. We should definitely 
> revisit this and fix it as much as possible.
>
> For delete blocks also, I am wondering whether we can do the same, i.e. on 
> the write path, we can skip adding the entries to col stats, and then while 
> pruning consider only files w/ valid stats to prune a file slice. For 
> example, we have a base file and 3 log files, one of which is a delete 
> block. We do stats based pruning for the base file and 3 log files. If none 
> of them match, should we filter out the entire file slice? Or do we give 
> the benefit of the doubt and include it (which is what we do as of today)? 
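
One way to picture the alternative raised above is a variant in which files known to be safe to ignore no longer force a file slice to be kept. The sketch below is a hypothetical model only: the `ignorable` set (for example rollback blocks or log files from failed commits) and the function name are assumptions, and how such files would actually be identified is not specified in the issue.

```python
from typing import Callable, Dict, List, Set, Tuple

Stats = Tuple[int, int]  # hypothetical (min, max) stats for one column


def prune_with_ignorable(
    file_slices: Dict[str, List[str]],
    per_file_stats: Dict[str, Dict[str, Stats]],  # one row of stats per indexed file
    predicate: Callable[[Dict[str, Stats]], bool],
    ignorable: Set[str],  # assumption: files known to be safe to skip
) -> List[str]:
    kept = []
    for slice_id, files in file_slices.items():
        relevant = [f for f in files if f not in ignorable]
        # A relevant file without stats still forces the slice to be kept.
        unknown = any(f not in per_file_stats for f in relevant)
        matched = any(
            f in per_file_stats and predicate(per_file_stats[f]) for f in relevant
        )
        # Stay conservative when nothing relevant is left to check.
        if not relevant or unknown or matched:
            kept.append(slice_id)
    return kept


slices = {
    "fs1": ["base1", "log1", "log2", "log3", "log4", "rollback1"],
    "fs2": ["base2", "log5"],
    "fs3": ["base3", "log_unindexed"],
}
stats = {
    "base1": {"col1": (0, 10)}, "log1": {"col1": (0, 5)},
    "log2": {"col1": (5, 9)}, "log3": {"col1": (1, 2)},
    "log4": {"col1": (3, 4)},
    "base2": {"col1": (0, 10)}, "log5": {"col1": (40, 60)},
    "base3": {"col1": (0, 10)},
}
kept_slices = prune_with_ignorable(
    slices, stats, lambda row: row["col1"][0] <= 50 <= row["col1"][1],
    ignorable={"rollback1"},
)
```

Here `fs1` (a base file, 4 non-matching log files, and one rollback block) is skipped, while `fs3` is still kept because `log_unindexed` has no stats and is not known to be ignorable. Whether delete blocks can be placed in the ignorable set is exactly the open question posed above.
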



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
