[jira] [Created] (HUDI-8389) Optimize re-adding missing files to col stats pruning

sivabalan narayanan (Jira) Thu, 17 Oct 2024 17:16:06 -0700

sivabalan narayanan created HUDI-8389:
-----------------------------------------


             Summary: Optimize re-adding missing files to col stats pruning 
                 Key: HUDI-8389
                 URL: https://issues.apache.org/jira/browse/HUDI-8389
             Project: Apache Hudi
          Issue Type: Improvement
          Components: metadata
            Reporter: sivabalan narayanan


Here is our logic for col stats based pruning.

 
h3. Pruning Design:
 * step1: Fetch the latest file slices for the pruned partitions (from MDT).
 * step2.a: Fetch stats from the col stats index, which returns entries in the 
format {{File1, col1 ➝ stat1}, {File2, col1 ➝ stat2}, ...}, i.e. one entry per 
file, column combo. We read these using 
HoodieTableMetadata.getRecordsByKeyPrefixes(), passing in just the columns.
 ** step2.b: Apply a filter function to prune entries from step2.a based on the 
list from step1. Each col stats value contains the file name, and we filter 
on that. The output of this step is the latest files found in the col stats 
partition in MDT.
 ** step2.c: Construct a matrix of the format File1 ➝ {col1_valuecount, 
col1_minvalue, col1_maxvalue, col2_valuecount, ...}, i.e. one entry per file.
 ** step2.d: Get the list of files indexed by col stats.
 ** step2.e: Apply the query predicate over the matrix from step2.c and get the 
list of pruned file names.
 * step3: If any files are missing from col stats (step1 output - step2.d 
output), add them back to the step2.e output to get the final pruned file 
list. In other words, pruned files + missingToIndexFiles is the final set of 
candidate files returned from this step.
 ** Let's name the output of step3 *candidate files*.
 * step4: For every file slice (from step1), check it against the candidate 
files from step3 => if every file in the file slice is missing from the 
candidate files, we can ignore the file slice (in other words, no file in this 
file slice matched the predicate from col stats, so it is safe to ignore the 
entire file slice). If even one file is present in the candidate files, we need 
to include the file slice in its entirety.
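
The flow above can be sketched in a few lines of Python. This is only an 
illustration of the set logic, not Hudi code: FileSlice, prune_file_slices and 
the dict-based stats layout are hypothetical names and shapes chosen for the 
example, and the col stats lookup is assumed to have already produced one stats 
dict per file.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class FileSlice:
    """Hypothetical stand-in for a file slice: a base file plus its log files."""
    file_id: str
    files: Tuple[str, ...]


def prune_file_slices(
    file_slices: List[FileSlice],
    col_stats: Dict[str, Dict[str, int]],
    predicate: Callable[[Dict[str, int]], bool],
) -> List[FileSlice]:
    # step1 output: every file that belongs to a latest file slice
    latest_files = {f for fs in file_slices for f in fs.files}
    # Keep only col stats entries that belong to the latest files
    stats = {f: s for f, s in col_stats.items() if f in latest_files}
    # Files that col stats actually knows about
    indexed = set(stats)
    # Files whose stats say the predicate may match
    matching = {f for f, s in stats.items() if predicate(s)}
    # step3: files without stats are re-added, giving the candidate files
    missing_to_index = latest_files - indexed
    candidates = matching | missing_to_index
    # Final step: keep a file slice if at least one of its files is a candidate
    return [fs for fs in file_slices if any(f in candidates for f in fs.files)]


# Query predicate "col1 = 5", expressed over min/max stats
def col1_may_equal_5(s: Dict[str, int]) -> bool:
    return s["col1_minvalue"] <= 5 <= s["col1_maxvalue"]


slices = [
    FileSlice("fg1", ("base1", "log1")),
    FileSlice("fg2", ("base2", "log2_rollback")),  # rollback block has no stats
]
col_stats = {
    "base1": {"col1_minvalue": 0, "col1_maxvalue": 3},
    "log1": {"col1_minvalue": 1, "col1_maxvalue": 4},
    "base2": {"col1_minvalue": 0, "col1_maxvalue": 2},
    "stale_file": {"col1_minvalue": 5, "col1_maxvalue": 5},  # not a latest file
}
result = prune_file_slices(slices, col_stats, col1_may_equal_5)
```

In this example no indexed file can match the predicate, yet fg2 survives 
because its rollback log file has no stats and is re-added in step3.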

 

Why do we need to re-add the files that are missing from col stats (step 3)? 
We know of 2 cases in which this can legitimately happen: log files from a 
failed commit, and rollback blocks. We could ignore these files and prune based 
only on the rest of the files in the file slice. For eg, if a file slice has a 
base file and 5 log files (one of which is a rollback block), and the base file 
and the other 4 log files did not match the predicate, we should skip the file 
slice. But as of now, we can't skip it. In summary, as per the current logic, 
if a file slice has any of these (rollback block, delete block, or data blocks 
from a failed commit), it can never be filtered out w/ col stats based pruning. 
We should definitely revisit this and fix it as much as possible.

For delete blocks also, I am wondering if we can do the same, i.e. on the write 
path we skip adding their entries to col stats, and then while pruning we only 
consider files w/ valid stats when deciding whether to prune a file slice. For 
eg, say we have a base file and 3 log files, one of which is a delete block. 
We do stats based pruning for the base file and 3 log files. If none of them 
matched, should we filter out the entire file slice? Or do we give it the 
benefit of the doubt and include it (which is what we do as of today)? 
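
One possible shape for the proposed behaviour, sketched in Python. This is an 
assumption-laden illustration rather than a design: the function name, the 
dict-based inputs, and especially the unindexable set (files known to carry no 
stats by design, such as rollback blocks or log files from failed commits) are 
hypothetical, and how such files would be identified is not covered here. 
Whether delete blocks can safely go into that set is the open question above.

```python
from typing import Callable, Dict, List, Set, Tuple


def prune_ignoring_unindexable(
    file_slices: Dict[str, Tuple[str, ...]],
    stats: Dict[str, Dict[str, int]],
    predicate: Callable[[Dict[str, int]], bool],
    unindexable: Set[str],
) -> List[str]:
    """Return ids of file slices that must still be read.

    file_slices maps a file slice id to its file names (hypothetical layout).
    """
    kept = []
    for slice_id, files in file_slices.items():
        # Only files expected to have stats take part in the decision
        considered = [f for f in files if f not in unindexable]
        if not considered:
            # Conservative choice for this sketch: nothing to judge by, keep it
            kept.append(slice_id)
            continue
        # A file with unexpectedly missing stats still forces inclusion
        if any(f not in stats or predicate(stats[f]) for f in considered):
            kept.append(slice_id)
    return kept


# The example from the text: a base file and 5 log files, one a rollback block
slices = {"fg1": ("base", "log1", "log2", "log3", "log4", "log5_rollback")}
no_match = {"col1_minvalue": 0, "col1_maxvalue": 3}
stats = {f: dict(no_match) for f in ("base", "log1", "log2", "log3", "log4")}


def col1_may_equal_5(s: Dict[str, int]) -> bool:
    return s["col1_minvalue"] <= 5 <= s["col1_maxvalue"]


skipped = prune_ignoring_unindexable(slices, stats, col1_may_equal_5, {"log5_rollback"})
current = prune_ignoring_unindexable(slices, stats, col1_may_equal_5, set())
```

With the rollback block declared unindexable the file slice is skipped, while 
with an empty set the sketch reproduces today's behaviour and keeps it.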



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
