sivabalan narayanan created HUDI-8389:
-----------------------------------------
Summary: Optimize re-adding missing files to col stats pruning
Key: HUDI-8389
URL: https://issues.apache.org/jira/browse/HUDI-8389
Project: Apache Hudi
Issue Type: Improvement
Components: metadata
Reporter: sivabalan narayanan
Here is our logic to do col stats based pruning:
h3. Pruning Design:
* step1: Fetch the latest file slices for the pruned partitions (from MDT).
* step2.a: Fetch stats from the col stats index, which outputs entries in the format
\{{File1, col1 ➝ stat1}, \{File2, col1 ➝ stat2},...} i.e. one entry per
file,column combo. Here we read using
HoodieTableMetadata.{*}getRecordsByKeyPrefixes(){*}, passing in just the
{*}columns{*}.
* step2.b: Apply a filter function to prune the entries from step 2.a based on the
list from step 1. Each col stats value contains the file name, and we filter
on that. The output of this step is the col stats entries for the latest files
only.
* step2.c: Construct a matrix of the format File1 ➝ \{col1_valuecount,
col1_minvalue, col1_maxvalue, col2_valuecount, .... } i.e. one entry per file.
* step2.d: Get the list of files indexed by col stats.
* step2.e: Apply the query predicate over the matrix from step 2.c and get the
list of pruned file names (i.e. files whose stats match the predicate).
* step3: If any files are missing from col stats (step 1 output - step 2.d
output), add them back to the step 2.e output to get the final list. In other
words, pruned files + missingToIndexFiles is the final set of candidate files
we return from this step.
** Let's name the output of step 3 *candidate files*.
* step4: For every file slice from step 1: if every file in the file slice
is missing from the candidate files, we can ignore the file slice (in other
words, none of the files in this file slice matched the predicate from col
stats, so it is safe to skip the entire file slice). If even one file is
present in the candidate files, we need to include the file slice in its
entirety.
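To make the flow above concrete, here is a rough sketch in Python. This is not Hudi code: the function name, the dict/tuple shapes and the file names are all made up for illustration, and the col stats lookup is replaced by a plain in-memory list.

```python
from collections import defaultdict


def prune_file_slices(file_slices, col_stats_entries, predicate):
    """Illustrative only (hypothetical shapes, not the Hudi API).

    file_slices:       {slice_id: [file_name, ...]} latest slices of the pruned partitions
    col_stats_entries: iterable of (file_name, column, stats) as read from col stats
    predicate:         callable({column: stats}) -> True if the file may match the query
    """
    latest_files = {f for files in file_slices.values() for f in files}

    # Keep only entries that belong to the latest files and
    # build one row per file: file -> {column: stats}.
    per_file = defaultdict(dict)
    for file_name, column, stats in col_stats_entries:
        if file_name in latest_files:
            per_file[file_name][column] = stats

    indexed_files = set(per_file)  # files that col stats knows about
    matched = {f for f, row in per_file.items() if predicate(row)}

    # Files with no col stats entry are added back unconditionally.
    missing_to_index = latest_files - indexed_files
    candidate_files = matched | missing_to_index

    # A file slice survives if at least one of its files is a candidate.
    return [sid for sid, files in file_slices.items()
            if any(f in candidate_files for f in files)]


# Hypothetical example: query predicate is "ts > 100".
slices = {
    "fs1": ["base1.parquet", "fs1.log.1"],
    "fs2": ["base2.parquet", "fs2.log.1"],  # fs2.log.1 has no col stats entry
}
entries = [
    ("base1.parquet", "ts", {"min": 1, "max": 50}),
    ("fs1.log.1", "ts", {"min": 60, "max": 80}),
    ("base2.parquet", "ts", {"min": 1, "max": 50}),
]
kept = prune_file_slices(slices, entries, lambda row: row["ts"]["max"] > 100)
```

In this example fs1 is pruned because both of its files are indexed and neither matches, while fs2 is kept only because one of its log files is missing from col stats.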
Why do we need to re-add the files that are missing from col stats (step
3)? We know of 2 cases in which this can legitimately happen: log files from
failed commits, and rollback blocks. We could ignore these files and
prune based only on the rest of the files in the file slice. For eg, if a
file slice has a base file and 5 log files (one of which is a rollback block),
and the base file and the other 4 log files did not match the predicate, we
should skip the file slice. But as of now, we can't skip it.

In summary, as per the current logic, if a file slice has any of these
(rollback block, delete block, or data blocks from a failed commit), it can
never be filtered out w/ col stats based pruning. We should definitely revisit
this and fix it as much as possible.

For delete blocks also, I am wondering whether we can do the same, i.e. on
the write path, skip adding the entries to col stats, and then while
pruning consider only files w/ valid stats to prune a file slice. For eg, we
have a base file and 3 log files, one of which is a delete block.
We do stats based pruning for the base file and 3 log files. If none of them
match, should we filter out the entire file slice? Or do we give the benefit of
the doubt and include it (which is what we do as of today)?
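A rough Python sketch of the alternative being proposed, i.e. deciding per file slice using only the files that have valid stats. Again this is not Hudi code; the function name and data shapes are hypothetical. It assumes that un-indexed files (rollback blocks, data blocks from failed commits) can safely be ignored, which is exactly the open question for delete blocks. Keeping a slice that has no stats at all is my assumption, not something settled above.

```python
def prune_slices_using_indexed_files_only(file_slices, per_file_stats, predicate):
    """Illustrative only (hypothetical shapes, not the Hudi API).

    file_slices:    {slice_id: [file_name, ...]}
    per_file_stats: {file_name: {column: stats}} for files present in col stats
    predicate:      callable({column: stats}) -> True if the file may match the query
    """
    kept = []
    for slice_id, files in file_slices.items():
        indexed = [f for f in files if f in per_file_stats]
        if not indexed:
            # Assumption: with no stats at all we cannot prune, so keep the slice.
            kept.append(slice_id)
        elif any(predicate(per_file_stats[f]) for f in indexed):
            # At least one indexed file may match, keep the slice in its entirety.
            kept.append(slice_id)
        # Otherwise every indexed file failed the predicate: skip the slice,
        # ignoring files that were never indexed (e.g. a rollback block).
    return kept


# Hypothetical example: base file + 5 log files, log.5 is a rollback block
# with no col stats entry. Query predicate is "ts > 100".
slices = {
    "fs1": ["base1.parquet", "log.1", "log.2", "log.3", "log.4", "log.5"],
    "fs2": ["base2.parquet"],
    "fs3": ["log.only"],  # nothing indexed for this slice
}
stats = {f: {"ts": {"min": 1, "max": 50}}
         for f in ["base1.parquet", "log.1", "log.2", "log.3", "log.4"]}
stats["base2.parquet"] = {"ts": {"min": 150, "max": 200}}

kept = prune_slices_using_indexed_files_only(
    slices, stats, lambda row: row["ts"]["max"] > 100)
```

Here fs1 gets skipped even though log.5 has no stats, which the current logic cannot do; fs2 is kept because its base file matches, and fs3 is kept because there is nothing to prune on.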