xiarixiaoyao commented on pull request #3330:
URL: https://github.com/apache/hudi/pull/3330#issuecomment-913405157
friendly ping @vinothchandar @pengzhiwei2018 @leesf . Could you please
take a look in your spare time? Thanks!
Update on this PR:
removed the optimize method for z-order; instead, this PR only introduces a new sort for
z-order/Hilbert curves that can be used by SparkSortAndSizeExecutionStrategy.
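To illustrate what a z-order sort does, here is a minimal, self-contained sketch of computing a Z-order (Morton) value by interleaving coordinate bits; the actual Hudi/Spark implementation in this PR works on column values at scale, but the sorting idea is the same. The function name `z_value` is hypothetical, not from the PR.

```python
def z_value(coords, bits=8):
    """Interleave the bits of each coordinate into a Z-order (Morton) value.

    coords: list of non-negative ints, each assumed to fit in `bits` bits.
    """
    z = 0
    for bit in range(bits):
        for i, c in enumerate(coords):
            # Place bit `bit` of coordinate i at interleaved position.
            z |= ((c >> bit) & 1) << (bit * len(coords) + i)
    return z

# Rows sorted by the z-value of their sort columns end up clustered so
# that each file covers a compact range in every dimension at once.
rows = [(3, 7), (0, 0), (7, 7), (1, 2)]
rows.sort(key=lambda r: z_value(list(r)))
```

This multi-dimensional locality is what makes the min/max statistics in the index table effective for data skipping on any of the sorted columns.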
Description of simple data skipping for z-order:
- Save statistics info to the index table
1. The index table is saved as a parquet table named by the current commitTime.
For example, if we run z-order clustering at commitTime 20210721123456,
the index table will be saved in .hoodie/.zindex/20210721123456.
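The naming convention described above can be sketched as a tiny helper; `zindex_path` is a hypothetical name for illustration, not an API in the PR.

```python
def zindex_path(base_path, commit_time):
    # The index table for a clustering commit lives under
    # <table base path>/.hoodie/.zindex/<commitTime>.
    return f"{base_path}/.hoodie/.zindex/{commit_time}"

# zindex_path("/data/tbl", "20210721123456")
# -> '/data/tbl/.hoodie/.zindex/20210721123456'
```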
2. Update
We only update the index table when the user runs z-order clustering again; there is no
need to update it on every write.
The old index table is updated with a full outer join, and the updated result is saved as
a new index table named by the commitTime. The update method is similar to a Hudi COW
table.
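The full-outer-join update above can be sketched in miniature with plain dicts, standing in for the per-file statistics rows; newer stats replace older ones for files present on both sides, and files present on only one side are kept, much like a COW upsert. The function name `update_index` is hypothetical.

```python
def update_index(old_stats, new_stats):
    """Full-outer-join style merge of per-file column statistics.

    old_stats / new_stats: dicts mapping file name -> (col_min, col_max).
    Rows from new_stats win for files present in both inputs.
    """
    merged = dict(old_stats)   # start from the old index table
    merged.update(new_stats)   # overwrite matching rows, append new ones
    return merged
```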
3. Fault tolerance
The index table name and Hudi's commit are strongly bound.
Therefore, if the clustering operation fails, the generated index table is also invalid,
and the residual files will be deleted by the index table cleanup mechanism.
4. Expired/residual index table cleanup mechanism
The clean operation is triggered when we try to save an index table: before saving
statistics info to the index table, we check whether any expired index tables need to be
cleaned up. Since index table names and Hudi commits are strongly bound:
step1: find all valid index tables from the hudi table's valid commits as
candidateTables;
step2: delete all index tables that are not contained in candidateTables;
step3: find the newest index table in candidateTables and delete the other tables; we
keep only the latest index table.
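The cleanup steps above can be sketched as follows, assuming index table directories are named by commit time so lexicographic order matches commit order; `clean_index_tables` is a hypothetical name for illustration.

```python
def clean_index_tables(index_tables, valid_commits):
    """Decide which index tables to delete and which single one to keep.

    index_tables: list of index table dir names, each named by a commitTime.
    valid_commits: set of commit times still valid on the Hudi table.
    """
    # step1: index tables named after a valid commit are candidates
    candidates = [t for t in index_tables if t in valid_commits]
    # step2: everything else is residue from failed/rolled-back clustering
    to_delete = [t for t in index_tables if t not in valid_commits]
    # step3: keep only the newest candidate, delete the older ones
    keep = max(candidates) if candidates else None
    to_delete += [t for t in candidates if t != keep]
    return to_delete, keep
```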
- Read the index table to do file skipping
step1: find all of the hudi table's valid commits;
step2: find all valid index tables from those commits as candidateTables;
step3: choose the latest index table as the final index table;
step4: convert all the filters pushed down from the query into new z-index filters and
push them against the index table to produce the candidate file set;
step5: use the candidate file set from step4 to compute the filtered file set, i.e. the
files that should be skipped. We use the filtered file set instead of the candidate file
set because the filtered files are guaranteed to contain no query data, so it is safe to
skip them;
step6: use the filtered files from step5 to do file skipping in
HoodieFileIndex.listFiles.
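The pruning logic in steps 4-5 can be sketched with min/max statistics for a single column and a BETWEEN predicate; a file is skipped only when its value range provably cannot overlap the predicate, which is what makes the skipping safe. The name `files_to_skip` and the tuple layout are assumptions for illustration, not the PR's actual API.

```python
def files_to_skip(index_rows, pred_min, pred_max):
    """Return the files that can be safely skipped for the predicate
    `col BETWEEN pred_min AND pred_max`.

    index_rows: list of (file_name, col_min, col_max) from the index table.
    """
    skipped = []
    for f, lo, hi in index_rows:
        # The file's [lo, hi] range is entirely outside the predicate
        # range, so no row in it can match the query.
        if hi < pred_min or lo > pred_max:
            skipped.append(f)
    return skipped
```

In step6, HoodieFileIndex.listFiles would then subtract the skipped files from the full file listing before the scan.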