xiarixiaoyao commented on pull request #3330:
URL: https://github.com/apache/hudi/pull/3330#issuecomment-913405157


     friendly ping @vinothchandar @pengzhiwei2018 @leesf. Could you please 
take a look in your spare time? Thanks!
   Updates to this PR:
   Removed the optimize method for z-order; instead, this PR only introduces a new sort for 
z-order/hilbert which can be used by SparkSortAndSizeExecutionStrategy.
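The new sort needs one composite key per record. As a minimal sketch (the classic bit-interleaving construction of a z-value, not necessarily this PR's exact implementation), interleaving the bits of two sort columns gives a key whose ordering preserves locality in both columns:

```java
public class ZValueSketch {
    // Interleave the low 32 bits of x and y: bit i of x goes to bit 2i,
    // bit i of y goes to bit 2i+1 of the 64-bit z-value.
    static long interleave(int x, int y) {
        long z = 0;
        for (int i = 0; i < 32; i++) {
            z |= ((long) (x >>> i & 1)) << (2 * i);
            z |= ((long) (y >>> i & 1)) << (2 * i + 1);
        }
        return z;
    }

    public static void main(String[] args) {
        System.out.println(interleave(1, 1)); // 0b11 = 3
        System.out.println(interleave(3, 0)); // bits of 3 land on even positions: 0b101 = 5
    }
}
```

Sorting records by this z-value is what lets min/max statistics per file stay tight on both columns at once.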
   
   Description of simple data skipping for z-order:
   
   - save statistics info to the index table
   1. The index table is saved as a parquet table named by the current commitTime.
   For example, if we run z-order clustering at commitTime 20210721123456,
   the index table will be saved in .hoodie/.zindex/20210721123456
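The commit-bound naming scheme above can be sketched as a small path helper (the method name here is hypothetical, only the `.hoodie/.zindex/<commitTime>` layout comes from the description):

```java
import java.nio.file.Path;
import java.nio.file.Paths;

public class ZIndexPathSketch {
    // Hypothetical helper: the index table for a clustering commit lives under
    // <basePath>/.hoodie/.zindex/<commitTime>, so the table name and the
    // Hudi commit are strongly bound.
    static Path indexTablePath(String basePath, String commitTime) {
        return Paths.get(basePath, ".hoodie", ".zindex", commitTime);
    }

    public static void main(String[] args) {
        System.out.println(indexTablePath("/data/tbl", "20210721123456"));
    }
}
```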
   
   2. update
   We only update the index table when the user runs z-order clustering again; there is no need 
to update the index table on every write.
   The old index table is updated by a full outer join, and the updated table is saved as a 
new index table named by the commitTime. This update method is much like a Hudi COW 
table.
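The full-outer-join semantics can be sketched with plain maps keyed by file name (the `FileStats` record and field names are hypothetical; only the join/override behavior comes from the description): every file present in either the old or the new index survives, and when a file appears in both, the stats from the latest clustering win, copy-on-write style.

```java
import java.util.HashMap;
import java.util.Map;

public class IndexUpdateSketch {
    // Hypothetical per-file statistics record (min/max of one indexed column).
    record FileStats(long min, long max) {}

    // Full-outer-join update: keep every file from either side; on a key
    // collision, the stats from the newest clustering replace the old ones.
    static Map<String, FileStats> merge(Map<String, FileStats> oldIdx,
                                        Map<String, FileStats> newIdx) {
        Map<String, FileStats> merged = new HashMap<>(oldIdx);
        merged.putAll(newIdx); // new stats override old ones on collision
        return merged;
    }

    public static void main(String[] args) {
        Map<String, FileStats> oldIdx = new HashMap<>();
        oldIdx.put("f1", new FileStats(0, 5));
        Map<String, FileStats> newIdx = new HashMap<>();
        newIdx.put("f1", new FileStats(2, 8));
        newIdx.put("f2", new FileStats(3, 4));
        System.out.println(merge(oldIdx, newIdx)); // f1 takes the new stats
    }
}
```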
   
   3. Fault tolerance
   The index table name and Hudi's commit are strongly bound.
   Therefore, if the clustering operation fails, the generated index table is 
also invalid,
   and the residual files will be deleted by the index table cleanup mechanism.
   
   4. Cleanup mechanism for expired index tables (residual tables)
   The clean operation is triggered when we try to save an index table:
   before saving statistics info to the index table, we check whether any expired 
index tables need to be cleaned up.
   The index table name and Hudi's commit are strongly bound.
   step1: find all valid index tables, using the Hudi table's validateCommits, as 
candidateTables;
   step2: delete all index tables not contained in candidateTables;
   step3: find the newest index table in candidateTables and delete the other 
tables; we only keep the latest index table.
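The three cleanup steps above can be sketched as pure set logic over table names (index tables are named by commit time, so commit times sort lexically; the method name is hypothetical):

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

public class IndexCleanupSketch {
    // Given the Hudi table's valid commit times and the index tables currently
    // on disk (named by commit time), return the tables to delete so that only
    // the latest valid index table survives.
    static List<String> tablesToDelete(Set<String> validCommits, List<String> indexTables) {
        // step1: valid index tables are those whose name matches a valid commit
        List<String> candidates = indexTables.stream()
                .filter(validCommits::contains)
                .sorted()
                .collect(Collectors.toList());
        // step3: keep only the newest candidate (timestamps sort lexically)
        String latest = candidates.isEmpty() ? null : candidates.get(candidates.size() - 1);
        // step2 + step3: every other table (invalid or stale) is deleted
        return indexTables.stream()
                .filter(t -> !t.equals(latest))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(tablesToDelete(
                Set.of("20210721123456", "20210722000000"),
                List.of("20210720000000", "20210721123456", "20210722000000")));
    }
}
```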
   - read the index table to do file skipping
   step1: find all the Hudi table's validateCommits;
   step2: find all valid index tables, using validateCommits, as candidateTables;
   step3: choose the latest index table as the final index table;
   step4: convert all filters pushed down from the query into new z-index filters, and 
push those filters to the index table to produce the candidate file sets;
   step5: use the candidate file sets generated in step4 to compute the filtered 
file sets that need to be skipped.
     We choose the filtered file sets instead of the candidate file sets:
     the filtered file sets cannot contain any query data, so it is safe 
to use those files for data skipping;
   step6: use the filtered files from step5 to do file skipping in 
HoodieFileIndex.listFiles.
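Steps 4 and 5 boil down to min/max pruning. A minimal sketch for a single equality predicate `col = value` (the `ColStats` record and method name are hypothetical; the real PR translates Spark filters into z-index filters over the index table): a file whose `[min, max]` range cannot contain the value provably holds no query data, so it goes into the filtered (skippable) set.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class FileSkippingSketch {
    // Hypothetical per-file min/max statistics for one indexed column.
    record ColStats(long min, long max) {}

    // step4/step5 for "col = value": files whose [min, max] range excludes
    // the value are returned as the filtered set and can be safely skipped.
    static List<String> filesToSkip(Map<String, ColStats> indexTable, long value) {
        return indexTable.entrySet().stream()
                .filter(e -> value < e.getValue().min() || value > e.getValue().max())
                .map(Map.Entry::getKey)
                .sorted()
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Map<String, ColStats> idx = Map.of(
                "f1", new ColStats(0, 10),
                "f2", new ColStats(20, 30));
        System.out.println(filesToSkip(idx, 25)); // [f1]: 25 is outside [0, 10]
    }
}
```

Note that pruning on the filtered set is conservative: a kept file may still contain no matching rows, but a skipped file can never contain one.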


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

