Tan-JiaLiang commented on PR #6028:
URL: https://github.com/apache/paimon/pull/6028#issuecomment-3155097110

   @Zouxxyy Thank you for raising this question. Others are also welcome to ask 
questions at any time.
   
   I would like to answer this question based on my understanding and discuss 
the validity of this PR.
   
   1. In the current design, there is one DataFile per FileIndex, so a 
topk/bottomk evaluated by the FileIndex applies to a single DataFile.
   
   2. Like the deletion vector, the range-bitmap file index also marks row 
positions, so there is no false-positive problem.
   
   3. As with the WHERE condition, we should not fully push down the TopN 
predicate, because the file index may not exist at runtime even if the index 
has been set in the properties. If the FileIndex exists, a single DataFile 
returns at most N records; if not, the entire DataFile is read.
   
   4. After reading all the DataFiles, the engine performs the Global TopN 
filtering and returns the results.
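
   To make the four points above concrete, here is a minimal Python sketch of 
the flow as I understand it. It is not Paimon code: `index_top_n_positions`, 
`read_data_file` and `global_top_n` are hypothetical names, a DataFile is 
modeled as a plain list of sort-key values, and the range-bitmap index is 
simulated by computing the exact row positions directly.

```python
import heapq


def index_top_n_positions(values, n):
    # Stand-in for the range-bitmap file index (assumption): it yields the
    # exact row positions of the n largest values, with no false positives.
    return sorted(heapq.nlargest(n, range(len(values)), key=values.__getitem__))


def read_data_file(values, n, has_index):
    # Local TopN: with an index, at most n rows leave this DataFile.
    if has_index:
        return [values[p] for p in index_top_n_positions(values, n)]
    # The index may be missing at runtime, so fall back to reading the whole file.
    return list(values)


def global_top_n(files, n):
    # Global TopN: the engine still filters the union of all per-file results,
    # which is why the predicate cannot be treated as fully pushed down.
    candidates = []
    for values, has_index in files:
        candidates.extend(read_data_file(values, n, has_index))
    return heapq.nlargest(n, candidates)


# Three DataFiles; the second one has no file index at runtime.
files = [([5, 1, 9, 3], True), ([7, 8, 2], False), ([4, 10, 6], True)]
result = global_top_n(files, 2)
```

   In this sketch the second file contributes all of its rows, and the result 
is still correct only because the engine applies the global step afterwards.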
   
   In short, in my opinion, the TopN filter can be divided into a Local TopN 
Filter and a Global TopN Filter, and the FileIndex makes the Local TopN Filter 
more effective, because we can push down the FileIndex result to filter the 
Parquet file's row groups and pages.
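
   As a rough illustration of that last point, the sketch below maps the row 
positions selected by the index onto row groups, so that untouched row groups 
can be skipped. It assumes row positions are 0-based and contiguous across row 
groups; `row_groups_to_read` is a hypothetical helper, not a Paimon or Parquet 
API, and page-level pruning would follow the same idea with page boundaries.

```python
from bisect import bisect_right
from itertools import accumulate


def row_groups_to_read(row_group_sizes, selected_positions):
    # ends[i] is the exclusive end row position of row group i.
    ends = list(accumulate(row_group_sizes))
    if not ends:
        return []
    groups = set()
    for pos in selected_positions:
        # Ignore positions that fall outside the file.
        if 0 <= pos < ends[-1]:
            # The first row group whose end is strictly greater than pos holds it.
            groups.add(bisect_right(ends, pos))
    return sorted(groups)


# Three row groups of 100, 100 and 50 rows; both selected rows are in group 1.
needed = row_groups_to_read([100, 100, 50], [150, 199])
```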

