Tan-JiaLiang commented on PR #6028: URL: https://github.com/apache/paimon/pull/6028#issuecomment-3155097110
@Zouxxyy Thank you for raising this question. Others are also welcome to ask questions at any time. I would like to answer it based on my understanding and discuss the validity of this PR.

1. In the current design, each FileIndex belongs to exactly one DataFile, so a topk/bottomk evaluated through the FileIndex applies to a single DataFile.
2. Like the deletion vector, the range-bitmap file index also marks row positions, so there is no false-positive problem.
3. As with the WHERE condition, we should not fully push down the TopN predicate, because the file index may not exist at runtime even if the index has been set in the properties. If the FileIndex exists, a single DataFile returns at most N records; if not, the entire DataFile is read.
4. After reading all the DataFiles, the engine performs global TopN filtering and returns the results.

In short, in my opinion, the TopN filter can be divided into a Local TopN Filter and a Global TopN Filter, and the FileIndex makes the Local TopN Filter work better, because we can push down the FileIndex result to filter the Parquet file's row groups and pages.
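To make the Local/Global split concrete, here is a minimal Python sketch. It is not Paimon code: `index_top_positions`, `local_top_n`, `global_top_n`, and the `has_index` flag are hypothetical names used only for illustration, and each DataFile is modeled as a plain list of integers in the sort column.

```python
import heapq
from typing import List, Sequence, Tuple


def index_top_positions(values: Sequence[int], n: int) -> List[int]:
    """Hypothetical stand-in for a per-file index lookup.

    Returns the exact row positions of the n largest values, so the
    result contains no false positives.
    """
    return heapq.nlargest(n, range(len(values)), key=lambda pos: values[pos])


def local_top_n(values: Sequence[int], n: int, has_index: bool) -> List[int]:
    """Local TopN filter for a single data file."""
    if not has_index:
        # The index is missing at runtime: fall back to reading the whole file.
        return list(values)
    # The index exists: read only the marked rows, at most n of them.
    return [values[pos] for pos in sorted(index_top_positions(values, n))]


def global_top_n(files: Sequence[Tuple[Sequence[int], bool]], n: int) -> List[int]:
    """Global TopN filter, applied by the engine over all local results."""
    candidates: List[int] = []
    for values, has_index in files:
        candidates.extend(local_top_n(values, n, has_index))
    return heapq.nlargest(n, candidates)


# Three data files; the second one has no index file at runtime.
files = [
    ([5, 42, 7, 19], True),
    ([100, 3, 8], False),
    ([41, 40, 1, 2, 39], True),
]
result = global_top_n(files, 2)
print(result)  # [100, 42]
```

The global step is still required because every file contributes its own candidates, and a file without an index contributes all of its rows. That is why the TopN predicate is only partially pushed down.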