github-actions[bot] commented on code in PR #61713: URL: https://github.com/apache/doris/pull/61713#discussion_r3008659839
########## regression-test/suites/query_p0/limit/test_general_limit_pushdown.groovy: ##########

```diff
@@ -0,0 +1,135 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements. See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership. The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License. You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied. See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+// Test general limit pushdown to storage layer for DUP_KEYS and UNIQUE_KEYS (MOW).
+// This exercises the non-topn limit path where VCollectIterator enforces
+// general_read_limit with filter_block_conjuncts applied before counting.
+
+suite("test_general_limit_pushdown") {
```

Review Comment:
**[Test Coverage Gaps]** The tests are solid for basic scenarios but could be improved:

1. **Missing negative test for AGG_KEYS:** Add a test with an `AGG_KEYS` table to verify the optimization is NOT applied (correctness under aggregation).
2. **Missing test for MOW with deletes:** Insert rows, DELETE some, then run a LIMIT query. This verifies the `__DORIS_DELETE_SIGN__` predicate (which is in `_conjuncts`) is correctly handled after being moved to `filter_block_conjuncts`.
3. **Missing test with multiple buckets/tablets:** Current tests use `BUCKETS 1`. With multiple buckets, multiple scanners are involved, exercising the per-scanner limit vs. global limit coordination.
4. **Missing MOR (merge-on-read) UNIQUE_KEYS test:** Add a test with `enable_unique_key_merge_on_write = false` to verify the optimization is NOT applied for MOR tables.
5. **Missing LIMIT + OFFSET test:** `SELECT ... LIMIT 10 OFFSET 5` -- verify the offset interacts correctly with the pushed-down limit.

########## be/src/storage/iterator/vcollect_iterator.cpp: ##########

```diff
@@ -248,8 +255,39 @@ Status VCollectIterator::next(Block* block) {
         return _topn_next(block);
     }
+    // Fast path: if general limit already reached, return EOF immediately
+    if (_general_read_limit > 0 && _general_rows_returned >= _general_read_limit) {
+        return Status::Error<END_OF_FILE>("");
+    }
+
     if (LIKELY(_inner_iter)) {
-        return _inner_iter->next(block);
+        auto st = _inner_iter->next(block);
+        if (UNLIKELY(!st.ok())) {
+            return st;
+        }
+
+        // Apply filter_block_conjuncts that were moved from Scanner::_conjuncts.
+        // This must happen BEFORE limit counting so that _general_rows_returned
+        // reflects post-filter rows (same pattern as _topn_next).
+        if (!_reader->_reader_context.filter_block_conjuncts.empty()) {
+            RETURN_IF_ERROR(VExprContext::filter_block(
```

Review Comment:
**[Minor]** The `filter_block_conjuncts` application at line 272 is not gated by `_general_read_limit > 0`, so it runs on **every non-topn read path** whenever `filter_block_conjuncts` is non-empty. Currently this is fine because `filter_block_conjuncts` is only populated when the general-limit or topn branches are active (and topn takes a separate code path via `_topn_next`). However, this is subtly fragile: if any future code populates `filter_block_conjuncts` without also setting `general_read_limit` or topn, the filter would unexpectedly apply here without the scanner-level `_conjuncts` having been cleared (since the scanner clearing is coupled to the limit-setting code). Consider either:

1. Gating both the filter application and the limit enforcement under `_general_read_limit > 0`, or
2. Adding a comment noting that the unconditional application is intentional.
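Option 1 from the comment above (gating both the filter pass and the limit accounting under the same check) can be sketched with a minimal, self-contained model. `Block`, `Conjunct`, and `GeneralLimitReader` here are illustrative stand-ins for the real `vectorized::Block`, `VExprContext` conjuncts, and `VCollectIterator` state -- this is not Doris code.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <vector>

// Hypothetical stand-ins for the Doris types discussed in the review.
struct Block {
    std::vector<int> rows;  // one element per surviving row
};

using Conjunct = std::function<bool(int)>;  // row-level predicate

struct GeneralLimitReader {
    int64_t general_read_limit = 0;   // 0 means "no general limit pushdown"
    int64_t general_rows_returned = 0;
    std::vector<Conjunct> filter_block_conjuncts;

    // Returns false when the limit budget is exhausted (EOF analogue).
    bool next(Block* block) {
        if (general_read_limit > 0 &&
            general_rows_returned >= general_read_limit) {
            return false;  // fast path: budget already spent
        }
        // Reviewer option 1: BOTH the filter application and the limit
        // accounting sit behind the same `general_read_limit > 0` gate, so
        // a future caller that populates filter_block_conjuncts without
        // setting the limit cannot trigger an unexpected filtering pass.
        if (general_read_limit > 0) {
            // Filter first so the budget counts post-filter rows.
            std::vector<int> kept;
            for (int row : block->rows) {
                bool pass = true;
                for (const auto& conj : filter_block_conjuncts) {
                    if (!conj(row)) { pass = false; break; }
                }
                if (pass) kept.push_back(row);
            }
            block->rows = std::move(kept);
            // Truncate to the remaining budget, then account for it.
            int64_t remaining = general_read_limit - general_rows_returned;
            if (static_cast<int64_t>(block->rows.size()) > remaining) {
                block->rows.resize(static_cast<size_t>(remaining));
            }
            general_rows_returned += static_cast<int64_t>(block->rows.size());
        }
        return true;
    }
};
```

With this shape, a block whose conjuncts are populated but whose limit is unset passes through untouched, which is exactly the invariant the comment wants made explicit.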
########## be/src/exec/scan/olap_scanner.cpp: ##########

```diff
@@ -491,20 +491,34 @@ Status OlapScanner::_init_tablet_reader_params(
         _tablet_reader_params.enable_mor_value_predicate_pushdown = true;
     }
-    // order by table keys optimization for topn
-    // will only read head/tail of data file since it's already sorted by keys
-    if (olap_scan_node.__isset.sort_info && !olap_scan_node.sort_info.is_asc_order.empty()) {
-        _limit = _local_state->limit_per_scanner();
-        _tablet_reader_params.read_orderby_key = true;
-        if (!olap_scan_node.sort_info.is_asc_order[0]) {
-            _tablet_reader_params.read_orderby_key_reverse = true;
-        }
-        _tablet_reader_params.read_orderby_key_num_prefix_columns =
-                olap_scan_node.sort_info.is_asc_order.size();
-        _tablet_reader_params.read_orderby_key_limit = _limit;
-
-        if (_tablet_reader_params.read_orderby_key_limit > 0 &&
-            olap_scan_local_state->_storage_no_merge()) {
+    // Skip topn / general-limit storage-layer optimizations when runtime
+    // filters exist. Late-arriving filters would re-populate _conjuncts
+    // at the scanner level while the storage layer has already committed
+    // to a row budget counted before those filters, causing the scan to
+    // return fewer rows than the limit requires.
+    if (_total_rf_num == 0) {
+        // order by table keys optimization for topn
+        // will only read head/tail of data file since it's already sorted by keys
+        if (olap_scan_node.__isset.sort_info &&
+            !olap_scan_node.sort_info.is_asc_order.empty()) {
+            _limit = _local_state->limit_per_scanner();
+            _tablet_reader_params.read_orderby_key = true;
+            if (!olap_scan_node.sort_info.is_asc_order[0]) {
+                _tablet_reader_params.read_orderby_key_reverse = true;
+            }
+            _tablet_reader_params.read_orderby_key_num_prefix_columns =
+                    olap_scan_node.sort_info.is_asc_order.size();
+            _tablet_reader_params.read_orderby_key_limit = _limit;
+
+            if (_tablet_reader_params.read_orderby_key_limit > 0 &&
+                olap_scan_local_state->_storage_no_merge()) {
+                _tablet_reader_params.filter_block_conjuncts = _conjuncts;
+                _conjuncts.clear();
+            }
+        } else if (_limit > 0 && olap_scan_local_state->_storage_no_merge()) {
+            // General limit pushdown for DUP_KEYS and UNIQUE_KEYS with MOW
```

Review Comment:
**[Suggestion]** `_limit` here is the full query limit (`TPlanNode.limit`), not a per-scanner limit. With N scanners, each scanner will read up to `_limit` rows from storage, so the system reads up to `N * _limit` rows total before the `_shared_scan_limit` coordinator cuts them off. This is safe for correctness (the `_shared_scan_limit` atomic counter at the scheduler level ensures we never return more than `_limit` rows globally), but it is suboptimal for efficiency. For example, with `LIMIT 10` and 100 scanners, each scanner reads up to 10 rows, i.e. up to 1000 rows total, when only 10 are needed. Consider adding a comment noting this tradeoff, or consider using a smaller per-scanner budget (e.g., `_limit / num_scanners + 1`) if feasible.
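The over-read arithmetic in the suggestion above can be made concrete with a small sketch. The helper names are hypothetical, not part of the Doris codebase, and the smaller budget carries the same caveat as the review ("if feasible"): it is only safe when no single scanner must supply far more than its share of the limit.

```cpp
#include <cassert>
#include <cstdint>

// Upper bound on rows read from storage when every scanner is handed the
// full query limit: each of the N scanners may read up to `limit` rows
// before the shared global counter stops them.
int64_t worst_case_rows_read(int64_t limit, int64_t num_scanners) {
    return limit * num_scanners;
}

// The suggested smaller per-scanner budget from the review comment.
// Hypothetical helper: illustrates the arithmetic only, not a drop-in fix.
int64_t per_scanner_budget(int64_t limit, int64_t num_scanners) {
    return limit / num_scanners + 1;
}
```

For `LIMIT 10` with 100 scanners, the current scheme bounds the storage read at 1000 rows, while a `limit / N + 1` budget bounds it at 100 rows (1 per scanner).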
########## be/src/exec/scan/olap_scanner.cpp: ##########

```diff
@@ -491,20 +491,34 @@ Status OlapScanner::_init_tablet_reader_params(
         _tablet_reader_params.enable_mor_value_predicate_pushdown = true;
     }
-    // order by table keys optimization for topn
-    // will only read head/tail of data file since it's already sorted by keys
-    if (olap_scan_node.__isset.sort_info && !olap_scan_node.sort_info.is_asc_order.empty()) {
-        _limit = _local_state->limit_per_scanner();
-        _tablet_reader_params.read_orderby_key = true;
-        if (!olap_scan_node.sort_info.is_asc_order[0]) {
-            _tablet_reader_params.read_orderby_key_reverse = true;
-        }
-        _tablet_reader_params.read_orderby_key_num_prefix_columns =
-                olap_scan_node.sort_info.is_asc_order.size();
-        _tablet_reader_params.read_orderby_key_limit = _limit;
-
-        if (_tablet_reader_params.read_orderby_key_limit > 0 &&
-            olap_scan_local_state->_storage_no_merge()) {
+    // Skip topn / general-limit storage-layer optimizations when runtime
+    // filters exist. Late-arriving filters would re-populate _conjuncts
+    // at the scanner level while the storage layer has already committed
+    // to a row budget counted before those filters, causing the scan to
+    // return fewer rows than the limit requires.
+    if (_total_rf_num == 0) {
+        // order by table keys optimization for topn
```

Review Comment:
**[Behavioral Change]** This `_total_rf_num == 0` guard now also wraps the **existing topn** optimization, which previously ran unconditionally. Any query with runtime filters will therefore no longer get topn pushdown either -- a significant behavioral regression for topn queries in the presence of runtime filters (e.g., a hash join build side producing filters for the scan side). The original topn code worked correctly with runtime filters because the topn path uses `limit_per_scanner()` (from FE's `sort_limit`) rather than a filter-dependent row budget. Consider either:

1. Keeping the `_total_rf_num == 0` guard only on the new general-limit path (not the topn path), or
2. Providing evidence that topn + runtime filters is actually unsafe (which I could not confirm from the code).
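The first alternative (guarding only the new general-limit path while leaving topn pushdown unconditional) can be sketched with stub types. This is an illustrative model of the control flow, not the actual `OlapScanner::_init_tablet_reader_params`; the struct and function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Stub for the subset of reader parameters the two branches set.
struct ReaderParams {
    bool read_orderby_key = false;
    int64_t read_orderby_key_limit = 0;
    int64_t general_read_limit = 0;  // 0 = general limit pushdown disabled
};

// Restructured per reviewer option 1: the runtime-filter guard covers only
// the new general-limit branch; topn keeps its original placement.
ReaderParams choose_pushdown(bool has_sort_info, int64_t limit,
                             bool storage_no_merge, int total_rf_num) {
    ReaderParams p;
    if (has_sort_info) {
        // topn pushdown stays unconditional: its row budget comes from the
        // FE sort limit, not from a filter-dependent post-scan row count.
        p.read_orderby_key = true;
        p.read_orderby_key_limit = limit;
    } else if (total_rf_num == 0 && limit > 0 && storage_no_merge) {
        // Only the general-limit path is guarded against runtime filters,
        // since its budget counts rows before late-arriving filters apply.
        p.general_read_limit = limit;
    }
    return p;
}
```

Under this shape, a topn query with pending runtime filters still gets `read_orderby_key` pushdown, while the general-limit pushdown is disabled exactly when filters could shrink the post-filter row count.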
