github-actions[bot] commented on code in PR #61713: URL: https://github.com/apache/doris/pull/61713#discussion_r3008659839
########## regression-test/suites/query_p0/limit/test_general_limit_pushdown.groovy: ##########

```diff
@@ -0,0 +1,135 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements. See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership. The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License. You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied. See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+// Test general limit pushdown to storage layer for DUP_KEYS and UNIQUE_KEYS (MOW).
+// This exercises the non-topn limit path where VCollectIterator enforces
+// general_read_limit with filter_block_conjuncts applied before counting.
+
+suite("test_general_limit_pushdown") {
```

Review Comment:
**[Test Coverage Gaps]** The tests are solid for basic scenarios but could be improved:

1. **Missing negative test for AGG_KEYS:** Add a test with an `AGG_KEYS` table to verify the optimization is NOT applied (correctness under aggregation).
2. **Missing test for MOW with deletes:** Insert rows, DELETE some, then run a LIMIT query. This verifies the `__DORIS_DELETE_SIGN__` predicate (which is in `_conjuncts`) is correctly handled after being moved to `filter_block_conjuncts`.
3. **Missing test with multiple buckets/tablets:** Current tests use `BUCKETS 1`. With multiple buckets, multiple scanners are involved, exercising the per-scanner limit vs. global limit coordination.
4. **Missing MOR (merge-on-read) UNIQUE_KEYS test:** Add a test with `enable_unique_key_merge_on_write = false` to verify the optimization is NOT applied for MOR tables.
5. **Missing LIMIT + OFFSET test:** `SELECT ... LIMIT 10 OFFSET 5` -- verify the offset interacts correctly with the pushed-down limit.

########## be/src/storage/iterator/vcollect_iterator.cpp: ##########

```diff
@@ -248,8 +255,39 @@ Status VCollectIterator::next(Block* block) {
         return _topn_next(block);
     }
+    // Fast path: if general limit already reached, return EOF immediately
+    if (_general_read_limit > 0 && _general_rows_returned >= _general_read_limit) {
+        return Status::Error<END_OF_FILE>("");
+    }
+
     if (LIKELY(_inner_iter)) {
-        return _inner_iter->next(block);
+        auto st = _inner_iter->next(block);
+        if (UNLIKELY(!st.ok())) {
+            return st;
+        }
+
+        // Apply filter_block_conjuncts that were moved from Scanner::_conjuncts.
+        // This must happen BEFORE limit counting so that _general_rows_returned
+        // reflects post-filter rows (same pattern as _topn_next).
+        if (!_reader->_reader_context.filter_block_conjuncts.empty()) {
+            RETURN_IF_ERROR(VExprContext::filter_block(
```

Review Comment:
**[Minor]** The `filter_block_conjuncts` application at line 272 is not gated by `_general_read_limit > 0`, so it runs on **every non-topn read path** whenever `filter_block_conjuncts` is non-empty. Currently this is fine because `filter_block_conjuncts` is only populated when the general-limit or topn branches are active (and topn takes a separate code path via `_topn_next`). However, this is subtly fragile: if any future code populates `filter_block_conjuncts` without also setting `general_read_limit` or topn, the filter would unexpectedly apply here without the scanner-level `_conjuncts` having been cleared (since the scanner clearing is coupled to the limit-setting code). Consider either:

1. Gating both the filter application and the limit enforcement under `_general_read_limit > 0`, or
2. Adding a comment noting that the unconditional application is intentional.
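Option 1 from the comment above (gating both the filter pass and the limit accounting under the same check) can be sketched with a minimal, self-contained model. `Block`, `Conjunct`, and `GeneralLimitReader` here are illustrative stand-ins for the real `vectorized::Block`, `VExprContext` conjuncts, and `VCollectIterator` state -- this is not Doris code.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <vector>

// Hypothetical stand-ins for the Doris types discussed in the review.
struct Block {
    std::vector<int> rows;  // one element per surviving row
};

using Conjunct = std::function<bool(int)>;  // row-level predicate

struct GeneralLimitReader {
    int64_t general_read_limit = 0;   // 0 means "no general limit pushdown"
    int64_t general_rows_returned = 0;
    std::vector<Conjunct> filter_block_conjuncts;

    // Returns false when the limit budget is exhausted (EOF analogue).
    bool next(Block* block) {
        if (general_read_limit > 0 &&
            general_rows_returned >= general_read_limit) {
            return false;  // fast path: budget already spent
        }
        // Reviewer option 1: BOTH the filter application and the limit
        // accounting sit behind the same `general_read_limit > 0` gate, so
        // a future caller that populates filter_block_conjuncts without
        // setting the limit cannot trigger an unexpected filtering pass.
        if (general_read_limit > 0) {
            // Filter first so the budget counts post-filter rows.
            std::vector<int> kept;
            for (int row : block->rows) {
                bool pass = true;
                for (const auto& conj : filter_block_conjuncts) {
                    if (!conj(row)) { pass = false; break; }
                }
                if (pass) kept.push_back(row);
            }
            block->rows = std::move(kept);
            // Truncate to the remaining budget, then account for it.
            int64_t remaining = general_read_limit - general_rows_returned;
            if (static_cast<int64_t>(block->rows.size()) > remaining) {
                block->rows.resize(static_cast<size_t>(remaining));
            }
            general_rows_returned += static_cast<int64_t>(block->rows.size());
        }
        return true;
    }
};
```

With this shape, a block whose conjuncts are populated but whose limit is unset passes through untouched, which is exactly the invariant the comment wants made explicit.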
########## be/src/exec/scan/olap_scanner.cpp: ##########

```diff
@@ -491,20 +491,34 @@ Status OlapScanner::_init_tablet_reader_params(
         _tablet_reader_params.enable_mor_value_predicate_pushdown = true;
     }
-    // order by table keys optimization for topn
-    // will only read head/tail of data file since it's already sorted by keys
-    if (olap_scan_node.__isset.sort_info && !olap_scan_node.sort_info.is_asc_order.empty()) {
-        _limit = _local_state->limit_per_scanner();
-        _tablet_reader_params.read_orderby_key = true;
-        if (!olap_scan_node.sort_info.is_asc_order[0]) {
-            _tablet_reader_params.read_orderby_key_reverse = true;
-        }
-        _tablet_reader_params.read_orderby_key_num_prefix_columns =
-                olap_scan_node.sort_info.is_asc_order.size();
-        _tablet_reader_params.read_orderby_key_limit = _limit;
-
-        if (_tablet_reader_params.read_orderby_key_limit > 0 &&
-            olap_scan_local_state->_storage_no_merge()) {
+    // Skip topn / general-limit storage-layer optimizations when runtime
+    // filters exist. Late-arriving filters would re-populate _conjuncts
+    // at the scanner level while the storage layer has already committed
+    // to a row budget counted before those filters, causing the scan to
+    // return fewer rows than the limit requires.
+    if (_total_rf_num == 0) {
+        // order by table keys optimization for topn
+        // will only read head/tail of data file since it's already sorted by keys
+        if (olap_scan_node.__isset.sort_info &&
+            !olap_scan_node.sort_info.is_asc_order.empty()) {
+            _limit = _local_state->limit_per_scanner();
+            _tablet_reader_params.read_orderby_key = true;
+            if (!olap_scan_node.sort_info.is_asc_order[0]) {
+                _tablet_reader_params.read_orderby_key_reverse = true;
+            }
+            _tablet_reader_params.read_orderby_key_num_prefix_columns =
+                    olap_scan_node.sort_info.is_asc_order.size();
+            _tablet_reader_params.read_orderby_key_limit = _limit;
+
+            if (_tablet_reader_params.read_orderby_key_limit > 0 &&
+                olap_scan_local_state->_storage_no_merge()) {
+                _tablet_reader_params.filter_block_conjuncts = _conjuncts;
+                _conjuncts.clear();
+            }
+        } else if (_limit > 0 && olap_scan_local_state->_storage_no_merge()) {
+            // General limit pushdown for DUP_KEYS and UNIQUE_KEYS with MOW
```

Review Comment:
**[Suggestion]** `_limit` here is the full query limit (`TPlanNode.limit`), not a per-scanner limit. With N scanners, each scanner will read up to `_limit` rows from storage, so the system reads up to `N * _limit` rows total before the `_shared_scan_limit` coordinator cuts them off. This is safe for correctness (the `_shared_scan_limit` atomic counter at the scheduler level ensures we never return more than `_limit` rows globally), but it is suboptimal for efficiency. For example, with `LIMIT 10` and 100 scanners, each scanner reads up to 10 rows, i.e. up to 1000 rows total, when only 10 are needed. Consider adding a comment noting this tradeoff, or consider using a smaller per-scanner budget (e.g., `_limit / num_scanners + 1`) if feasible.
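The over-read arithmetic in the suggestion above can be made concrete with a small sketch. The helper names are hypothetical, not part of the Doris codebase, and the smaller budget carries the same caveat as the review ("if feasible"): it is only safe when no single scanner must supply far more than its share of the limit.

```cpp
#include <cassert>
#include <cstdint>

// Upper bound on rows read from storage when every scanner is handed the
// full query limit: each of the N scanners may read up to `limit` rows
// before the shared global counter stops them.
int64_t worst_case_rows_read(int64_t limit, int64_t num_scanners) {
    return limit * num_scanners;
}

// The suggested smaller per-scanner budget from the review comment.
// Hypothetical helper: illustrates the arithmetic only, not a drop-in fix.
int64_t per_scanner_budget(int64_t limit, int64_t num_scanners) {
    return limit / num_scanners + 1;
}
```

For `LIMIT 10` with 100 scanners, the current scheme bounds the storage read at 1000 rows, while a `limit / N + 1` budget bounds it at 100 rows (1 per scanner).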
########## be/src/exec/scan/olap_scanner.cpp: ##########

```diff
@@ -491,20 +491,34 @@ Status OlapScanner::_init_tablet_reader_params(
         _tablet_reader_params.enable_mor_value_predicate_pushdown = true;
     }
-    // order by table keys optimization for topn
-    // will only read head/tail of data file since it's already sorted by keys
-    if (olap_scan_node.__isset.sort_info && !olap_scan_node.sort_info.is_asc_order.empty()) {
-        _limit = _local_state->limit_per_scanner();
-        _tablet_reader_params.read_orderby_key = true;
-        if (!olap_scan_node.sort_info.is_asc_order[0]) {
-            _tablet_reader_params.read_orderby_key_reverse = true;
-        }
-        _tablet_reader_params.read_orderby_key_num_prefix_columns =
-                olap_scan_node.sort_info.is_asc_order.size();
-        _tablet_reader_params.read_orderby_key_limit = _limit;
-
-        if (_tablet_reader_params.read_orderby_key_limit > 0 &&
-            olap_scan_local_state->_storage_no_merge()) {
+    // Skip topn / general-limit storage-layer optimizations when runtime
+    // filters exist. Late-arriving filters would re-populate _conjuncts
+    // at the scanner level while the storage layer has already committed
+    // to a row budget counted before those filters, causing the scan to
+    // return fewer rows than the limit requires.
+    if (_total_rf_num == 0) {
+        // order by table keys optimization for topn
```

Review Comment:
**[Behavioral Change]** This `_total_rf_num == 0` guard now also wraps the **existing topn** optimization, which previously ran unconditionally. Any query with runtime filters will therefore no longer get topn pushdown either -- a significant behavioral regression for topn queries in the presence of runtime filters (e.g., a hash join build side producing filters for the scan side). The original topn code worked correctly with runtime filters because the topn path uses `limit_per_scanner()` (from FE's `sort_limit`) rather than a filter-dependent row budget. Consider either:

1. Keeping the `_total_rf_num == 0` guard only on the new general-limit path (not the topn path), or
2. Providing evidence that topn + runtime filters is actually unsafe (which I could not confirm from the code).
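The first alternative (guarding only the new general-limit path while leaving topn pushdown unconditional) can be sketched with stub types. This is an illustrative model of the control flow, not the actual `OlapScanner::_init_tablet_reader_params`; the struct and function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Stub for the subset of reader parameters the two branches set.
struct ReaderParams {
    bool read_orderby_key = false;
    int64_t read_orderby_key_limit = 0;
    int64_t general_read_limit = 0;  // 0 = general limit pushdown disabled
};

// Restructured per reviewer option 1: the runtime-filter guard covers only
// the new general-limit branch; topn keeps its original placement.
ReaderParams choose_pushdown(bool has_sort_info, int64_t limit,
                             bool storage_no_merge, int total_rf_num) {
    ReaderParams p;
    if (has_sort_info) {
        // topn pushdown stays unconditional: its row budget comes from the
        // FE sort limit, not from a filter-dependent post-scan row count.
        p.read_orderby_key = true;
        p.read_orderby_key_limit = limit;
    } else if (total_rf_num == 0 && limit > 0 && storage_no_merge) {
        // Only the general-limit path is guarded against runtime filters,
        // since its budget counts rows before late-arriving filters apply.
        p.general_read_limit = limit;
    }
    return p;
}
```

Under this shape, a topn query with pending runtime filters still gets `read_orderby_key` pushdown, while the general-limit pushdown is disabled exactly when filters could shrink the post-filter row count.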
