Larborator opened a new issue, #63854: URL: https://github.com/apache/doris/issues/63854
### Search before asking - [x] I had searched in the [issues](https://github.com/apache/doris/issues?q=is%3Aissue) and found no similar issues. ### Version 4.x ### What's Wrong? `OlapScanNode.distributionPrune` returns the entire `nereidsPrunedTabletIds` set whenever the query goes through Nereids. The caller `OlapScanNode.computeTabletInfo` invokes it inside a per-partition loop: ```java // fe-core/src/main/java/org/apache/doris/planner/OlapScanNode.java for (Long partitionId : selectedPartitionIds) { final MaterializedIndex selectedTable = partition.getIndex(selectedIndexId); Collection<Long> prunedTabletIds = distributionPrune(..., true); // ↑ returns the GLOBAL nereidsPrunedTabletIds set, copied into a new ArrayList, // not just tablets belonging to this partition if (prunedTabletIds != null) { for (Long id : prunedTabletIds) { if (selectedTable.getTablet(id) != null) { // hash lookup tablets.add(selectedTable.getTablet(id)); // hash lookup again scanTabletIds.add(id); } // else: id belongs to another partition; result is null and discarded } } } ``` Because `nereidsPrunedTabletIds` is the union of **all** selected tablets across **all** partitions (set in `PhysicalPlanTranslator.visitPhysicalOlapScan`): ```java olapScanNode.setNereidsPrunedTabletIds( new LinkedHashSet<>(olapScan.getSelectedTabletIds())); ``` each per-partition iteration walks the entire global set and does a `MaterializedIndex.getTablet(id)` HashMap lookup on ids that belong to other partitions. The vast majority of these lookups return `null` and are discarded, but the hash work is already paid. #### Complexity Let `P = selectedPartitionIds.size()` and `M = nereidsPrunedTabletIds.size()`. - `getTablet` calls: **O(P × M × 2)** instead of O(M × 2) - HashSet → ArrayList copies inside `distributionPrune`: **O(P × M)** elements The regression is most pronounced on tables whose distribution layout is **many partitions × few buckets per partition** (e.g. fine-grained hourly partitioning with 2 buckets per partition), because then `M` ≈ `P × bucketsPerPartition`, and the total `getTablet` calls grow as **O(P² × bucketsPerPartition)**. #### Concrete example A range-partitioned table with **hourly partitions** and **2 buckets per partition**, queried with a one-year time range and a bucket-key equality predicate that prunes one of the two buckets per partition: - `P` ≈ 24 × 365 ≈ 8760 partitions - `M` ≈ 8760 (one tablet per partition selected after bucket pruning) - Total `getTablet` lookups: ≈ 76 million - HashSet copies: ≈ 76M element copies CPU profiling of the FE plan thread: ``` OlapScanNode.computeTabletInfo 83% ├─ MaterializedIndex.getTablet -> HashMap.getNode 54% ├─ ArrayList.<init> -> HashSet.toArray 17% └─ OlapScanNode.addScanRangeLocations 2% ``` In practice this means a query that planned in ~0.1 s before #53403 now spends ~2.7 s in `Nereids Translate Time` for the same SQL on the same data. ### What You Expected? `distributionPrune` under the Nereids path should only return tablets belonging to the partition currently being processed, matching the semantics of the non-Nereids `HashDistributionPruner` path. Plan time should not depend quadratically on partition count. ### How to Reproduce? 1. Create a range-partitioned table with **many partitions** and **few buckets per partition**, e.g.: ```sql CREATE TABLE t ( event_time DATETIME, user_id BIGINT, v BIGINT ) DUPLICATE KEY(event_time, user_id) PARTITION BY RANGE(event_time) (...) -- hourly, ~8000 partitions DISTRIBUTED BY HASH(user_id) BUCKETS 2; ``` 2. Run a query that selects most partitions plus a bucket-key predicate that prunes most of the buckets per partition: ```sql SET enable_profile = true; EXPLAIN SELECT * FROM t WHERE event_time >= '2025-01-01' AND event_time < '2026-01-01' AND user_id = 12345; ``` 3. `SHOW QUERY PROFILE "/<query_id>"` and look at `Nereids Translate Time`. Compare against a build prior to #53403, or against an equivalent table whose partition count is small. 4. Optional — attach an async-profiler CPU profile to the FE process during a stress run; the hottest stack will be `MaterializedIndex.getTablet -> HashMap.getNode` under `OlapScanNode.computeTabletInfo`. ### Anything Else? _No response_ ### Are you willing to submit PR? - [x] Yes I am willing to submit a PR! ### Code of Conduct - [x] I agree to follow this project's [Code of Conduct](https://www.apache.org/foundation/policies/conduct) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
