Larborator opened a new issue, #63854:
URL: https://github.com/apache/doris/issues/63854

   ### Search before asking
   
   - [x] I had searched in the 
[issues](https://github.com/apache/doris/issues?q=is%3Aissue) and found no 
similar issues.
   
   
   ### Version
   
   4.x
   
   ### What's Wrong?
   
     `OlapScanNode.distributionPrune` returns the entire 
`nereidsPrunedTabletIds` set whenever the query goes through Nereids. The 
caller `OlapScanNode.computeTabletInfo` invokes it inside a per-partition loop:
   
     ```java
     // fe-core/src/main/java/org/apache/doris/planner/OlapScanNode.java
     for (Long partitionId : selectedPartitionIds) {
         final MaterializedIndex selectedTable = 
partition.getIndex(selectedIndexId);
         Collection<Long> prunedTabletIds = distributionPrune(..., true);
         //   ↑ returns the GLOBAL nereidsPrunedTabletIds set, copied into a 
new ArrayList,
         //     not just tablets belonging to this partition
   
         if (prunedTabletIds != null) {
             for (Long id : prunedTabletIds) {
                 if (selectedTable.getTablet(id) != null) {       // hash lookup
                     tablets.add(selectedTable.getTablet(id));    // hash 
lookup again
                     scanTabletIds.add(id);
                 }
                 // else: id belongs to another partition; result is null and 
discarded
             }
         }
     }
     ```
   
     Because `nereidsPrunedTabletIds` is the union of **all** selected tablets 
across **all** partitions (set in 
`PhysicalPlanTranslator.visitPhysicalOlapScan`):
   
     ```java
     olapScanNode.setNereidsPrunedTabletIds(
         new LinkedHashSet<>(olapScan.getSelectedTabletIds()));
     ```
   
     each per-partition iteration walks the entire global set and does a 
`MaterializedIndex.getTablet(id)` HashMap lookup on ids that belong to other 
partitions. The vast majority of these lookups return `null` and are discarded, 
but the hash work is already paid.
   
   
     #### Complexity
   
     Let `P = selectedPartitionIds.size()` and `M = 
nereidsPrunedTabletIds.size()`.
   
     - `getTablet` calls: **O(P × M × 2)** instead of O(M × 2)
     - HashSet → ArrayList copies inside `distributionPrune`: **O(P × M)** 
elements
   
     The regression is most pronounced on tables whose distribution layout is 
**many partitions × few buckets per partition** (e.g. fine-grained hourly 
partitioning with 2 buckets per partition), because then `M` ≈ `P × 
bucketsPerPartition`, and the total `getTablet` calls grow as **O(P² × 
bucketsPerPartition)**.
   
     #### Concrete example
   
     A range-partitioned table with **hourly partitions** and **2 buckets per 
partition**, queried with a one-year time range and a bucket-key equality 
predicate that prunes one of the two buckets per partition:
   
     - `P` ≈ 24 × 365 ≈ 8760 partitions
     - `M` ≈ 8760 (one tablet per partition selected after bucket pruning)
     - Total `getTablet` lookups: ≈ 76 million
     - HashSet copies: ≈ 76M element copies
   
     CPU profiling of the FE plan thread:
   
     ```
     OlapScanNode.computeTabletInfo                      83%
       ├─ MaterializedIndex.getTablet -> HashMap.getNode 54%
       ├─ ArrayList.<init> -> HashSet.toArray            17%
       └─ OlapScanNode.addScanRangeLocations              2%
     ```
   
     In practice this means a query that planned in ~0.1 s before #53403 now 
spends ~2.7 s in `Nereids Translate Time` for the same SQL on the same data.
   
   
   ### What You Expected?
   
     `distributionPrune` under the Nereids path should only return tablets 
belonging to the partition currently being processed, matching the semantics of 
the non-Nereids `HashDistributionPruner` path. Plan time should not depend 
quadratically on partition count.
   
   ### How to Reproduce?
   
     1. Create a range-partitioned table with **many partitions** and **few 
buckets per partition**, e.g.:
   
        ```sql
        CREATE TABLE t (
            event_time DATETIME,
            user_id    BIGINT,
            v          BIGINT
        )
        DUPLICATE KEY(event_time, user_id)
        PARTITION BY RANGE(event_time) (...)              -- hourly, ~8000 
partitions
        DISTRIBUTED BY HASH(user_id) BUCKETS 2;
        ```
   
     2. Run a query that selects most partitions plus a bucket-key predicate 
that prunes most of the buckets per partition:
   
        ```sql
        SET enable_profile = true;
        EXPLAIN SELECT * FROM t
          WHERE event_time >= '2025-01-01' AND event_time < '2026-01-01'
            AND user_id = 12345;
        ```
   
     3. `SHOW QUERY PROFILE "/<query_id>"` and look at `Nereids Translate 
Time`. Compare against a build prior to #53403, or against an equivalent table 
whose partition count is small.
   
     4. Optional — attach an async-profiler CPU profile to the FE process 
during a stress run; the hottest stack will be `MaterializedIndex.getTablet -> 
HashMap.getNode` under `OlapScanNode.computeTabletInfo`.
   
   
   ### Anything Else?
   
   _No response_
   
   ### Are you willing to submit PR?
   
   - [x] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [x] I agree to follow this project's [Code of 
Conduct](https://www.apache.org/foundation/policies/conduct)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to