nooneuse opened a new pull request, #64891:
URL: https://github.com/apache/doris/pull/64891
### What problem does this PR solve?
Issue Number: close #xxx
Related PR: #xxx
Problem Summary:
- This PR introduces **Multi-Stage Predicate Lazy Materialization
(multi-stage predicate LM)** in the BE storage scan path (`SegmentIterator`) to
reduce unnecessary predicate-column reads and predicate evaluation cost.
- In many OLAP queries with multiple conjuncts, evaluating all predicates
requires reading multiple predicate columns up-front. When some predicate
columns are expensive to read/decode, this can cause avoidable I/O and CPU
overhead even if many rows could be filtered out early.
- Multi-stage predicate LM splits predicate evaluation into two phases:
- **Stage1** evaluates a selected subset of predicate columns first for
coarse filtering.
- **Stage2** evaluates the remaining predicate columns only on the
surviving rows, using either:
- **by-rowids** (selective Stage1 → read late predicate columns by
rowids to reduce rows read), or
- **by-all-rows** (high survival ratio → scan all rows for late
predicate columns to avoid excessive random reads).
- The feature is **manually controlled via session variables / per-statement
overrides** (hint-like behavior), enabling users to experiment and validate
benefits with profile metrics before any future cost-based automation.
### Release note
- Added **multi-stage predicate lazy materialization** in BE
`SegmentIterator`, splitting predicate evaluation into Stage1/Stage2 and
exposing profile counters (e.g. `PredicateLMStage1InputRows`,
`PredicateLMStage2ByRowIdsBatches`, `PredicateLMStage2ByAllRowsBatches`).
- Added/exposed session variables:
- `enable_multi_stage_predicate_lm` (bool): enable/disable multi-stage
predicate LM.
- `predicate_lm_stage1_cols` (string): optionally specify Stage1 predicate
columns.
- `predicate_lm_stage1_survival_ratio_threshold` (double): threshold to
choose Stage2 strategy (by-rowids vs by-all-rows).
- Enhanced `predicate_lm_stage1_cols` parsing and scoping:
- Tolerates whitespace/backticks/duplicates.
- Supports scoped identifiers `col` / `table.col` / `db.table.col` to
target a specific scan in multi-table queries.
- Uses scan context (db/table name; rollup suffix handling) for accurate
scoped matching.
- Improved robustness and safer defaults:
- Unknown columns or non-matching scoped tokens in
`predicate_lm_stage1_cols` are **ignored** instead of failing the query,
avoiding multi-table query failures caused by schema differences.
- When there is **no runtime filter** and the effective Stage1 column list
is empty, the scan **falls back to single-stage predicate evaluation**
(equivalent to multi-stage disabled for that scan) to avoid unpredictable “pick
an arbitrary predicate column as Stage1” behavior and potential regressions.
- Extended scan-node thrift payload to carry `db_name` for accurate
`db.table.col` matching without changing `table_name` formatting; FE populates
`db_name` and BE consumes it.
- Updated BE unit tests, regression tests, and documentation accordingly.
### Detailed Description
# Multi-Stage Predicate Lazy Materialization
## 1. Overview (Summary)
Multi-Stage Predicate Lazy Materialization (multi-stage predicate LM) is a
storage-layer scan optimization.
It splits “predicate column reading + predicate evaluation” into **two
stages (Stage1/Stage2)**. By “reading only a small subset of predicate columns
first for coarse filtering, then evaluating the remaining predicates on the
surviving rows”, it reduces unnecessary column reads and computation overhead.
At the current stage, this feature is **manually configured**. It is
intended to behave like a hint to influence execution behavior. In the future
PR, the FE will leverage statistics to automatically choose suitable columns.
Key goals:
- Reduce total predicate-column I/O (especially when some columns are
expensive to read/deserialize, or there are many predicate columns)
- Reduce predicate evaluation cost (vectorized evaluation / short-circuit
evaluation)
- Reduce I/O in cases where predicates are selective
### Behavior at a Glance
- **Stage1**: Read the configured stage1 predicate columns (plus any
required delete-condition columns, runtime filters, etc.) and perform the first
round of filtering.
- **Stage2**: Perform the second round of filtering using the “remaining
predicate columns” that are not included in Stage1 (late predicate columns).
Stage2 has two strategies:
- **by-rowids**: If the survival ratio after Stage1 is low, read Stage2
predicate columns only for the rowids produced by Stage1 (random/rowid-based
reads) to minimize the number of rows read.
- **by-all-rows**: If the survival ratio after Stage1 is high, read Stage2
predicate columns for all rows (sequential/index-order reads) to avoid the seek
overhead caused by many random rowid reads.
---
## 2. Usage and Configuration (with SQL Cases)
### 2.1 Prerequisites
- Note: This feature works on the **BE storage scan path** (SegmentIterator)
and is typically controlled by FE session variables forwarded to BE.
- For stronger validation or observability, you can enable profiling:
- `set enable_profile=true;`
- Use `EXPLAIN` / `PROFILE` / `show profile` to inspect metrics.
### 2.2 Configuration Options
#### 2.2.1 `enable_multi_stage_predicate_lm`
- Type: bool
- Default: false
- Purpose: Enable/disable multi-stage predicate LM.
- Example:
- `set enable_multi_stage_predicate_lm = true;`
#### 2.2.2 `predicate_lm_stage1_cols`
- Type: string
- Default: empty string (meaning: Stage1 columns are not explicitly
specified)
- Purpose: Specify which columns’ predicates should be evaluated in Stage1
(coarse filtering).
- Notes:
- The value is a comma-separated column list string, e.g.:
- `a`
- `a,b`
- ` a ,``b``, a ` (whitespace/backticks/duplicates are allowed)
- It also supports **scoping to a specific table / db.table** to precisely
select which scan operator should apply the stage1 column configuration in
multi-table queries:
- `table.col`
- `db.table.col`
- Important:
- Here `table` / `db.table` refers to the **real base table name**, not
a SQL alias.
- If a scoped token does not match the current scan (e.g., the scan is
on `t1` but the config includes `t2.a`), that token will be ignored.
- If a specified column does not exist (or does not exist in the current
scan table schema), **it will not fail**; the token will be ignored (to avoid
multi-table queries failing due to schema differences).
- Default behavior when `predicate_lm_stage1_cols` is empty (or becomes
empty after ignoring invalid / non-matching tokens):
- If runtime filter predicate columns are available for the scan,
SegmentIterator will use those runtime filter columns as Stage1.
- If there is no runtime filter column available, the implementation
**falls back to single-stage predicate evaluation** (equivalent to
`enable_multi_stage_predicate_lm=false` for that scan). In this case, Stage2
will not happen.
- If you want multi-stage behavior for queries without runtime filters,
explicitly configure `predicate_lm_stage1_cols` with at least one valid column
for the target scan.
- Examples:
- `set predicate_lm_stage1_cols = 'a';`
- `set predicate_lm_stage1_cols = ' a ,``b``, a ';`
- `set predicate_lm_stage1_cols = 'lineitem.l_shipdate';`
- `set predicate_lm_stage1_cols = 'tpch.lineitem.l_shipdate';`
#### 2.2.3 `predicate_lm_stage1_survival_ratio_threshold`
- Type: double
- Default: 0.8 (storage-side default threshold)
- Purpose: The Stage1 survival ratio threshold used to decide the Stage2
strategy:
- `survival_ratio <= threshold` → prefer `stage2-by-rowids`
- `survival_ratio > threshold` → prefer `stage2-by-all-rows`
---
### 2.3 Recommended Validation (via Profile Metrics)
In the `SegmentIterator` block of the profile, focus on:
- `PredicateLMStage1InputRows`: number of input rows to Stage1
- `PredicateLMStage1OutputRows`: number of output rows from Stage1 (rows
surviving Stage1 filtering)
- `PredicateLMStage2ByRowIdsBatches`: number of batches where Stage2 was
triggered in by-rowids mode
- `PredicateLMStage2ByAllRowsBatches`: number of batches where Stage2 was
triggered in by-all-rows mode
- `PredicateLMStage2RowsRead`: total rows read by Stage2 (semantics differ
between by-rowids vs by-all-rows)
How to tell whether multi-stage predicate LM is enabled and effective:
- `PredicateLMStage1InputRows > 0` indicates the scan entered the Stage1 path
- `PredicateLMStage2ByRowIdsBatches > 0` or
`PredicateLMStage2ByAllRowsBatches > 0` indicates **Stage2 actually happened**
- If both Stage2 batch counters stay at 0, it means Stage2 did not happen
(either because it was not needed, or because the scan fell back to
single-stage behavior)
---
### 2.4 SQL Case Examples (Typical Trigger Paths)
The following examples use table `tbl_multi_stage_predicate_lm(k,a,b)`.
#### Case A: Baseline (Feature Off)
```sql
set enable_multi_stage_predicate_lm = false;
set predicate_lm_stage1_cols = '';
select count(*) from tbl_multi_stage_predicate_lm where a = 1 and b = 2;
```
#### Case B: Feature On + Stage2-by-rowids
Goal: Stage1 has a low survival ratio, so Stage2 reads late predicate
columns by rowids.
```sql
set enable_profile=true;
set enable_multi_stage_predicate_lm = true;
set predicate_lm_stage1_cols = 'a';
-- a=1 is selective, Stage1 survival ratio is low, prefer stage2-by-rowids
select /* rowids_case */ count(*)
from tbl_multi_stage_predicate_lm
where a = 1 and b = 2;
```
Expected observation:
- `PredicateLMStage2ByRowIdsBatches > 0`
#### Case C: Feature On + Stage2-by-all-rows
Goal: Stage1 has a high survival ratio, so Stage2 reads late predicate
columns for all rows.
```sql
set enable_profile=true;
set enable_multi_stage_predicate_lm = true;
set predicate_lm_stage1_cols = 'a';
-- Stage1 has a high survival ratio (e.g. 95%), prefer stage2-by-all-rows
select /* allrows_case */ count(*)
from tbl_multi_stage_predicate_lm
where a < 19 and b = 2;
```
Expected observation:
- `PredicateLMStage2ByAllRowsBatches > 0`
#### Case D: Scoped to Table / DB.Table (Recommended for Multi-Table Queries)
```sql
-- Only applies to column `col` of table `table_name`
set predicate_lm_stage1_cols = 'table_name.col';
-- Only applies to column `col` of table `db_name.table_name`
set predicate_lm_stage1_cols = 'db_name.table_name.col';
```
#### Case E: Invalid Column / Mismatched Scope Will Not Fail (Ignored)
Rationale: To avoid multi-table queries failing due to schema differences,
invalid column names or mismatched scoped tokens will be ignored.
```sql
-- Non-existing column: no error; token will be ignored
set predicate_lm_stage1_cols = 'not_exist';
-- Scoped to another table: no error; token will be ignored
set predicate_lm_stage1_cols = 'other_table.a';
select count(*) from tbl_multi_stage_predicate_lm where a = 1 and b = 2;
```
Note:
- If the effective Stage1 column list becomes empty after ignoring invalid /
non-matching tokens, and there is no runtime filter column available, the scan
will fall back to single-stage predicate evaluation (Stage2 will not happen).
Use profile metrics to confirm whether Stage2 is triggered.
---
## 3. Applicable Scenarios
This feature is best suited for scenarios below (the more conditions are
met, the more likely you will see gains):
- Many predicate columns (many AND conjuncts), and some of those columns are
expensive to read/decode
- You can choose one or more “cheap and highly selective” columns as Stage1
(e.g., highly selective equality predicates, low-cost types)
- Wide tables with many non-predicate columns, and lazy materialization is
enabled/beneficial (filter first, then read non-predicate columns)
- Queries have meaningful filtering (not a full scan or a high-survival scan
with little filtering)
- Runtime filters can serve as fast Stage1 filters (depending on
implementation strategy)
Recommendations before enabling broadly:
- Enable `enable_profile=true` and verify that the `SegmentIterator` block
metrics `PredicateLMStage1*` / `PredicateLMStage2*` are actually hit.
- Run a one-time A/B profile comparison on critical queries to confirm the
effect.
Typical example:
- `WHERE a = const AND (b = const OR c IN (...)) AND d > const ...`
where `a` is highly selective and cheap, making it a good Stage1 candidate.
---
## 4. Risks and Notes
### 4.1 Performance Regression Risk (Misconfiguration)
- If you put “low-selectivity / high-cost” columns into Stage1, Stage1
read/eval cost may increase and offset gains (or even regress).
- If Stage1 survival ratio stays high but Stage2 is still frequently
triggered, total read volume may approach or exceed the baseline (especially
with stage2-by-all-rows).
Recommendations:
- Prefer **high-selectivity, low read-cost** columns in Stage1.
- Use A/B profiling on common query patterns before finalizing the stage1
column strategy.
### 4.2 Random Read Risk (Misconfiguration)
- When stage2-by-rowids is chosen, rowid-based reads may increase seeks and
can be unfriendly under certain column/page layouts.
- This is why `survival_ratio_threshold` exists: when survival ratio is
high, prefer all-rows reads to avoid random I/O.
Recommendations:
- Tune the threshold reasonably and validate with real data distributions.
### 4.3 Multi-Table Query Considerations (Stage1 Column Scope)
- If a query scans multiple internal OLAP tables:
- Using an unqualified column name (e.g. `a`) may affect multiple scans at
once (if those tables share the same column name).
- Prefer `table.col` / `db.table.col` to scope the configuration to the
target table and avoid unintended effects.
- If a query joins with External Catalog tables:
- The external scan path does not support multi-stage predicate LM, so
`predicate_lm_stage1_cols` does not take effect on the external scan.
### 4.4 Silent Misconfiguration Risk
- Invalid column names or non-matching scoped tokens are ignored (no error).
This avoids query failures but may also cause the configuration to have no
effect.
- If the effective Stage1 column list becomes empty and there is no runtime
filter column available, the scan may fall back to single-stage behavior.
Recommendations:
- Enable profiling and verify Stage1/Stage2 metrics to confirm the feature
is effective for the target query.
- For queries without runtime filters, explicitly configure
`predicate_lm_stage1_cols` with at least one valid column for the target scan.
---
## 5. Performance Data
WIP
### Check List (For Author)
- Test <!-- At least one of them must be included. -->
- [ ] Regression test
- [ ] Unit Test
- [ ] Manual test (add detailed scripts or steps below)
- [ ] No need to test or manual test. Explain why:
- [ ] This is a refactor/code format and no logic has been changed.
- [ ] Previous test can cover this change.
- [ ] No code files have been changed.
- [ ] Other reason <!-- Add your reason? -->
- Behavior changed:
- [ ] No.
- [ ] Yes. <!-- Explain the behavior change -->
- Does this need documentation?
- [ ] No.
- [ ] Yes. <!-- Add document PR link here. eg:
https://github.com/apache/doris-website/pull/1214 -->
### Check List (For Reviewer who merge this PR)
- [ ] Confirm the release note
- [ ] Confirm test cases
- [ ] Confirm document
- [ ] Add branch pick label <!-- Add branch pick label that this PR should
merge into -->
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]