wenzhenghu opened a new pull request, #64332:
URL: https://github.com/apache/doris/pull/64332
### What problem does this PR solve?
In ETL scenarios, a large table (e.g. 262K rows) created via CTAS or
INSERT INTO SELECT is often joined immediately with a known small table (e.g. 10
rows). Because auto-analyze has not yet completed, the FE optimizer cannot
obtain the row count of the new table and falls back to 1, so the large
table is incorrectly chosen as the broadcast (replicated) side. This leads
to excessive memory usage and query cancellation.
The root cause chain: after the CTAS/INSERT INTO SELECT transaction becomes
VISIBLE, the new table has no `TableStatsMeta`.
`StatsCalculator.getOlapTableRowCount()` receives `-1`, which
`Math.max(1, -1)` clamps to `1`. If the small table has been analyzed and has
a known row count (e.g. 10), the broadcast cost model sees `1 < 10` and
broadcasts the large table.
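
The chain above can be illustrated with a minimal, self-contained sketch. It is not the Doris code: `RowCountFallbackSketch`, `effectiveRowCount`, and `broadcastSide` are hypothetical names, and the "smaller estimate is broadcast" rule is a deliberate simplification of the real cost model.

```java
final class RowCountFallbackSketch {
    // Hypothetical stand-in for the clamp described above:
    // a missing row count (-1) is turned into 1.
    static long effectiveRowCount(long reportedRowCount) {
        return Math.max(1, reportedRowCount);
    }

    // Simplifying assumption: the side with the smaller estimate is broadcast.
    static String broadcastSide(long leftRows, long rightRows) {
        return leftRows <= rightRows ? "left" : "right";
    }

    public static void main(String[] args) {
        long smallTable = effectiveRowCount(10);       // analyzed small table

        // No TableStatsMeta yet: -1 is clamped to 1, so the new table looks tiny.
        long newTableNoStats = effectiveRowCount(-1);
        System.out.println(broadcastSide(newTableNoStats, smallTable)); // left

        // With a bootstrapped row count the small table is chosen instead.
        long newTableBootstrapped = effectiveRowCount(262_144);
        System.out.println(broadcastSide(newTableBootstrapped, smallTable)); // right
    }
}
```

Here "left" is the newly created large table and "right" is the analyzed small table.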
### Solution
After the CTAS/INSERT INTO SELECT transaction becomes VISIBLE, bootstrap a
minimal `TableStatsMeta` that contains only the table-level and base-index row
counts, without any column statistics. This gives the optimizer a usable
row count for correct broadcast-side selection.
Core changes:
- `TableStatsMeta.newBootstrapStats()`: creates a `TableStatsMeta` with only
`rowCount`, `updatedRows`, and base index `indexesRowCount`. Does not set
`userInjected` and does not interfere with subsequent auto-analyze scheduling.
- `AnalysisManager.bootstrapTableStatsIfAbsent()`: uses double-checked locking
and writes only when no `TableStatsMeta` exists and `loadedRows > 0`.
- `OlapInsertExecutor`: invokes bootstrap after the transaction reaches
VISIBLE status.
- `ShowTableStatsCommand`: adds null guard for `jobType`, as bootstrap stats
have no associated analyze job.
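
The core changes can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the code in this PR: `BootstrapStatsSketch`, `Stats`, `bootstrapIfAbsent`, and `renderJobType` are hypothetical names, the sketch models only `rowCount` and the base-index row count, and rendering a missing `jobType` as an empty string is an assumption about the null guard.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

final class BootstrapStatsSketch {
    /** Hypothetical, trimmed-down stand-in for TableStatsMeta. */
    static final class Stats {
        final long rowCount;
        final Map<Long, Long> indexesRowCount = new ConcurrentHashMap<>();
        final String jobType;               // null: no analyze job produced these stats
        final boolean userInjected = false; // bootstrap stats are not user-injected

        private Stats(long rowCount, long baseIndexId, String jobType) {
            this.rowCount = rowCount;
            this.indexesRowCount.put(baseIndexId, rowCount);
            this.jobType = jobType;
        }

        // Row counts only; no column statistics are created.
        static Stats newBootstrapStats(long loadedRows, long baseIndexId) {
            return new Stats(loadedRows, baseIndexId, null);
        }
    }

    private final Map<Long, Stats> statsByTableId = new ConcurrentHashMap<>();
    private final Object lock = new Object();

    /** Returns true only if this call wrote the bootstrap stats. */
    boolean bootstrapIfAbsent(long tableId, long baseIndexId, long loadedRows) {
        // First check, without the lock: the common case is a cheap early return.
        if (loadedRows <= 0 || statsByTableId.containsKey(tableId)) {
            return false;
        }
        synchronized (lock) {
            // Second check, under the lock: another thread may have written first.
            if (statsByTableId.containsKey(tableId)) {
                return false;
            }
            statsByTableId.put(tableId, Stats.newBootstrapStats(loadedRows, baseIndexId));
            return true;
        }
    }

    Stats get(long tableId) {
        return statsByTableId.get(tableId);
    }

    // Null guard: bootstrap stats have no job type (empty string is an assumption).
    static String renderJobType(Stats stats) {
        return stats.jobType == null ? "" : stats.jobType;
    }

    public static void main(String[] args) {
        BootstrapStatsSketch manager = new BootstrapStatsSketch();
        System.out.println(manager.bootstrapIfAbsent(1L, 100L, 262_144L)); // true
        System.out.println(manager.bootstrapIfAbsent(1L, 100L, 5L));       // false
        System.out.println(manager.get(1L).rowCount);                      // 262144
    }
}
```

In this toy version `ConcurrentHashMap.putIfAbsent` alone would be enough. The explicit lock is kept only to mirror the double-checked pattern named in the list above, which matters when the write has side effects beyond the map update.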
### New Session Variable
`enable_insert_select_table_stats_bootstrap` (default `false`, EXPERIMENTAL)
Usage:
```sql
SET enable_insert_select_table_stats_bootstrap = true;
CREATE TABLE target_table AS SELECT ... FROM large_source;
-- or
INSERT INTO target_table SELECT ... FROM large_source;
-- After the statement returns, SHOW TABLE STATS shows the row count,
-- and the optimizer can use it for correct broadcast-side selection.
```
### Check List
- Test:
    - FE Unit Test:
        - `TableStatsMetaTest.testNewBootstrapStatsSeedsBaseIndexRowCount`: verifies bootstrap metadata field correctness
        - `OlapInsertExecutorTest.testExecuteSingleInsertVisibleBootstrapsTableStatsWhenAbsent`: verifies bootstrap takes effect when enabled
        - `OlapInsertExecutorTest.testExecuteSingleInsertVisibleDoesNotBootstrapTableStatsWhenDisabled`: verifies no bootstrap when disabled (default)
        - `ShowTableStatsCommandTest.testConstructTableResultSetForBootstrapStats`: verifies `SHOW TABLE STATS` renders bootstrap metadata without NPE
    - Regression Test: `insert_select_table_stats_bootstrap.groovy`, with two-phase assertions: when disabled, `stats=1` and the large table is broadcast; when enabled, `stats=262,144` and the small table is broadcast. Ran 10 consecutive times on a remote Doris instance, all passed.
    - Manual Test: verified on a deployed remote Doris instance with the latest code.
- Behavior changed: No (disabled by default, no impact on existing behavior)
- Does this need documentation: Yes (new session variable)
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]