wenzhenghu opened a new pull request, #64332:
URL: https://github.com/apache/doris/pull/64332

   ### What problem does this PR solve?
   
   In ETL scenarios, a large table (e.g. 262K rows) created via CTAS or
INSERT INTO SELECT is often joined immediately with a known small table (e.g.
10 rows). Because auto-analyze has not yet completed, the FE optimizer cannot
obtain the row count of the new table and falls back to 1, causing the large
table to be incorrectly chosen as the broadcast (replicated) side. This leads
to excessive memory usage and query cancellation.
   
   The root cause chain: after the CTAS/INSERT INTO SELECT transaction becomes
VISIBLE, the new table has no `TableStatsMeta`.
`StatsCalculator.getOlapTableRowCount()` receives `-1`, which
`Math.max(1, -1)` clamps to `1`. If the small table has been analyzed and has
a known row count (e.g. 10), the broadcast cost model sees `1 < 10` and
broadcasts the large table.
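
The fallback described above can be illustrated with a minimal Java sketch.
It is not Doris FE code: the class `RowCountFallbackSketch`, its methods, and
the "side with fewer estimated rows gets broadcast" rule are simplifying
assumptions standing in for `StatsCalculator` and the broadcast cost model.

```java
// Illustrative sketch only; all names here are hypothetical, not Doris FE code.
final class RowCountFallbackSketch {
    /** Sentinel assumed to mean "no table statistics available". */
    static final long UNKNOWN_ROW_COUNT = -1L;

    /** Mirrors the clamp described above: any count below 1 becomes 1. */
    static long effectiveRowCount(long reportedRowCount) {
        return Math.max(1L, reportedRowCount);
    }

    /** Simplified stand-in for the cost model: broadcast the side with fewer estimated rows. */
    static String broadcastSide(String leftName, long leftRows,
                                String rightName, long rightRows) {
        return effectiveRowCount(leftRows) <= effectiveRowCount(rightRows)
                ? leftName
                : rightName;
    }

    public static void main(String[] args) {
        // No stats yet: the new table is estimated at 1 row.
        System.out.println(
                broadcastSide("new_large_table", UNKNOWN_ROW_COUNT, "small_table", 10L));
        // Real row count known: the estimate reflects the actual size.
        System.out.println(
                broadcastSide("new_large_table", 262_144L, "small_table", 10L));
    }
}
```

With the unknown count clamped to 1, the sketch picks `new_large_table`; once
the real count of 262,144 is supplied, it picks `small_table`.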
   
   ### Solution
   
   After the CTAS/INSERT INTO SELECT transaction becomes VISIBLE, bootstrap a
minimal `TableStatsMeta` that contains only the table-level and base-index row
counts, without any column statistics. This gives the optimizer a row count to
use for correct broadcast-side selection.
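
As a rough picture of what "minimal" means here, the sketch below models a
stats object that carries row counts and nothing else. The class
`BootstrapStatsSketch` and its field layout are hypothetical and much simpler
than the real `TableStatsMeta`; in particular, seeding `updatedRows` with the
loaded row count is an assumption, since the description does not state the
value.

```java
import java.util.Collections;
import java.util.Map;

// Hypothetical, simplified stand-in for a bootstrapped table-stats object.
final class BootstrapStatsSketch {
    final long rowCount;
    // Assumption: seeded with the loaded row count; the PR text does not say.
    final long updatedRows;
    // Index id -> row count; only the base index is populated.
    final Map<Long, Long> indexesRowCount;
    // Deliberately empty: bootstrap stats carry no column statistics.
    final Map<String, Object> columnStats = Collections.emptyMap();
    // Left unset so the entry is not treated as user-injected.
    final boolean userInjected = false;

    private BootstrapStatsSketch(long baseIndexId, long loadedRows) {
        this.rowCount = loadedRows;
        this.updatedRows = loadedRows;
        this.indexesRowCount = Collections.singletonMap(baseIndexId, loadedRows);
    }

    static BootstrapStatsSketch newBootstrapStats(long baseIndexId, long loadedRows) {
        if (loadedRows <= 0) {
            throw new IllegalArgumentException("loadedRows must be positive");
        }
        return new BootstrapStatsSketch(baseIndexId, loadedRows);
    }
}
```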
   
   Core changes:
   - `TableStatsMeta.newBootstrapStats()`: creates a `TableStatsMeta` with only
`rowCount`, `updatedRows`, and the base index entry in `indexesRowCount`. It
does not set `userInjected` and does not interfere with subsequent
auto-analyze scheduling.
   - `AnalysisManager.bootstrapTableStatsIfAbsent()`: uses double-checked
locking and writes only when no `TableStatsMeta` exists and `loadedRows > 0`.
   - `OlapInsertExecutor`: invokes bootstrap after the transaction reaches
VISIBLE status.
   - `ShowTableStatsCommand`: adds a null guard for `jobType`, because
bootstrap stats have no associated analyze job.
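
The "write only if absent" behavior can be sketched with a generic
double-checked locking pattern. This is an illustration under assumptions, not
the actual `AnalysisManager` code: the class `StatsBootstrapManagerSketch`,
the map keyed by table id, and the single lock object are all hypothetical.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical sketch of a double-checked "bootstrap if absent" write.
final class StatsBootstrapManagerSketch {
    // Table id -> bootstrapped row count (stand-in for a full stats object).
    private final ConcurrentMap<Long, Long> rowCountByTableId = new ConcurrentHashMap<>();
    private final Object writeLock = new Object();

    /** Returns true only if this call wrote the bootstrap row count. */
    boolean bootstrapIfAbsent(long tableId, long loadedRows) {
        if (loadedRows <= 0) {
            return false; // nothing was loaded, so there is nothing to seed
        }
        // First check without the lock: the common case where stats already exist.
        if (rowCountByTableId.containsKey(tableId)) {
            return false;
        }
        synchronized (writeLock) {
            // Second check under the lock: another thread may have written meanwhile.
            if (rowCountByTableId.containsKey(tableId)) {
                return false;
            }
            rowCountByTableId.put(tableId, loadedRows);
            return true;
        }
    }

    /** Returns the stored row count, or null when no stats exist for the table. */
    Long rowCountOrNull(long tableId) {
        return rowCountByTableId.get(tableId);
    }
}
```

In this sketch an existing entry is never overwritten, so a later write for
the same table is a no-op.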
   
   ### New Session Variable
   
   `enable_insert_select_table_stats_bootstrap` (default `false`, EXPERIMENTAL)
   
   Usage:
   
   ```sql
   SET enable_insert_select_table_stats_bootstrap = true;
   
   CREATE TABLE target_table AS SELECT ... FROM large_source;
   -- or
   INSERT INTO target_table SELECT ... FROM large_source;
   
   -- After the statement returns, SHOW TABLE STATS shows the row count,
   -- and the optimizer can use it for correct broadcast-side selection.
   ```
   
   ### Check List
   
   - Test:
       - FE Unit Test:
           - `TableStatsMetaTest.testNewBootstrapStatsSeedsBaseIndexRowCount`: verifies bootstrap metadata field correctness
           - `OlapInsertExecutorTest.testExecuteSingleInsertVisibleBootstrapsTableStatsWhenAbsent`: verifies bootstrap takes effect when enabled
           - `OlapInsertExecutorTest.testExecuteSingleInsertVisibleDoesNotBootstrapTableStatsWhenDisabled`: verifies no bootstrap when disabled (default)
           - `ShowTableStatsCommandTest.testConstructTableResultSetForBootstrapStats`: verifies `SHOW TABLE STATS` renders bootstrap metadata without NPE
       - Regression Test: `insert_select_table_stats_bootstrap.groovy`:
two-phase assertions. When disabled, `stats=1` and the large table is
broadcast; when enabled, `stats=262,144` and the small table is broadcast. Ran
10 consecutive times on a remote Doris instance; all runs passed.
       - Manual Test: verified on a deployed remote Doris instance with the 
latest code.
   - Behavior changed: No (disabled by default, no impact on existing behavior)
   - Does this need documentation: Yes (new session variable)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

