hello-stephen opened a new pull request, #63885:
URL: https://github.com/apache/doris/pull/63885

   ## Summary
   
   Discovered during 4.0.6-rc01 release testing. `cloud/mow/mow_correctness` 
chaos tests crashed with:
   
   ```
   F20260529 00:04:54.301800 cloud_tablet.cpp:1494] Check failed: false
   cumulative compaction: the merged rows(223), the filtered rows(0) is not equal to
   missed rows(141) in rowid conversion, tablet_id: 1779975412591, table_id:1779975342402
   ```
   
   Stack trace:
   ```
   CloudTablet::calc_delete_bitmap_for_compaction() at cloud_tablet.cpp:1576
   CloudCumulativeCompaction::modify_rowsets() at cloud_cumulative_compaction.cpp:315
   CloudCompactionMixin::execute_compact_impl() at compaction.cpp:1619
   ```
   
   Failed test cases: `stress.chaos_sidecar.fuzzy_config_chaos`, 
`stress.chaos_sidecar.debug_points_chaos`
   
   ### Root cause analysis
   
   Comparing `CloudTablet::calc_delete_bitmap_for_compaction` with the 
equivalent local tablet check in `Tablet::execute_compact_impl` 
(`compaction.cpp`), there are two discrepancies:
   
   **Issue 1: Incorrect `filtered_rows` inclusion**
   
   Cloud version (before fix):
   ```cpp
   if (merged_rows + filtered_rows >= 0 &&
       merged_rows + filtered_rows != missed_rows_size) {
   ```
   
   Local tablet version (`compaction.cpp:1312-1315`):
   ```cpp
   std::size_t merged_missed_rows_size = _stats.merged_rows;
   if (!_tablet->tablet_meta()->tablet_schema()->cluster_key_uids().empty()) {
       merged_missed_rows_size += _stats.filtered_rows;  // only for cluster-key tables
   }
   ```
   
   The local version only adds `filtered_rows` for cluster-key tables because 
delete signs are pruned during compaction in that case. For non-cluster-key MoW 
tables, `filtered_rows` should NOT be added to the comparison.
   
   In the crash, `merged_rows=223` and `filtered_rows=0`, so this discrepancy 
did not cause this particular failure. However, for non-cluster-key tables with 
non-zero `filtered_rows`, the cloud check could fail spuriously.
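   
   The aligned comparison can be illustrated with a minimal, standalone sketch. 
This is not the actual Doris code: `CompactionStats`, `expected_missed_rows`, 
and `missed_rows_match` are hypothetical names, and the `has_cluster_key` flag 
stands in for the `cluster_key_uids().empty()` test shown in the local version.
   
   ```cpp
   #include <cassert>
   #include <cstdint>

   // Hypothetical stand-in for the compaction statistics used in the check.
   struct CompactionStats {
       int64_t merged_rows = 0;
       int64_t filtered_rows = 0;
   };

   // Rows expected to be missing from the rowid conversion.
   // filtered_rows is only counted for cluster-key tables.
   inline int64_t expected_missed_rows(const CompactionStats& stats, bool has_cluster_key) {
       assert(stats.merged_rows >= 0 && stats.filtered_rows >= 0);
       int64_t expected = stats.merged_rows;
       if (has_cluster_key) {
           expected += stats.filtered_rows;
       }
       return expected;
   }

   // True when the observed missed-row count agrees with the expectation.
   inline bool missed_rows_match(const CompactionStats& stats, bool has_cluster_key,
                                 int64_t missed_rows_size) {
       return expected_missed_rows(stats, has_cluster_key) == missed_rows_size;
   }
   ```
   
   Under this sketch, the crash values (`merged_rows=223`, `filtered_rows=0`, 
`missed_rows=141`) still fail the check for either table type, which is 
consistent with the mismatch itself needing separate investigation.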
   
   **Issue 2: Missing debug info**
   
   The local tablet version (`compaction.cpp:1347-1358`) logs per-rowset delete 
bitmap cardinality when `missed_rows_size == 0` for easier diagnosis. This 
information is now added to the cloud version as well.
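   
   As a rough illustration of this kind of diagnostic output, the sketch below 
builds one log string from per-rowset cardinalities. It assumes the caller has 
already collected `(rowset id, cardinality)` pairs; `RowsetCardinality` and 
`format_delete_bitmap_cardinality` are hypothetical names, and the message 
format is illustrative rather than the one used in Doris.
   
   ```cpp
   #include <cstdint>
   #include <sstream>
   #include <string>
   #include <utility>
   #include <vector>

   // Hypothetical: (rowset id, delete bitmap cardinality) gathered by the caller.
   using RowsetCardinality = std::pair<std::string, uint64_t>;

   // Builds a single diagnostic string listing the cardinality of each rowset.
   inline std::string format_delete_bitmap_cardinality(
           const std::vector<RowsetCardinality>& rowsets) {
       std::ostringstream out;
       for (const auto& entry : rowsets) {
           out << "rowset_id=" << entry.first
               << ", delete_bitmap_cardinality=" << entry.second << "; ";
       }
       return out.str();
   }
   ```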
   
   ### Changes
   
   - Align `merged_rows` comparison logic with local tablet (only add 
`filtered_rows` for cluster-key tables)
   - Add per-rowset delete bitmap cardinality debug logging when 
`missed_rows_size == 0`
   - Remove the redundant `>= 0` guard (always true for non-negative row counts)
   
   ### Note on remaining investigation
   
   Why `merged_rows` (223) differs from `missed_rows` (141) in the chaos test 
scenario still needs investigation. The crash was triggered by 
`fuzzy_config_chaos` and `debug_points_chaos`, which create edge cases where 
loads run concurrently with compaction. This PR fixes a logic discrepancy, but 
the underlying scenario that produces the mismatch (likely concurrent loads 
during compaction) should be tracked separately.
   
   Related PRs: #43909, #50966, #54650
   
   ## Test plan
   
   - [ ] Regression test: `cloud/mow/mow_correctness` stress test
   - [ ] Verify crash no longer occurs with 
`enable_mow_compaction_correctness_check_core=true`
   
   🤖 Generated with [Claude Code](https://claude.com/claude-code)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

