hello-stephen opened a new pull request, #63885:
URL: https://github.com/apache/doris/pull/63885
## Summary
Discovered during 4.0.6-rc01 release testing. `cloud/mow/mow_correctness`
chaos tests crashed with:
```
F20260529 00:04:54.301800 cloud_tablet.cpp:1494] Check failed: false
cumulative compaction: the merged rows(223), the filtered rows(0) is not equal to missed rows(141) in rowid conversion, tablet_id: 1779975412591, table_id:1779975342402
```
Stack trace:
```
CloudTablet::calc_delete_bitmap_for_compaction() at cloud_tablet.cpp:1576
CloudCumulativeCompaction::modify_rowsets() at cloud_cumulative_compaction.cpp:315
CloudCompactionMixin::execute_compact_impl() at compaction.cpp:1619
```
Failed test cases: `stress.chaos_sidecar.fuzzy_config_chaos`,
`stress.chaos_sidecar.debug_points_chaos`
### Root cause analysis
Comparing `CloudTablet::calc_delete_bitmap_for_compaction` with the
equivalent local tablet check in `Tablet::execute_compact_impl`
(`compaction.cpp`), there are two discrepancies:
**Issue 1: Incorrect `filtered_rows` inclusion**
Cloud version (before fix):
```cpp
if (merged_rows + filtered_rows >= 0 &&
merged_rows + filtered_rows != missed_rows_size) {
```
Local tablet version (`compaction.cpp:1312-1315`):
```cpp
std::size_t merged_missed_rows_size = _stats.merged_rows;
if (!_tablet->tablet_meta()->tablet_schema()->cluster_key_uids().empty()) {
    merged_missed_rows_size += _stats.filtered_rows; // only for cluster key tables
}
```
The local version only adds `filtered_rows` for cluster-key tables because
delete signs are pruned during compaction in that case. For non-cluster-key MoW
tables, `filtered_rows` should NOT be added to the comparison.
In the crash, `merged_rows=223` and `filtered_rows=0`, so the extra term did
not affect the comparison in this instance. For non-cluster-key tables with
non-zero `filtered_rows`, however, the discrepancy could cause false positives.
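The aligned comparison can be sketched in isolation as follows. This is a minimal illustration, not the Doris code: `CompactionStats`, `expected_missed_rows`, and `missed_rows_match` are hypothetical names, and the `has_cluster_key` flag stands in for the `cluster_key_uids().empty()` check quoted above.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for the compaction statistics. The field names
// mirror the ones quoted above, but this is not the Doris type.
struct CompactionStats {
    std::int64_t merged_rows = 0;
    std::int64_t filtered_rows = 0;
};

// Number of rows that rowid conversion is expected to report as missed.
// filtered_rows is counted only for cluster-key tables.
std::size_t expected_missed_rows(const CompactionStats& stats, bool has_cluster_key) {
    std::size_t expected = static_cast<std::size_t>(stats.merged_rows);
    if (has_cluster_key) {
        expected += static_cast<std::size_t>(stats.filtered_rows);
    }
    return expected;
}

// True when the correctness check passes for the given missed-row count.
bool missed_rows_match(const CompactionStats& stats, bool has_cluster_key,
                       std::size_t missed_rows_size) {
    return expected_missed_rows(stats, has_cluster_key) == missed_rows_size;
}
```

With the values from the crash (`merged_rows=223`, `filtered_rows=0`, `missed_rows=141`), this check fails whether or not the table has cluster keys, which is consistent with the mismatch itself needing separate investigation.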
**Issue 2: Missing debug info**
The local tablet version (`compaction.cpp:1347-1358`) logs per-rowset delete
bitmap cardinality when `missed_rows_size == 0` for easier diagnosis. This
information is now added to the cloud version as well.
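As a rough sketch of what such a diagnostic might look like, the helper below formats per-rowset delete bitmap cardinalities into a single string. Everything here is an assumption for illustration: `RowsetDeleteBitmaps` and `describe_delete_bitmaps` are hypothetical names, a `std::set` stands in for the real delete bitmap type, and the output format is not the one used by Doris.

```cpp
#include <cstdint>
#include <map>
#include <set>
#include <sstream>
#include <string>

// Hypothetical model: rowset id -> set of deleted row ids. The real code
// works with delete bitmap objects; a std::set is used here only so the
// cardinality can be computed with the standard library.
using RowsetDeleteBitmaps = std::map<std::string, std::set<std::uint32_t>>;

// Builds a "rowset_id: cardinality" summary, one entry per rowset,
// ordered by rowset id (std::map iterates in key order).
std::string describe_delete_bitmaps(const RowsetDeleteBitmaps& bitmaps) {
    std::ostringstream out;
    bool first = true;
    for (const auto& entry : bitmaps) {
        if (!first) {
            out << ", ";
        }
        out << entry.first << ": " << entry.second.size();
        first = false;
    }
    return out.str();
}
```

A caller would emit this summary only in the `missed_rows_size == 0` branch, so the extra formatting cost is paid only when the check is about to fail.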
### Changes
- Align `merged_rows` comparison logic with local tablet (only add
`filtered_rows` for cluster-key tables)
- Add per-rowset delete bitmap cardinality debug logging when
`missed_rows_size == 0`
- Remove the redundant `>= 0` guard (always true for non-negative row counts)
### Note on remaining investigation
Why `merged_rows` (223) differed from `missed_rows` (141) in the chaos test
still needs investigation. The crash was triggered by `fuzzy_config_chaos` and
`debug_points_chaos`, which create edge cases where loads and compaction run
concurrently. This PR fixes a logic discrepancy, but the underlying scenario
that produces the mismatch (likely concurrent loads during compaction) should
be tracked separately.
Related PRs: #43909, #50966, #54650
## Test plan
- [ ] Regression test: `cloud/mow/mow_correctness` stress test
- [ ] Verify crash no longer occurs with
`enable_mow_compaction_correctness_check_core=true`
🤖 Generated with [Claude Code](https://claude.com/claude-code)