Hastyshell opened a new pull request, #61380:
URL: https://github.com/apache/doris/pull/61380
## Problem
In cloud mode, schema change on MOW (Merge-on-Write) tables intermittently
fails with:
```
task type: ALTER, status_code: INTERNAL_ERROR, status_message:
[(BE_IP)[INTERNAL_ERROR]failed to start tablet job:
meta_service_job.cpp could not perform compaction on expired tablet cache.
req_base_compaction_cnt=0, base_compaction_cnt=0,
req_cumulative_compaction_cnt=8, cumulative_compaction_cnt=9]
```
## Root Cause
Schema change on a MOW table calls `_process_delete_bitmap()`, which
registers a `STOP_TOKEN` compaction job via
`CloudCompactionStopToken::do_register()`. The `STOP_TOKEN` is **not a real
compaction** — it is a lock marker that blocks concurrent compactions during
delete bitmap recalculation.
However, `start_compaction_job()` in the meta-service applies the stale
tablet cache check **unconditionally to all compaction types**, including
`STOP_TOKEN`. If a concurrent compaction on another BE node advances
`cumulative_compaction_cnt` in the meta-service while the schema change BE
still holds its old cached value (8 versus 9 in the error above), the
`STOP_TOKEN` registration is rejected with `STALE_TABLET_CACHE`. The error
then propagates back to the FE as a fatal ALTER task failure.
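The interleaving can be replayed in a toy model. This is only a sketch, not the meta-service code: `JobType`, `CompactionRequest`, `TabletStats`, `old_check_rejects`, and `replay_race` are hypothetical stand-ins for the protobuf messages and for the check inside `start_compaction_job()`. The counts are taken from the error message above.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-ins for the real protobuf messages; names are illustrative only.
enum class JobType { BASE, CUMULATIVE, STOP_TOKEN };

struct CompactionRequest {
    JobType type;
    int64_t base_compaction_cnt;        // value cached on the requesting BE
    int64_t cumulative_compaction_cnt;  // value cached on the requesting BE
};

struct TabletStats {
    int64_t base_compaction_cnt = 0;        // authoritative value in the meta-service
    int64_t cumulative_compaction_cnt = 0;  // authoritative value in the meta-service
};

// Models the pre-fix behavior: the staleness check ignores the job type.
bool old_check_rejects(const CompactionRequest& req, const TabletStats& stats) {
    return req.base_compaction_cnt < stats.base_compaction_cnt ||
           req.cumulative_compaction_cnt < stats.cumulative_compaction_cnt;
}

// Replays the race and reports whether the STOP_TOKEN registration is rejected.
bool replay_race() {
    TabletStats stats;
    stats.cumulative_compaction_cnt = 8;

    // 1. The schema change BE caches the current counts.
    CompactionRequest stop_token{JobType::STOP_TOKEN, stats.base_compaction_cnt,
                                 stats.cumulative_compaction_cnt};
    assert(!old_check_rejects(stop_token, stats));  // fresh counts still pass

    // 2. A compaction on another BE commits and bumps the count to 9.
    ++stats.cumulative_compaction_cnt;

    // 3. The schema change BE registers its STOP_TOKEN with the old count of 8.
    return old_check_rejects(stop_token, stats);
}
```

In this model `replay_race()` returns `true`: the lock marker is rejected even though it never reads the rowsets that the counts describe.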
## Fix
Skip the stale tablet cache check when the compaction job type is
`STOP_TOKEN`. Since `STOP_TOKEN` does not read or compact any rowsets,
the freshness of its cached compaction counts is irrelevant to it.
```cpp
// Before
if (compaction.base_compaction_cnt() < stats.base_compaction_cnt() ||
    compaction.cumulative_compaction_cnt() < stats.cumulative_compaction_cnt()) {
    // ... reject with STALE_TABLET_CACHE
}

// After
if (compaction.type() != TabletCompactionJobPB::STOP_TOKEN &&
    (compaction.base_compaction_cnt() < stats.base_compaction_cnt() ||
     compaction.cumulative_compaction_cnt() < stats.cumulative_compaction_cnt())) {
    // ... reject with STALE_TABLET_CACHE
}
```
## Testing
Added regression test `StopTokenSkipsStaleTabletCacheCheck` in
`cloud/test/meta_service_job_test.cpp` that:
1. Sets up a tablet with `cumulative_compaction_cnt=9` on the meta-service
side
2. Verifies that a regular `CUMULATIVE` compaction with stale count=8 is
still correctly rejected with `STALE_TABLET_CACHE`
3. Verifies that a `STOP_TOKEN` with the same stale count=8 succeeds with
`OK`
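
The three steps can be mirrored in a standalone sketch of the post-fix check. This is an illustration under stated assumptions, not the actual test in `meta_service_job_test.cpp`: `JobType`, `Status`, `CompactionRequest`, `TabletStats`, `start_job`, and `check_scenario` are hypothetical names, and the base counts are set to 0 as in the error message.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-ins for the protobuf types and status codes; not the Doris API.
enum class JobType { BASE, CUMULATIVE, STOP_TOKEN };
enum class Status { OK, STALE_TABLET_CACHE };

struct CompactionRequest {
    JobType type;
    int64_t base_compaction_cnt;
    int64_t cumulative_compaction_cnt;
};

struct TabletStats {
    int64_t base_compaction_cnt;
    int64_t cumulative_compaction_cnt;
};

// Models the post-fix check: STOP_TOKEN bypasses the staleness comparison.
Status start_job(const CompactionRequest& req, const TabletStats& stats) {
    if (req.type != JobType::STOP_TOKEN &&
        (req.base_compaction_cnt < stats.base_compaction_cnt ||
         req.cumulative_compaction_cnt < stats.cumulative_compaction_cnt)) {
        return Status::STALE_TABLET_CACHE;
    }
    return Status::OK;
}

// Mirrors the three steps of the regression test.
void check_scenario() {
    // Step 1: the meta-service side holds cumulative_compaction_cnt=9.
    const TabletStats stats{0, 9};

    // Step 2: a regular CUMULATIVE job with stale count 8 is still rejected.
    const CompactionRequest cumulative{JobType::CUMULATIVE, 0, 8};
    assert(start_job(cumulative, stats) == Status::STALE_TABLET_CACHE);

    // Step 3: a STOP_TOKEN with the same stale count succeeds.
    const CompactionRequest stop_token{JobType::STOP_TOKEN, 0, 8};
    assert(start_job(stop_token, stats) == Status::OK);
}
```

Step 2 matters as much as step 3: it guards against a fix that accidentally disables the staleness check for real compactions.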
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]