deardeng opened a new pull request, #60480:
URL: https://github.com/apache/doris/pull/60480

   ## Proposed changes
   
   ### Problem
   
   During cloud tablet decommission, some tablets take an unexpectedly long 
time (5+ minutes) to migrate because FE keeps waiting for warmup tasks to 
complete, even though the tasks have already failed on the BE side.
   
   **Root cause**: In `FileCacheBlockDownloader::download_file_cache_block()`, 
when an early return occurs (e.g., tablet not found, rowset not found, storage 
resource error), the `_inflight_tablets` count is not decremented. This causes:
   
   1. `check_download_task()` always returns `done=false` for these tablets
   2. FE's `checkInflightWarmUpCacheAsync()` waits until timeout (default 300 
seconds)
   3. Tablet migration is blocked unnecessarily
   
   **Example log showing the issue**:
   ```
   W download_file_cache_block: tablet_id=1769675033824 rowset_id not found, 
rowset_id=020000000010fa85...
   ```
   After this warning, the tablet's inflight count remains in the 
`_inflight_tablets` map, so FE waits the full 5 minutes before it times out 
and proceeds.
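
The failure mode described above can be reproduced in a small model. This is only a sketch under stated assumptions, not the actual BE code: `InflightModel`, `submit`, `download_buggy`, and `is_done` are hypothetical names, and "done" is assumed to mean that the tablet no longer has an entry in the map.

```cpp
#include <cstdint>
#include <mutex>
#include <unordered_map>

// Hypothetical, simplified stand-in for the downloader's bookkeeping.
// It is NOT the real FileCacheBlockDownloader.
class InflightModel {
public:
    // A warmup task is queued for the tablet: bump its inflight count.
    void submit(int64_t tablet_id) {
        std::lock_guard<std::mutex> lock(_mtx);
        ++_inflight_tablets[tablet_id];
    }

    // Models the buggy path: a failed lookup returns early and never
    // touches the count, so the entry is left behind.
    void download_buggy(int64_t tablet_id, bool rowset_found) {
        if (!rowset_found) {
            return; // early return without decrementing
        }
        finish(tablet_id);
    }

    // Assumed semantics of the done check: done only when no entry remains.
    bool is_done(int64_t tablet_id) {
        std::lock_guard<std::mutex> lock(_mtx);
        return _inflight_tablets.count(tablet_id) == 0;
    }

private:
    // Normal completion: decrement and erase the entry when it reaches zero.
    void finish(int64_t tablet_id) {
        std::lock_guard<std::mutex> lock(_mtx);
        auto it = _inflight_tablets.find(tablet_id);
        if (it == _inflight_tablets.end()) {
            return;
        }
        if (--it->second <= 0) {
            _inflight_tablets.erase(it);
        }
    }

    std::mutex _mtx;
    std::unordered_map<int64_t, int> _inflight_tablets;
};
```

In this model, calling `submit(1)` and then `download_buggy(1, false)` leaves `is_done(1)` false indefinitely, which corresponds to the state in which FE keeps polling until its timeout.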
   
   ### Solution
   
   1. Extract the inflight count decrement logic into a reusable lambda 
`decrease_inflight_count`
   
   2. Call `decrease_inflight_count()` in all early return paths:
      - When `get_tablet()` fails
      - When `rowset_id` is not found  
      - When `remote_storage_resource()` fails
   
   3. Refactor the `download_done` callback to reuse `decrease_inflight_count`, 
eliminating code duplication
   
   4. Capture `decrease_inflight_count` by value in the `download_done` lambda 
to ensure lifetime safety if the callback is ever called asynchronously in the 
future
   
   5. Add unit tests to verify that the inflight count is correctly 
decremented on failures
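
The shape of the fix listed above might look roughly like the following. This is a minimal sketch, not the code in this PR: `DownloaderSketch`, `submit`, `is_done`, and the boolean parameters `tablet_ok`, `rowset_ok`, and `storage_ok` are hypothetical stand-ins for the real lookups (`get_tablet()`, the `rowset_id` lookup, and `remote_storage_resource()`), and the completion callback is simplified to take no arguments.

```cpp
#include <cstdint>
#include <functional>
#include <mutex>
#include <unordered_map>

// Hypothetical stand-in; not the real FileCacheBlockDownloader.
struct DownloaderSketch {
    std::mutex mtx;
    std::unordered_map<int64_t, int> inflight_tablets;

    // A warmup task is queued for the tablet: bump its inflight count.
    void submit(int64_t tablet_id) {
        std::lock_guard<std::mutex> lock(mtx);
        ++inflight_tablets[tablet_id];
    }

    // Assumed done check: done only when no entry remains for the tablet.
    bool is_done(int64_t tablet_id) {
        std::lock_guard<std::mutex> lock(mtx);
        return inflight_tablets.count(tablet_id) == 0;
    }

    // Returns the completion callback when the download is started, or an
    // empty std::function when one of the lookups fails (early return).
    std::function<void()> download(int64_t tablet_id, bool tablet_ok,
                                   bool rowset_ok, bool storage_ok) {
        // Single place that decrements the count and erases the entry at zero.
        auto decrease_inflight_count = [this, tablet_id]() {
            std::lock_guard<std::mutex> lock(mtx);
            auto it = inflight_tablets.find(tablet_id);
            if (it == inflight_tablets.end()) {
                return;
            }
            if (--it->second <= 0) {
                inflight_tablets.erase(it);
            }
        };

        // Every early-return path releases the count before returning.
        if (!tablet_ok || !rowset_ok || !storage_ok) {
            decrease_inflight_count();
            return {};
        }

        // The callback copies the lambda (value capture), so it stays valid
        // after this function returns. The copy still holds `this`, so the
        // owning object must outlive the callback.
        auto download_done = [decrease_inflight_count]() {
            decrease_inflight_count();
        };
        return download_done;
    }
};
```

Keeping the decrement in one lambda means a new early-return path only needs a single call to stay correct, and the success path goes through the same code.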
   
   ## Further comments
   
   This bug also causes a minor memory leak: entries in the `_inflight_tablets` 
map are never cleaned up when warmup fails, so they slowly accumulate over 
time (they are cleared only on BE restart).
   
   ## Checklist(Required)
   
   1. Does it affect the original behavior: 
       - [ ] Yes
       - [x] No
   2. Has unit tests been added:
       - [x] Yes
       - [ ] No 
   3. Has document been added or modified:
       - [ ] Yes
       - [x] No
   4. Does it need to update dependencies:
       - [ ] Yes
       - [x] No
   5. Is there any sharding changes:
       - [ ] Yes
       - [x] No
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

