zy-kkk opened a new pull request, #57856:
URL: https://github.com/apache/doris/pull/57856
**Symptom:** External table queries hang indefinitely, FE process frozen.
**User-facing impact:** Query threads blocked waiting for schema cache:
```
"mysql-nio-pool-14981" TIMED_WAITING
at LinkedBlockingQueue.offer(LinkedBlockingQueue.java:385)
at ThreadPoolManager$BlockedPolicy.rejectedExecution(ThreadPoolManager.java:347)
at ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1379)
at CompletableFuture.asyncSupplyStage(CompletableFuture.java:1618)
at CacheLoader.asyncReload(CacheLoader.java:188)
at BoundedLocalCache.refreshIfNeeded(BoundedLocalCache.java:1214)
at LocalLoadingCache.get(LocalLoadingCache.java:56)
at ExternalSchemaCache.getSchemaValue(ExternalSchemaCache.java:86)
at ExternalTable.getSchemaCacheValue(ExternalTable.java:371)
at HMSExternalTable.getPartitionColumns(HMSExternalTable.java:288)
at PruneFileScanPartition.pruneHivePartitions(PruneFileScanPartition.java:84)
"CommonRefreshExecutor-63" TIMED_WAITING
at LinkedBlockingQueue.offer(LinkedBlockingQueue.java:385)
at ThreadPoolManager$BlockedPolicy.rejectedExecution(ThreadPoolManager.java:347)
at ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1379)
at CompletableFuture.asyncSupplyStage(CompletableFuture.java:1618)
at CacheLoader.asyncReload(CacheLoader.java:188)
at BoundedLocalCache.refreshIfNeeded(BoundedLocalCache.java:1214)
"CommonRefreshExecutor-62" TIMED_WAITING
at LinkedBlockingQueue.offer(LinkedBlockingQueue.java:385)
at ThreadPoolManager$BlockedPolicy.rejectedExecution(ThreadPoolManager.java:347)
at ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1379)
at BoundedLocalCache.notifyRemoval(BoundedLocalCache.java:333)
at BoundedLocalCache.removeNode(BoundedLocalCache.java:1882)
at LocalManualCache.invalidateAll(LocalManualCache.java:150)
at MetaCache.invalidateAll(MetaCache.java:121)
at ExternalDatabase.setUnInitialized(ExternalDatabase.java:123)
```
**Root cause**: Caffeine cache deadlock when:
1. MetaCache uses a bounded executor (CommonRefreshExecutor: 64 threads +
640K queue) for both async operations and removal listeners
2. The database cache removal listener calls tableCache.invalidateAll()
3. The executor is full (all threads busy + queue full)
4. Both async reload and the removal listener try to submit tasks to the
full executor
5. Deadlock: the executor's own threads block while submitting to the full
queue, so no thread is free to drain it
Jstack evidence: 82 CommonRefreshExecutor threads blocked on
LinkedBlockingQueue.offer(), as in the stack traces above.
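The stall described above can be reproduced in miniature with only `java.util.concurrent`. The following is a rough sketch, not the Doris code: `SelfSubmitStall` and `TimedBlockingPolicy` are hypothetical names, the pool is shrunk to one thread and one queue slot, and the policy is assumed to wait on a timed `offer()` and then give up, which is what the `TIMED_WAITING` frames suggest.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.RejectedExecutionHandler;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

final class SelfSubmitStall {

    /** Hypothetical stand-in for a blocking rejection policy: wait for queue space, then give up. */
    static final class TimedBlockingPolicy implements RejectedExecutionHandler {
        private final long timeoutMillis;

        TimedBlockingPolicy(long timeoutMillis) {
            this.timeoutMillis = timeoutMillis;
        }

        @Override
        public void rejectedExecution(Runnable task, ThreadPoolExecutor executor) {
            try {
                // The submitting thread parks here until a slot frees up or the timeout expires.
                if (!executor.getQueue().offer(task, timeoutMillis, TimeUnit.MILLISECONDS)) {
                    throw new RejectedExecutionException("queue still full after " + timeoutMillis + " ms");
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new RejectedExecutionException("interrupted while waiting for queue space", e);
            }
        }
    }

    /** Returns true if the pool's only worker stalled while submitting to its own full pool. */
    static boolean workerStallsOnOwnFullPool(long timeoutMillis) {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                1, 1, 0L, TimeUnit.MILLISECONDS,
                new LinkedBlockingQueue<>(1),
                new TimedBlockingPolicy(timeoutMillis));
        AtomicBoolean stalled = new AtomicBoolean(false);
        CountDownLatch done = new CountDownLatch(1);

        pool.execute(() -> {
            try {
                pool.execute(() -> { }); // fills the single queue slot
                pool.execute(() -> { }); // queue full, no spare thread: the policy blocks this worker
            } catch (RejectedExecutionException e) {
                // Only this worker could have drained the queue, and it was the one waiting.
                stalled.set(true);
            } finally {
                done.countDown();
            }
        });

        try {
            done.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } finally {
            pool.shutdownNow();
        }
        return stalled.get();
    }

    public static void main(String[] args) {
        System.out.println("worker stalled: " + workerStallsOnOwnFullPool(200));
    }
}
```

With a short timeout the worker recovers after the failed `offer()`. With a long timeout and every worker in the same position, the pool looks frozen to callers.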
**Solution**
- Add CacheFactory.buildCacheWithSyncRemovalListener() using Runnable::run
executor
- MetaCache.metaObjCache uses sync removal listener to avoid executor
contention
- Removal listener runs inline on calling thread instead of submitting to
executor
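As a rough illustration of why a `Runnable::run` executor sidesteps the contention, the sketch below uses a toy cache, not Caffeine and not the Doris `CacheFactory`. `TinyCache` and `InlineRemovalListenerSketch` are hypothetical names. The cache hands removal notifications to whatever `Executor` it is given, and with `Runnable::run` the listener runs on the thread that called `invalidateAll()`, so nothing is queued on a shared pool.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executor;
import java.util.function.BiConsumer;

final class InlineRemovalListenerSketch {

    /** Toy cache (hypothetical): dispatches removal notifications through an injected executor. */
    static final class TinyCache<K, V> {
        private final Map<K, V> entries = new ConcurrentHashMap<>();
        private final Executor listenerExecutor;
        private final BiConsumer<K, V> removalListener;

        TinyCache(Executor listenerExecutor, BiConsumer<K, V> removalListener) {
            this.listenerExecutor = listenerExecutor;
            this.removalListener = removalListener;
        }

        void put(K key, V value) {
            entries.put(key, value);
        }

        void invalidateAll() {
            // Iterate over a snapshot of the keys so removal does not disturb the iteration.
            for (K key : new ArrayList<>(entries.keySet())) {
                V value = entries.remove(key);
                if (value != null) {
                    listenerExecutor.execute(() -> removalListener.accept(key, value));
                }
            }
        }
    }

    /** Returns the names of the threads that ran the listener when the executor is Runnable::run. */
    static List<String> listenerThreadsWithDirectExecutor() {
        List<String> threads = new ArrayList<>();
        Executor direct = Runnable::run; // execute(task) simply calls task.run() on the caller
        TinyCache<String, String> cache =
                new TinyCache<>(direct, (k, v) -> threads.add(Thread.currentThread().getName()));
        cache.put("db1", "tables1");
        cache.put("db2", "tables2");
        cache.invalidateAll();
        return threads;
    }

    public static void main(String[] args) {
        System.out.println("caller thread:    " + Thread.currentThread().getName());
        System.out.println("listener threads: " + listenerThreadsWithDirectExecutor());
    }
}
```

The trade-off is that the listener's cost is paid by the invalidating thread, which is acceptable when the listener is cheap and must not depend on pool capacity.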
**Changes**
- CacheFactory: Add buildCacheWithSyncRemovalListener() and
buildCacheWithAsyncRemovalListener()
- MetaCache: Use buildCacheWithSyncRemovalListener() for metaObjCache
- Add MetaCacheDeadlockTest to verify fix
**Test**
Unit test reproduces deadlock with async removal listener and verifies fix
with sync removal listener.