zy-kkk opened a new pull request, #57856:
URL: https://github.com/apache/doris/pull/57856

    **Symptom:** External table queries hang indefinitely, FE process frozen.
   
   **User-facing impact:** Query threads blocked waiting for schema cache:
   
     ```
     "mysql-nio-pool-14981" TIMED_WAITING
        at LinkedBlockingQueue.offer(LinkedBlockingQueue.java:385)
        at ThreadPoolManager$BlockedPolicy.rejectedExecution(ThreadPoolManager.java:347)
        at ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1379)
        at CompletableFuture.asyncSupplyStage(CompletableFuture.java:1618)
        at CacheLoader.asyncReload(CacheLoader.java:188)
        at BoundedLocalCache.refreshIfNeeded(BoundedLocalCache.java:1214)
        at LocalLoadingCache.get(LocalLoadingCache.java:56)
        at ExternalSchemaCache.getSchemaValue(ExternalSchemaCache.java:86)
        at ExternalTable.getSchemaCacheValue(ExternalTable.java:371)
        at HMSExternalTable.getPartitionColumns(HMSExternalTable.java:288)
        at PruneFileScanPartition.pruneHivePartitions(PruneFileScanPartition.java:84)
   
     "CommonRefreshExecutor-63" TIMED_WAITING
        at LinkedBlockingQueue.offer(LinkedBlockingQueue.java:385)
        at ThreadPoolManager$BlockedPolicy.rejectedExecution(ThreadPoolManager.java:347)
        at ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1379)
        at CompletableFuture.asyncSupplyStage(CompletableFuture.java:1618)
        at CacheLoader.asyncReload(CacheLoader.java:188)
        at BoundedLocalCache.refreshIfNeeded(BoundedLocalCache.java:1214)
   
     "CommonRefreshExecutor-62" TIMED_WAITING
        at LinkedBlockingQueue.offer(LinkedBlockingQueue.java:385)
        at ThreadPoolManager$BlockedPolicy.rejectedExecution(ThreadPoolManager.java:347)
        at ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1379)
        at BoundedLocalCache.notifyRemoval(BoundedLocalCache.java:333)
        at BoundedLocalCache.removeNode(BoundedLocalCache.java:1882)
        at LocalManualCache.invalidateAll(LocalManualCache.java:150)
        at MetaCache.invalidateAll(MetaCache.java:121)
        at ExternalDatabase.setUnInitialized(ExternalDatabase.java:123)
   ```
   
    **Root cause**: Caffeine cache deadlock when:
     1. MetaCache uses a bounded executor (CommonRefreshExecutor: 64 threads + 640K queue) for both async operations and removal listeners
     2. The database cache removal listener calls tableCache.invalidateAll()
     3. The executor is full (all threads busy + queue full)
     4. Both async reload and the removal listener try to submit tasks to the full executor
     5. Deadlock: executor threads block in offer() waiting for queue space, but the queue can only drain once those same threads finish their current tasks
   
     Jstack evidence: 82 CommonRefreshExecutor threads blocked on LinkedBlockingQueue.offer().
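
The stall pattern described above can be reproduced in miniature with JDK classes only. The sketch below is an illustration, not Doris code: `BoundedExecutorStallDemo`, `blockingPolicy`, and `workerStallsOnOwnPool` are hypothetical names, and the rejection handler assumes a policy that waits on `offer` with a timeout, which is what the `TIMED_WAITING` frames in `LinkedBlockingQueue.offer` suggest. The pool is shrunk to one thread and a queue of one so the condition is reached immediately.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.RejectedExecutionHandler;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical demo class; not part of Doris or Caffeine.
class BoundedExecutorStallDemo {

    // Assumed blocking policy: wait up to timeoutMs for queue space, then reject.
    static RejectedExecutionHandler blockingPolicy(long timeoutMs) {
        return (task, pool) -> {
            try {
                if (!pool.getQueue().offer(task, timeoutMs, TimeUnit.MILLISECONDS)) {
                    throw new RejectedExecutionException("queue still full after " + timeoutMs + " ms");
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new RejectedExecutionException("interrupted while waiting for queue space", e);
            }
        };
    }

    // Returns true if the pool's only worker stalled while submitting follow-up work to its own pool.
    static boolean workerStallsOnOwnPool() {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                1, 1, 0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(1), blockingPolicy(200));
        AtomicBoolean stalled = new AtomicBoolean(false);
        CountDownLatch started = new CountDownLatch(1);
        CountDownLatch queueFilled = new CountDownLatch(1);
        CountDownLatch done = new CountDownLatch(1);

        // Task running on the worker, standing in for a cache operation that triggers a listener.
        pool.execute(() -> {
            started.countDown();
            try {
                queueFilled.await();
                // The queue is full and this thread is its only consumer,
                // so the submission cannot make progress until the timeout fires.
                pool.execute(() -> { });
            } catch (RejectedExecutionException e) {
                stalled.set(true);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } finally {
                done.countDown();
            }
        });

        try {
            started.await();
            pool.execute(() -> { }); // occupies the single queue slot
            queueFilled.countDown();
            done.await();
            pool.shutdown();
            pool.awaitTermination(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return stalled.get();
    }

    public static void main(String[] args) {
        System.out.println("worker stalled: " + workerStallsOnOwnPool());
    }
}
```

With a short timeout the worker eventually gives up and the demo terminates. With a long timeout and every worker in the same state, the pool makes no progress, which matches the frozen FE described in the symptom.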
   
   **Solution**
   
     - Add CacheFactory.buildCacheWithSyncRemovalListener(), which uses a Runnable::run executor
     - MetaCache.metaObjCache uses the sync removal listener to avoid executor contention
     - The removal listener runs inline on the calling thread instead of being submitted to the executor
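
As a rough illustration of why a `Runnable::run` executor avoids the contention, the JDK-only sketch below dispatches a callback through an `Executor` and checks which thread ran it. `DirectExecutorDemo` and `runsOnCallingThread` are hypothetical names used only for this example. In Caffeine the equivalent step would presumably be passing `Runnable::run` to the builder's `executor(...)` setting, but the actual body of `buildCacheWithSyncRemovalListener()` is not shown in this message.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executor;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical demo class; not part of Doris or Caffeine.
class DirectExecutorDemo {

    // Dispatches a callback through the executor and reports whether it ran on the caller's thread.
    static boolean runsOnCallingThread(Executor executor) {
        Thread caller = Thread.currentThread();
        CompletableFuture<Thread> ranOn = new CompletableFuture<>();
        executor.execute(() -> ranOn.complete(Thread.currentThread()));
        return ranOn.join() == caller;
    }

    public static void main(String[] args) {
        // A direct executor: execute(task) simply calls task.run() on the current thread,
        // so nothing is queued and no pool capacity is needed.
        Executor direct = Runnable::run;
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            System.out.println("direct executor runs inline: " + runsOnCallingThread(direct));
            System.out.println("pool executor runs inline: " + runsOnCallingThread(pool));
        } finally {
            pool.shutdown();
        }
    }
}
```

The trade-off is that an inline listener adds its own running time to whichever thread triggers the removal, so this approach suits listeners that are short and do not block.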
   
   **Changes**
   
     - CacheFactory: Add buildCacheWithSyncRemovalListener() and buildCacheWithAsyncRemovalListener()
     - MetaCache: Use buildCacheWithSyncRemovalListener() for metaObjCache
     - Add MetaCacheDeadlockTest to verify fix
   
   **Test**
   
     The unit test reproduces the deadlock with the async removal listener and verifies the fix with the sync removal listener.

