[ 
https://issues.apache.org/jira/browse/KAFKA-17142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chia-Ping Tsai resolved KAFKA-17142.
------------------------------------
    Fix Version/s: 3.9.0
         Assignee: PoAn Yang  (was: Chia-Ping Tsai)
       Resolution: Fixed

> Fix deadlock caused by LogManagerTest#testLogRecoveryMetrics
> ------------------------------------------------------------
>
>                 Key: KAFKA-17142
>                 URL: https://issues.apache.org/jira/browse/KAFKA-17142
>             Project: Kafka
>          Issue Type: Test
>            Reporter: Chia-Ping Tsai
>            Assignee: PoAn Yang
>            Priority: Major
>             Fix For: 3.9.0
>
>         Attachments: log
>
>
> `testLogRecoveryMetrics` uses mock scheduler to create UnifiedLog [0], and 
> the mock scheduler does NOT have true thread poll, and that is the root cause!
> We create recovery threads [1] for each data folder, and they will pass the 
> action: `writeToFileForTruncation` to scheduler [2]. The action requires the 
> read lock [3], so the deadlock is produced when one thread executes the 
> action from another thread. For example:
> 1. thread_a is handling dir_a, and it holds the writelock_a
> 2. thread_a pass action_a: `writeToFileForTruncation` to mock scheduler
> 3. thread_b is handling dir_b, and it holds the writelock_b
> 4. thread_b pass action_b: `writeToFileForTruncation` to mock scheduler
> 5. thread_b (holding writelock_b) handle the action_a, so it requires the 
> readlock_a
> 6. thread_a (holding writelock_a) handle the action_b, so it requires the 
> readlock_b
> so lucky we have a deadlock :cry
> This deadlock happens due to the mock scheduler, so that is a issue belonging 
> to test. We can fix it by a simple solution - add some delay when creating 
> next UnifiedLog
> Or we can do a bit refactor to production code: create snapshot of `epochs` 
> when we are holding write lock! That means `writeToFileForTruncation` does 
> not take lock anymore. For example:
> {code:java}
>     public void truncateFromStartAsyncFlush(long startOffset) {
>         lock.writeLock().lock();
>         try {
>             List<EpochEntry> removedEntries = truncateFromStart(epochs, 
> startOffset);
>             if (!removedEntries.isEmpty()) {
>                 ...
>                 List<EpochEntry> entries = new ArrayList<>(epochs.values());
>                 scheduler.scheduleOnce("leader-epoch-cache-flush-" + 
> topicPartition, () -> checkpoint.writeForTruncation(entries));
>                 ...
>             }
>         } finally {
>             lock.writeLock().unlock();
>         }
>     }
> {code}
> [0] 
> https://github.com/apache/kafka/blob/808498e9391dab292a7ccd8a0bf3713f444f9d2f/core/src/test/scala/unit/kafka/log/LogManagerTest.scala#L968
> [1] 
> https://github.com/apache/kafka/blob/808498e9391dab292a7ccd8a0bf3713f444f9d2f/core/src/main/scala/kafka/log/LogManager.scala#L434
> [2] 
> https://github.com/apache/kafka/blob/808498e9391dab292a7ccd8a0bf3713f444f9d2f/storage/src/main/java/org/apache/kafka/storage/internals/epoch/LeaderEpochFileCache.java#L351
> [3] 
> https://github.com/apache/kafka/blob/808498e9391dab292a7ccd8a0bf3713f444f9d2f/storage/src/main/java/org/apache/kafka/storage/internals/epoch/LeaderEpochFileCache.java#L528
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to