[ 
https://issues.apache.org/jira/browse/KAFKA-16225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17832619#comment-17832619
 ] 

Chia-Ping Tsai commented on KAFKA-16225:
----------------------------------------

It seems to me the root cause is `LogDirFailureHandler` does not clean 
`directoryIds` in holding `replicaStateChangeLock`, and hence metadata event 
thread can see the intermediate state of failure handle and then assume the 
deleted folder is online 
([https://github.com/apache/kafka/blob/trunk/core/src/main/scala/kafka/cluster/Partition.scala#L877])
{code:scala}
  private def createLogInAssignedDirectoryId(partitionState: 
LeaderAndIsrPartitionState, highWatermarkCheckpoints: OffsetCheckpoints, 
topicId: Option[Uuid], targetLogDirectoryId: Option[Uuid]): Unit = {
    targetLogDirectoryId match {
      case Some(directoryId) =>
        // [CHIA] logManager.onlineLogDirId(directoryId) return true
        // there are two results:
        // 1) KafkaStorageException is thrown by `LogManager.getOrCreateLog`
        // 2) log is hosted by another directory (id) rather than 
targetLogDirectoryId
          if (logManager.onlineLogDirId(directoryId) || 
!logManager.hasOfflineLogDirs() || directoryId == DirectoryId.UNASSIGNED) {
          createLogIfNotExists(partitionState.isNew, isFutureReplica = false, 
highWatermarkCheckpoints, topicId, targetLogDirectoryId)
        } else {
          warn(s"Skipping creation of log because there are potentially offline 
log " +
            s"directories and log may already exist there. 
directoryId=$directoryId, " +
            s"topicId=$topicId, targetLogDirectoryId=$targetLogDirectoryId")
        }

      case None =>
        createLogIfNotExists(partitionState.isNew, isFutureReplica = false, 
highWatermarkCheckpoints, topicId)
    }
  }
{code}
Hence, there are two options to stabilize the `testIOExceptionDuringLogRoll`

1. call `logManager.handleLogDirFailure(dir)` in holding 
`replicaStateChangeLock` 
([https://github.com/apache/kafka/blob/trunk/core/src/main/scala/kafka/server/ReplicaManager.scala#L2473])
 to avoid race condition.

2. change the assert to allow both empty folder and the folder which having 
different directory (id). 
[https://github.com/apache/kafka/blob/trunk/core/src/test/scala/unit/kafka/utils/TestUtils.scala#L1715]

BTW, the option 2 means we allow to using other directory to replace 
`targetLogDirectoryId` when creating log. That violates the comment: "@param 
targetLogDirectoryId The directory Id that should host the the partition's 
topic." 
([https://github.com/apache/kafka/blob/trunk/core/src/main/scala/kafka/log/LogManager.scala#L1015])

> Flaky test suite LogDirFailureTest
> ----------------------------------
>
>                 Key: KAFKA-16225
>                 URL: https://issues.apache.org/jira/browse/KAFKA-16225
>             Project: Kafka
>          Issue Type: Bug
>          Components: core, unit tests
>            Reporter: Greg Harris
>            Assignee: Omnia Ibrahim
>            Priority: Major
>              Labels: flaky-test
>
> I see this failure on trunk and in PR builds for multiple methods in this 
> test suite:
> {noformat}
> org.opentest4j.AssertionFailedError: expected: <true> but was: <false>    
> at 
> org.junit.jupiter.api.AssertionFailureBuilder.build(AssertionFailureBuilder.java:151)
>     
> at 
> org.junit.jupiter.api.AssertionFailureBuilder.buildAndThrow(AssertionFailureBuilder.java:132)
>     
> at org.junit.jupiter.api.AssertTrue.failNotTrue(AssertTrue.java:63)    
> at org.junit.jupiter.api.AssertTrue.assertTrue(AssertTrue.java:36)    
> at org.junit.jupiter.api.AssertTrue.assertTrue(AssertTrue.java:31)    
> at org.junit.jupiter.api.Assertions.assertTrue(Assertions.java:179)    
> at kafka.utils.TestUtils$.causeLogDirFailure(TestUtils.scala:1715)    
> at 
> kafka.server.LogDirFailureTest.testProduceAfterLogDirFailureOnLeader(LogDirFailureTest.scala:186)
>     
> at 
> kafka.server.LogDirFailureTest.testIOExceptionDuringLogRoll(LogDirFailureTest.scala:70){noformat}
> It appears this assertion is failing
> [https://github.com/apache/kafka/blob/f54975c33135140351c50370282e86c49c81bbdd/core/src/test/scala/unit/kafka/utils/TestUtils.scala#L1715]
> The other error which is appearing is this:
> {noformat}
> org.opentest4j.AssertionFailedError: Unexpected exception type thrown, 
> expected: <java.util.concurrent.ExecutionException> but was: 
> <java.lang.IllegalStateException>    
> at 
> org.junit.jupiter.api.AssertionFailureBuilder.build(AssertionFailureBuilder.java:151)
>     
> at org.junit.jupiter.api.AssertThrows.assertThrows(AssertThrows.java:67)    
> at org.junit.jupiter.api.AssertThrows.assertThrows(AssertThrows.java:35)    
> at org.junit.jupiter.api.Assertions.assertThrows(Assertions.java:3111)    
> at 
> kafka.server.LogDirFailureTest.testProduceErrorsFromLogDirFailureOnLeader(LogDirFailureTest.scala:164)
>     
> at 
> kafka.server.LogDirFailureTest.testProduceErrorFromFailureOnLogRoll(LogDirFailureTest.scala:64){noformat}
> Failures appear to have started in this commit, but this does not indicate 
> that this commit is at fault: 
> [https://github.com/apache/kafka/tree/3d95a69a28c2d16e96618cfa9a1eb69180fb66ea]
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to