[
https://issues.apache.org/jira/browse/KAFKA-16225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17832619#comment-17832619
]
Chia-Ping Tsai commented on KAFKA-16225:
----------------------------------------
It seems to me the root cause is `LogDirFailureHandler` does not clean
`directoryIds` in holding `replicaStateChangeLock`, and hence metadata event
thread can see the intermediate state of failure handle and then assume the
deleted folder is online
([https://github.com/apache/kafka/blob/trunk/core/src/main/scala/kafka/cluster/Partition.scala#L877])
{code:scala}
private def createLogInAssignedDirectoryId(partitionState:
LeaderAndIsrPartitionState, highWatermarkCheckpoints: OffsetCheckpoints,
topicId: Option[Uuid], targetLogDirectoryId: Option[Uuid]): Unit = {
targetLogDirectoryId match {
case Some(directoryId) =>
// [CHIA] logManager.onlineLogDirId(directoryId) return true
// there are two results:
// 1) KafkaStorageException is thrown by `LogManager.getOrCreateLog`
// 2) log is hosted by another directory (id) rather than
targetLogDirectoryId
if (logManager.onlineLogDirId(directoryId) ||
!logManager.hasOfflineLogDirs() || directoryId == DirectoryId.UNASSIGNED) {
createLogIfNotExists(partitionState.isNew, isFutureReplica = false,
highWatermarkCheckpoints, topicId, targetLogDirectoryId)
} else {
warn(s"Skipping creation of log because there are potentially offline
log " +
s"directories and log may already exist there.
directoryId=$directoryId, " +
s"topicId=$topicId, targetLogDirectoryId=$targetLogDirectoryId")
}
case None =>
createLogIfNotExists(partitionState.isNew, isFutureReplica = false,
highWatermarkCheckpoints, topicId)
}
}
{code}
Hence, there are two options to stabilize the `testIOExceptionDuringLogRoll`
1. call `logManager.handleLogDirFailure(dir)` in holding
`replicaStateChangeLock`
([https://github.com/apache/kafka/blob/trunk/core/src/main/scala/kafka/server/ReplicaManager.scala#L2473])
to avoid race condition.
2. change the assert to allow both empty folder and the folder which having
different directory (id).
[https://github.com/apache/kafka/blob/trunk/core/src/test/scala/unit/kafka/utils/TestUtils.scala#L1715]
BTW, the option 2 means we allow to using other directory to replace
`targetLogDirectoryId` when creating log. That violates the comment: "@param
targetLogDirectoryId The directory Id that should host the the partition's
topic."
([https://github.com/apache/kafka/blob/trunk/core/src/main/scala/kafka/log/LogManager.scala#L1015])
> Flaky test suite LogDirFailureTest
> ----------------------------------
>
> Key: KAFKA-16225
> URL: https://issues.apache.org/jira/browse/KAFKA-16225
> Project: Kafka
> Issue Type: Bug
> Components: core, unit tests
> Reporter: Greg Harris
> Assignee: Omnia Ibrahim
> Priority: Major
> Labels: flaky-test
>
> I see this failure on trunk and in PR builds for multiple methods in this
> test suite:
> {noformat}
> org.opentest4j.AssertionFailedError: expected: <true> but was: <false>
> at
> org.junit.jupiter.api.AssertionFailureBuilder.build(AssertionFailureBuilder.java:151)
>
> at
> org.junit.jupiter.api.AssertionFailureBuilder.buildAndThrow(AssertionFailureBuilder.java:132)
>
> at org.junit.jupiter.api.AssertTrue.failNotTrue(AssertTrue.java:63)
> at org.junit.jupiter.api.AssertTrue.assertTrue(AssertTrue.java:36)
> at org.junit.jupiter.api.AssertTrue.assertTrue(AssertTrue.java:31)
> at org.junit.jupiter.api.Assertions.assertTrue(Assertions.java:179)
> at kafka.utils.TestUtils$.causeLogDirFailure(TestUtils.scala:1715)
> at
> kafka.server.LogDirFailureTest.testProduceAfterLogDirFailureOnLeader(LogDirFailureTest.scala:186)
>
> at
> kafka.server.LogDirFailureTest.testIOExceptionDuringLogRoll(LogDirFailureTest.scala:70){noformat}
> It appears this assertion is failing
> [https://github.com/apache/kafka/blob/f54975c33135140351c50370282e86c49c81bbdd/core/src/test/scala/unit/kafka/utils/TestUtils.scala#L1715]
> The other error which is appearing is this:
> {noformat}
> org.opentest4j.AssertionFailedError: Unexpected exception type thrown,
> expected: <java.util.concurrent.ExecutionException> but was:
> <java.lang.IllegalStateException>
> at
> org.junit.jupiter.api.AssertionFailureBuilder.build(AssertionFailureBuilder.java:151)
>
> at org.junit.jupiter.api.AssertThrows.assertThrows(AssertThrows.java:67)
> at org.junit.jupiter.api.AssertThrows.assertThrows(AssertThrows.java:35)
> at org.junit.jupiter.api.Assertions.assertThrows(Assertions.java:3111)
> at
> kafka.server.LogDirFailureTest.testProduceErrorsFromLogDirFailureOnLeader(LogDirFailureTest.scala:164)
>
> at
> kafka.server.LogDirFailureTest.testProduceErrorFromFailureOnLogRoll(LogDirFailureTest.scala:64){noformat}
> Failures appear to have started in this commit, but this does not indicate
> that this commit is at fault:
> [https://github.com/apache/kafka/tree/3d95a69a28c2d16e96618cfa9a1eb69180fb66ea]
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)