[ 
https://issues.apache.org/jira/browse/KAFKA-14242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Luke Chen updated KAFKA-14242:
------------------------------
    Description: 
Recently, many builds have failed (and been terminated) because of a 
core:unitTest failure. The failure messages look like this:
{code:java}
FAILURE: Build failed with an exception.
[2022-09-14T09:51:52.190Z] 
[2022-09-14T09:51:52.190Z] * What went wrong:
[2022-09-14T09:51:52.190Z] Execution failed for task ':core:unitTest'.
[2022-09-14T09:51:52.190Z] > Process 'Gradle Test Executor 128' finished with non-zero exit value 1
{code}


After investigation, I found one cause (there may be others). The 
{{BrokerMetadataPublisherTest#testReloadUpdatedFilesWithoutConfigChange}} 
test creates a logManager twice, but cleanup closes only one of them, so the 
other instance's log cleaner keeps running. Meanwhile, the temp log dirs are 
deleted, so the still-running instance ends up calling {{Exit.halt(1)}}, which 
produces the error we saw in Gradle. This is the code that runs when we hit an 
IOException in all of our log dirs:
{code:java}
fatal(s"Shutdown broker because all log dirs in ${logDirs.mkString(", ")} have failed")
Exit.halt(1)
{code}
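
One way to make this kind of leak harder to introduce is to register every resource at the moment the test creates it and close all of them in cleanup. The following is a minimal sketch of that idea, not the actual fix for this test. The class name TestResources and its methods are hypothetical and are not part of Kafka.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical test helper (not part of Kafka): remembers every resource a
// test creates so that cleanup cannot miss a second instance.
final class TestResources implements AutoCloseable {
    private final Deque<AutoCloseable> resources = new ArrayDeque<>();

    // Register a resource at creation time and hand it back to the caller.
    <T extends AutoCloseable> T register(T resource) {
        resources.push(resource);
        return resource;
    }

    // Number of registered resources that have not been closed yet.
    int openCount() {
        return resources.size();
    }

    // Close in reverse creation order and keep going if one close fails,
    // so a single failure cannot leave later resources running.
    @Override
    public void close() {
        RuntimeException failure = null;
        while (!resources.isEmpty()) {
            try {
                resources.pop().close();
            } catch (Exception e) {
                if (failure == null) {
                    failure = new RuntimeException("cleanup failed");
                }
                failure.addSuppressed(e);
            }
        }
        if (failure != null) {
            throw failure;
        }
    }
}
```

In a test like the one above, each logManager would be registered as it is created, for example through a small wrapper whose close() calls the component's shutdown method, so that a single close() in cleanup covers both instances.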

Why does the test sometimes pass and sometimes fail? During test cluster 
close, we shut down the broker first and then the other components, and the 
log cleaner is triggered on an interval. If the cluster closes fast enough for 
the test to finish before the cleaner fires, the test passes. Otherwise, the 
JVM exits with code 1.
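
The timing dependence can be illustrated outside Kafka with a periodic task that is never stopped. This is only a sketch under stated assumptions: LeakedCleanerDemo and its method are hypothetical names, a ScheduledExecutorService stands in for the log cleaner, and a latch stands in for the call to Exit.halt(1) so the example can report the outcome instead of halting the JVM.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical demo (not Kafka code): a periodic task that nobody stops
// outlives the directory it works on.
final class LeakedCleanerDemo {

    // Returns true if the leaked task ran while the rest of teardown was still
    // in progress and found its directory gone. intervalMs is the task period,
    // teardownMs is how long the remaining teardown takes.
    static boolean leakedTaskSeesMissingDir(long intervalMs, long teardownMs) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        CountDownLatch sawMissingDir = new CountDownLatch(1);
        try {
            Path dir = Files.createTempDirectory("leaked-cleaner-demo");
            // Stand-in for the log cleaner: checks its directory on every tick.
            scheduler.scheduleWithFixedDelay(() -> {
                if (!Files.isDirectory(dir)) {
                    sawMissingDir.countDown(); // the real code would halt here
                }
            }, intervalMs, intervalMs, TimeUnit.MILLISECONDS);

            // Cleanup deletes the directory while the task is still scheduled.
            Files.delete(dir);

            // Simulate the remaining teardown; returns early once a tick fires.
            return sawMissingDir.await(teardownMs, TimeUnit.MILLISECONDS);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } finally {
            // Stop the scheduler so the demo itself does not leak a thread.
            scheduler.shutdownNow();
        }
    }
}
```

With a short interval and a slow teardown the task fires and sees the missing directory. With a long interval and a fast teardown it never gets the chance, which corresponds to the runs where the test passes.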



> Hanging logManager in testReloadUpdatedFilesWithoutConfigChange test
> --------------------------------------------------------------------
>
>                 Key: KAFKA-14242
>                 URL: https://issues.apache.org/jira/browse/KAFKA-14242
>             Project: Kafka
>          Issue Type: Test
>            Reporter: Luke Chen
>            Assignee: Luke Chen
>            Priority: Major
>
> Recently, many builds have failed (and been terminated) because of a 
> core:unitTest failure. The failure messages look like this:
> {code:java}
> FAILURE: Build failed with an exception.
> [2022-09-14T09:51:52.190Z] 
> [2022-09-14T09:51:52.190Z] * What went wrong:
> [2022-09-14T09:51:52.190Z] Execution failed for task ':core:unitTest'.
> [2022-09-14T09:51:52.190Z] > Process 'Gradle Test Executor 128' finished with non-zero exit value 1
> {code}
> After investigation, I found one cause (there may be others). The 
> {{BrokerMetadataPublisherTest#testReloadUpdatedFilesWithoutConfigChange}} 
> test creates a logManager twice, but cleanup closes only one of them, so the 
> other instance's log cleaner keeps running. Meanwhile, the temp log dirs are 
> deleted, so the still-running instance ends up calling {{Exit.halt(1)}}, 
> which produces the error we saw in Gradle. This is the code that runs when we 
> hit an IOException in all of our log dirs:
> {code:java}
> fatal(s"Shutdown broker because all log dirs in ${logDirs.mkString(", ")} have failed")
> Exit.halt(1)
> {code}
> Why does the test sometimes pass and sometimes fail? During test cluster 
> close, we shut down the broker first and then the other components, and the 
> log cleaner is triggered on an interval. If the cluster closes fast enough 
> for the test to finish before the cleaner fires, the test passes. Otherwise, 
> the JVM exits with code 1.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
