[ 
https://issues.apache.org/jira/browse/KAFKA-10723?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kowshik Prakasam updated KAFKA-10723:
-------------------------------------
    Description: 
*TL;DR:*

The asynchronous shutdown in {{LogManager}} has the shortcoming that if during 
shutdown any of the internal futures fail, then we do not always ensure that 
all futures are completed before {{LogManager.shutdown}} returns. As a result, 
despite the shut down completed message from KafkaServer is seen in the error 
logs, some futures continue to run from inside LogManager attempting to close 
the logs.

*Description:*

When LogManager is shutting down, exceptions in log closure are handled 
[here|https://github.com/apache/kafka/blob/trunk/core/src/main/scala/kafka/log/LogManager.scala#L497-L501].
 However, this 
[line|https://github.com/apache/kafka/blob/trunk/core/src/main/scala/kafka/log/LogManager.scala#L502]
 in the finally clause shutsdown the thread pools *asynchronously*.
{noformat}
threadPools.foreach(_.shutdown()){noformat}
initiates an orderly shutdown (for each thread pool) in which previously 
submitted tasks are executed, but no new tasks will be accepted (see javadoc 
link 
[here|https://docs.oracle.com/javase/7/docs/api/java/util/concurrent/ExecutorService.html#shutdown()]).
 As a result, if there is an exception during log closure, some of the thread 
pools which are closing logs could be leaked and continue to run in the 
background, after the control returns to the caller (i.e. {{KafkaServer}}). As 
a result, even after the _shut down completed_ message is seen in the error 
logs (originating from {{KafkaServer}} shutdown sequence), log closures 
continue to happen in the background. 
  

*Proposed options for fixes:*

It seems useful that we maintain the contract with {{KafkaServer}} that after 
{{LogManager.shutdown}} is called once, all tasks that close the logs are 
guaranteed to have completed before the call returns. There are probably couple 
ways to fix this:
 # A simple fix could be to replace `threadPools.foreach(_.shutdown())` with 
`threadPools.foreach(_.awaitTermination())`. This ensures that we wait for all 
threads to be shutdown before returning `LogManager.shutdown`.
 # An advanced fix could be to just skip creating of checkpoint and clean 
shutdown file only for the affected directory if any of its futures throw an 
error. We continue to wait for all futures to complete for all directories.

  was:
*TL;DR:*

The asynchronous shutdown in {{LogManager}} has the shortcoming that if during 
shutdown any of the internal futures fail, then we do not always ensure that 
all futures are completed before {{LogManager.shutdown}} returns. As a result, 
despite the shut down completed message from KafkaServer is seen in the error 
logs, some futures continue to run from inside LogManager attempting to close 
the logs.

*Description:*

When LogManager is shutting down, exceptions in log closure are handled 
[here|https://github.com/apache/kafka/blob/trunk/core/src/main/scala/kafka/log/LogManager.scala#L497-L501].
 However, this 
[line|https://github.com/apache/kafka/blob/trunk/core/src/main/scala/kafka/log/LogManager.scala#L502]
 in the finally clause may still throw an exception once again (for ex: if any 
of the tasks fail during thread pool shutdown):
{noformat}
threadPools.foreach(_.shutdown()){noformat}
initiates an orderly shutdown (for each thread pool) in which previously 
submitted tasks are executed, but no new tasks will be accepted (see javadoc 
link 
[here|https://docs.oracle.com/javase/7/docs/api/java/util/concurrent/ExecutorService.html#shutdown()]).
 It is possible that any of the other running threads that asynchronously close 
logs may throw an exception. When this happens, {{ExecutoreService.shutdown()}} 
can throw an exception. This situation is not handled inside the finally clause 
which means that some of the thread pools which are closing logs could be 
leaked and continue to run in the background, after the control returns to the 
caller (i.e. {{KafkaServer}}).

As a result, even after the _shut down completed_ message is seen in the error 
logs (originating from {{KafkaServer}} shutdown sequence), log closures 
continue to happen in the background. 
  

*Proposed fix:*

It seems useful that we maintain the contract with {{KafkaServer}} that after 
{{LogManager.shutdown}} is called once, all thread pools that close the logs 
are guaranteed to be attempted shutdown once before the control returns to the 
caller (either via an exception case or a success case). Supposing when 
multiple exceptions are raised when closing the logs (from different thread 
pools), probably we can just raise the first exception among these to the 
caller of LogManager.shutdown (i.e. {{KafkaServer}} ) while optionally logging 
the rest of the exceptions to stderr.


> LogManager shutdown error handler doesn't shutdown all internal thread pools
> ----------------------------------------------------------------------------
>
>                 Key: KAFKA-10723
>                 URL: https://issues.apache.org/jira/browse/KAFKA-10723
>             Project: Kafka
>          Issue Type: Bug
>            Reporter: Kowshik Prakasam
>            Priority: Major
>
> *TL;DR:*
> The asynchronous shutdown in {{LogManager}} has the shortcoming that if 
> during shutdown any of the internal futures fail, then we do not always 
> ensure that all futures are completed before {{LogManager.shutdown}} returns. 
> As a result, despite the shut down completed message from KafkaServer is seen 
> in the error logs, some futures continue to run from inside LogManager 
> attempting to close the logs.
> *Description:*
> When LogManager is shutting down, exceptions in log closure are handled 
> [here|https://github.com/apache/kafka/blob/trunk/core/src/main/scala/kafka/log/LogManager.scala#L497-L501].
>  However, this 
> [line|https://github.com/apache/kafka/blob/trunk/core/src/main/scala/kafka/log/LogManager.scala#L502]
>  in the finally clause shutsdown the thread pools *asynchronously*.
> {noformat}
> threadPools.foreach(_.shutdown()){noformat}
> initiates an orderly shutdown (for each thread pool) in which previously 
> submitted tasks are executed, but no new tasks will be accepted (see javadoc 
> link 
> [here|https://docs.oracle.com/javase/7/docs/api/java/util/concurrent/ExecutorService.html#shutdown()]).
>  As a result, if there is an exception during log closure, some of the thread 
> pools which are closing logs could be leaked and continue to run in the 
> background, after the control returns to the caller (i.e. {{KafkaServer}}). 
> As a result, even after the _shut down completed_ message is seen in the 
> error logs (originating from {{KafkaServer}} shutdown sequence), log closures 
> continue to happen in the background. 
>   
> *Proposed options for fixes:*
> It seems useful that we maintain the contract with {{KafkaServer}} that after 
> {{LogManager.shutdown}} is called once, all tasks that close the logs are 
> guaranteed to have completed before the call returns. There are probably 
> couple ways to fix this:
>  # A simple fix could be to replace `threadPools.foreach(_.shutdown())` with 
> `threadPools.foreach(_.awaitTermination())`. This ensures that we wait for 
> all threads to be shutdown before returning `LogManager.shutdown`.
>  # An advanced fix could be to just skip creating of checkpoint and clean 
> shutdown file only for the affected directory if any of its futures throw an 
> error. We continue to wait for all futures to complete for all directories.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to