[ 
https://issues.apache.org/jira/browse/KAFKA-1108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neha Narkhede updated KAFKA-1108:
---------------------------------
    Labels: newbie  (was: )

> when controlled shutdown attempt fails, the reason is not always logged
> -----------------------------------------------------------------------
>
>                 Key: KAFKA-1108
>                 URL: https://issues.apache.org/jira/browse/KAFKA-1108
>             Project: Kafka
>          Issue Type: Bug
>            Reporter: Jason Rosenberg
>              Labels: newbie
>             Fix For: 0.9.0
>
>
> KafkaServer.controlledShutdown() initiates a controlled shutdown and, if an 
> attempt fails, retries it.
> Looking at the code, there are 2 ways an attempt can fail. One is an error 
> response from the controller, which is logged by this code:
> {code}
> info("Remaining partitions to move: %s".format(shutdownResponse.partitionsRemaining.mkString(",")))
> info("Error code from controller: %d".format(shutdownResponse.errorCode))
> {code}
> Alternatively, there could be an IOException, with this code executed:
> {code}
>             catch {
>               case ioe: java.io.IOException =>
>                 channel.disconnect()
>                 channel = null
>                 // ignore and try again
>             }
> {code}
> And then finally, in either case:
> {code}
>           if (!shutdownSuceeded) {
>             Thread.sleep(config.controlledShutdownRetryBackoffMs)
>             warn("Retrying controlled shutdown after the previous attempt failed...")
>           }
> {code}
> It would be nice if the reason for the failure were logged in either case. 
> For the IOException, I'd be happy with an ioe.getMessage() instead of a full 
> stack trace, as Kafka in general tends to be too willing to dump IOException 
> stack traces!
> I suspect that, in my case, the actual IOException is a socket timeout: the 
> time between the initial "Starting controlled shutdown...." and the first 
> "Retrying..." message is usually about 35 seconds (the socket timeout + the 
> controlled shutdown retry backoff).  So the real issue in this case seems to 
> be that controlled shutdown is taking too long.  It would seem sensible to 
> have the controller instead report back to the server (before the socket 
> timeout) that more time is needed.
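>
> A minimal sketch of the logging asked for above, not the actual KafkaServer 
> code: each attempt records why it failed, and that reason goes into the 
> warning. ShutdownResponse, attemptShutdown and sendRequest are hypothetical 
> stand-ins, and treating error code 0 with no remaining partitions as success 
> is an assumption.
> {code}
> import java.io.IOException
>
> object ControlledShutdownLoggingSketch {
>   // Hypothetical stand-in for the controller's reply; not Kafka's own class.
>   final case class ShutdownResponse(errorCode: Int, partitionsRemaining: Set[String])
>
>   /** Runs up to maxRetries attempts; returns (succeeded, collected log lines). */
>   def attemptShutdown(maxRetries: Int, backoffMs: Long)
>                      (sendRequest: () => ShutdownResponse): (Boolean, Vector[String]) = {
>     var log = Vector.empty[String]
>     var succeeded = false
>     var remaining = maxRetries
>     while (!succeeded && remaining > 0) {
>       remaining -= 1
>       // Reason the current attempt failed, if it did.
>       var failure: Option[String] = None
>       try {
>         val response = sendRequest()
>         // Assumption: error code 0 and nothing left to move means success.
>         if (response.errorCode == 0 && response.partitionsRemaining.isEmpty)
>           succeeded = true
>         else
>           failure = Some("error code %d from controller, remaining partitions: %s"
>             .format(response.errorCode, response.partitionsRemaining.mkString(",")))
>       } catch {
>         case ioe: IOException =>
>           // Exception class and message only, no stack trace.
>           failure = Some("%s: %s".format(ioe.getClass.getName, ioe.getMessage))
>       }
>       failure.foreach { reason =>
>         log :+= "WARN Controlled shutdown attempt failed (%s)".format(reason)
>         if (remaining > 0) {
>           Thread.sleep(backoffMs)
>           log :+= "WARN Retrying controlled shutdown after the previous attempt failed..."
>         }
>       }
>     }
>     (succeeded, log)
>   }
> }
> {code}
> If the first attempt hits a socket timeout, the first warning would name 
> java.net.SocketTimeoutException and its message instead of leaving the cause 
> unstated.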



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
