> On July 21, 2015, 11:15 p.m., Gwen Shapira wrote:
> > core/src/main/scala/kafka/network/SocketServer.scala, line 465
> > <https://reviews.apache.org/r/36664/diff/1/?file=1018238#file1018238line465>
> >
> >     Turns out that catching Throwable is a really bad idea: 
> > https://www.sumologic.com/2014/05/05/why-you-should-never-catch-throwable-in-scala/
> 
> Jiangjie Qin wrote:
>     Ah... Didn't know that before. I explicitly listed the exceptions.
> 
> Guozhang Wang wrote:
>     Searching ": Throwable" gives me 180+ cases in code base :P Though many 
> of them are from unit tests (which, arguably maybe OK) there are still a lot 
> in the core package. I agree that we should avoid catching Throwable whenever 
> possible, which will also help enforcing the developers to think about 
> possible checked exceptions in the calling trace.
> 
> Gwen Shapira wrote:
>     I know :(
>     I'm not sure going over and converting everything is worth the effort, 
> although it could be a nice newbie JIRA.
> 
> Jiangjie Qin wrote:
>     Maybe we can simply change that to catch Throwables except 
> ControlThrowables? That might be a simple search and replace.

Possible. Definitely not in this JIRA, though :)
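For anyone picking this up later, the pattern under discussion can be sketched as follows (a minimal illustration, not code from this patch; `CatchSketch`, `guarded`, and `guardedNonFatal` are made-up names):

```scala
import scala.util.control.{ControlThrowable, NonFatal}

// Sketch of "catch Throwable except ControlThrowable". ControlThrowable is
// what Scala uses internally for flow control (e.g. Breaks), so swallowing
// it breaks constructs like `breakable { ... break() ... }`.
object CatchSketch {
  // Explicit form: rethrow ControlThrowable, hand everything else to a handler.
  def guarded[T](block: => T)(fallback: Throwable => T): T =
    try block
    catch {
      case c: ControlThrowable => throw c // flow control must propagate
      case e: Throwable        => fallback(e)
    }

  // Idiomatic form: the standard library's NonFatal extractor matches every
  // Throwable except ControlThrowable, VirtualMachineError, ThreadDeath,
  // InterruptedException and LinkageError.
  def guardedNonFatal[T](block: => T)(fallback: Throwable => T): T =
    try block
    catch {
      case NonFatal(e) => fallback(e)
    }
}
```

`scala.util.control.NonFatal` (available since Scala 2.10) makes this a one-line change per catch clause, which would help with the search-and-replace idea above.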


> On July 21, 2015, 11:15 p.m., Gwen Shapira wrote:
> > core/src/main/scala/kafka/network/SocketServer.scala, line 400
> > <https://reviews.apache.org/r/36664/diff/1/?file=1018238#file1018238line400>
> >
> >     So in case of unexpected exception, we log an error and keep running?
> >     
> >     Isn't it better to kill the processor, since we don't know what's the 
> > state of the system? If the acceptor keeps placing messages in the queue 
> > for a dead processor, isn't it a separate issue?
> 
> Jiangjie Qin wrote:
>     This part I'm not quite sure about. I am not very experienced with error 
> handling in such cases, so please correct me if I missed something.
>     Here is what I thought.
>     
>     The way it currently works is that the acceptor will 
>     1. accept a new connection request and create a new socket channel
>     2. choose a processor and put the socket channel into the processor's new 
> connection queue
>     
>     The processor just takes the socket channels from the queue and 
> registers them with the selector.
>     If the processor hits an uncaught exception while running, there are 
> several possibilities. 
>     Case 1: The exception came from one socket channel. 
>     Case 2: The exception was associated with a bad request. 
>     In case 1, ideally we should just disconnect that socket channel without 
> affecting other socket channels.
>     In case 2, I think we should log the error and skip the message - 
> assuming the client will retry sending data if no response is received within 
> a given period of time.
>     
>     I am not sure letting the processor exit is a good idea, because that 
> would let a badly behaving client take down the entire cluster - it might 
> kill the processors one by one. Compared with that, I lean towards keeping 
> the processor running and serving other normal TCP connections if possible, 
> but logging the error so a monitoring system can detect it and decide whether 
> human intervention is needed.
>     
>     Also, I don't know how to prevent the thread from exiting here without 
> catching all the throwables.
>     According to this blog 
> http://www.tzavellas.com/techblog/2010/09/20/catching-throwable-in-scala/
>     I guess I can rethrow all the ControlThrowables but intercept the rest?
> 
> Guozhang Wang wrote:
>     I would also prefer not to close the Processor thread upon exceptions, 
> mainly to avoid one bad client killing a shared Kafka cluster. In the past 
> we have seen issues like a MetadataRequest DDoS killing the cluster and 
> affecting all other clients, and the quota work is aimed at preventing that. 
> Since Processor threads are shared (8 by default on a broker), a single 
> socket / bad client request should not be able to close one.
> 
> Gwen Shapira wrote:
>     I like your thinking around cases #1 and #2. I think this should go in a 
> code comment somewhere, so that when people improve / extend SocketServer 
> they keep this logic in mind. Maybe even indicate in specific catch clauses 
> whether they handle errors at the request level or the channel level.
>     
>     My concern is a possible case #3: each processor has an 
> o.a.k.common.network.Selector. I'm concerned about the possibility of 
> something going wrong in the state of the selector, which would likely be an 
> issue for all channels. For example, a failure to register could be an issue 
> with the channel.register call, but also perhaps an issue with keys.put (just 
> an example - I'm not sure anything can actually break the keys table). 
>     
>     I'd like to be able to identify cases where the Selector state may have 
> gone wrong and close the processor in those cases. Does that make any sense? 
> Or am I being too paranoid?
> 
> Jiangjie Qin wrote:
>     Hi Gwen, I think what you said makes sense. Maybe I see this more from a 
> failure-boundary point of view. 
>     
>     Actually we might need to do more if we let a processor exit. We need to 
> stop the acceptor from handing more channels to that processor, and 
> potentially stop the entire broker to prevent further issues. This might be a 
> larger failure-handling change from what I can see.
>     
>     If we limit the impact to a processor, then even if the processor has 
> become defunct, letting it keep running limits the impact to that processor 
> without affecting the other processors. The clients connected to this 
> processor will eventually disconnect and retry against another processor, so 
> the broker won't stop serving when one processor stops working.
>     
>     I completely agree we would prefer to stop the service if something 
> really bad and unknown occurred. But honestly I'm not sure how to enumerate 
> all the possible known throwables. In practice, I haven't seen any error of 
> case #3, so pragmatically I assume they are rare enough and prefer to just 
> catch all the exceptions.

As mentioned on the mailing list - I'm convinced that #3 is an edge case that 
is easy for humans to recognize but isn't trivial to handle programmatically at 
this time.

So let's go with your logic; just please put your explanation of the two cases 
we are handling (request issues and channel issues) in a code comment, and 
point out which catch clause covers each failure scenario.
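To make that comment concrete, one shape the annotated catch clauses could take is sketched below. This is a hypothetical illustration of the two-case policy from the discussion, not the actual SocketServer code; `ProcessorSketch`, `InvalidRequestException`, and `runOnce` are all made-up names.

```scala
import scala.util.control.{ControlThrowable, NonFatal}

// Hypothetical sketch of one iteration of a processor loop, mapping
// failures to the two handling cases discussed above.
object ProcessorSketch {
  final class InvalidRequestException(msg: String) extends RuntimeException(msg)

  sealed trait Outcome
  case object Ok             extends Outcome
  case object RequestDropped extends Outcome // case 2: bad request, client retries
  case object ChannelClosed  extends Outcome // case 1: bad channel, others unaffected

  def runOnce(poll: () => Unit): Outcome =
    try { poll(); Ok }
    catch {
      case c: ControlThrowable => throw c // flow control must propagate
      case _: InvalidRequestException =>
        // Case 2 (request-level): log and skip the bad request; assume the
        // client times out waiting for a response and retries.
        RequestDropped
      case NonFatal(_) =>
        // Case 1 (channel-level): disconnect only the offending channel; the
        // processor thread keeps running so other connections are unaffected.
        ChannelClosed
    }
}
```

The point of the per-clause comments is exactly what the review asks for: a future reader can see at a glance which failure boundary each clause enforces.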


- Gwen


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/36664/#review92496
-----------------------------------------------------------


On July 22, 2015, 5:02 a.m., Jiangjie Qin wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/36664/
> -----------------------------------------------------------
> 
> (Updated July 22, 2015, 5:02 a.m.)
> 
> 
> Review request for kafka.
> 
> 
> Bugs: KAFKA-2353
>     https://issues.apache.org/jira/browse/KAFKA-2353
> 
> 
> Repository: kafka
> 
> 
> Description
> -------
> 
> Addressed Gwen's comments
> 
> 
> Diffs
> -----
> 
>   core/src/main/scala/kafka/network/SocketServer.scala 
> 91319fa010b140cca632e5fa8050509bd2295fc9 
> 
> Diff: https://reviews.apache.org/r/36664/diff/
> 
> 
> Testing
> -------
> 
> 
> Thanks,
> 
> Jiangjie Qin
> 
>
