Sent out a PR for the issues that I observed: https://github.com/apache/bookkeeper/pull/1240
On Thu, Mar 8, 2018 at 10:47 PM, Sijie Guo <[email protected]> wrote: > So the problem here is: > > - a corrupted request failed the V3 request decoder, so bookie switched to > use v2 request decoder. Once the switch happen, the bookie will always use > v2 request decoder decoding v3 request. then all your v3 requests will be > failing with unknown op and trigger the bytebuf double releasing issue. > > - a corrupted response failed the V3 response decoder, so client switched > to use v3 response decoder. Once the switch happen, the client will always > use v2 request decoder decoding v3 response. so all the responses will be > failing with " Received unknown response : op code" > > Although I don't know how the first request/response happened (it can be > any issue, even network corruption), the problem is once this happen, > either client/bookie will be forcing to use v2 request/response decoder and > never change. so the problem will remain until you restarted. And this is > the behavior that Enrico is seeing. > > There are a couple of issues to address here: > > 1) we need to add a flag to disable falling back to use v2 > request/response coder and make sure it always use v3 protocol. In this > way, we will guarantee the problem not persist in the channel level. > 2) we need to throw exception on unknown op code at request decode : > https://github.com/apache/bookkeeper/blob/master/ > bookkeeper-server/src/main/java/org/apache/bookkeeper/ > proto/BookieProtoEncoding.java#L195 . As what we did at response decoder > : https://github.com/apache/bookkeeper/blob/master/ > bookkeeper-server/src/main/java/org/apache/bookkeeper/ > proto/BookieProtoEncoding.java#L304 in https://github.com/apache/ > bookkeeper/issues/198 > > > Details are listed as below: > > -- > > Details: > > - The client side stacktrace clearly showed that it is using v2 decoder on > decoding responses. That means client failed to parse v3 response and it > falls back to use v2 decoder on decoding responses. Because it is a > "corrupted" v3 response, so v2 decoder can't > find a right op code. Then it throws illegal state exception. > > > *Caused by: java.lang.IllegalStateException: Received unknown response : > op code = 6 at > org.apache.bookkeeper.proto.BookieProtoEncoding$ResponseEnDeCoderPreV3.decode(BookieProtoEncoding.java:294)* > > - For the stacktrace at bookie side: > > 1. It is clear that BookieRequestHandler#L77 is called. That means the > message is neither v2 request nor v3 request. It is a ByteBuf. > > https://github.com/apache/bookkeeper/blob/master/ > bookkeeper-server/src/main/java/org/apache/bookkeeper/ > proto/BookieRequestHandler.java#L77 > > 2. V3 request decoder is using protobuf to decode bytes array. If it is > not a valid v3 request, it will always throw exceptions. so the code > mentioned above will never be triggered > > https://github.com/apache/bookkeeper/blob/master/ > bookkeeper-server/src/main/java/org/apache/bookkeeper/ > proto/BookieProtoEncoding.java#L344 > > 3. The only case that BookieRequestHandler#L77 can happen is v2 request > decoder doesn't parse the buffer. This leads to the bug in > > https://github.com/apache/bookkeeper/blob/master/ > bookkeeper-server/src/main/java/org/apache/bookkeeper/ > proto/BookieProtoEncoding.java#L194 > > > - How this can happen? > > If the client is using v3 protocol, but both client and bookie side > stacktraces show it is using v2 protocol, that means there was a corruption > causing client and bookie switching from v3 protocol to v2 protocol. > > https://github.com/apache/bookkeeper/blob/master/ > bookkeeper-server/src/main/java/org/apache/bookkeeper/ > proto/BookieProtoEncoding.java#L502 > https://github.com/apache/bookkeeper/blob/master/ > bookkeeper-server/src/main/java/org/apache/bookkeeper/ > proto/BookieProtoEncoding.java#L432 > > > > > > On Thu, Mar 8, 2018 at 12:18 AM, Enrico Olivelli <[email protected]> > wrote: > >> Hi all, >> I am seeing this bad errors in some test environments with 4.6.1 >> The errors appear during rolling restart of the application, the test env >> is made of 6 machines: >> - 3 bookies >> - 3 client machines (with several BK clients, of different sub-systems) >> - running with 4.6.1 both client and servers >> >> *I do not have* reports of this errors from production, already running >> 4.6.1 for the last month >> >> But the problem is quite scary >> >> This is a sample of relevant errors on clients (in this case Majordodo >> brokers, with log level = INFO) >> >> logs/org.apache.bookkeeper.client.PendingAddOp: >> 18-03-07-09-51-43 Write of ledger entry to quorum failed: L366634 >> E2557 >> 18-03-07-09-51-43 Write of ledger entry to quorum failed: L366634 >> E2558 >> 18-03-07-09-51-43 Write of ledger entry to quorum failed: L366634 >> E2559 >> 18-03-07-15-59-55 Failed to write entry (366680, 1865): Bookie >> operation timeout >> 18-03-07-15-59-55 Failed to write entry (366680, 1865): Bookie >> operation timeout >> 18-03-07-16-00-00 Failed to write entry (366680, 1865): Bookie >> operation timeout >> 18-03-07-16-00-00 Failed to write entry (366680, 1865): Bookie >> operation timeout >> 18-03-07-16-00-05 Failed to write entry (366680, 1865): Bookie >> operation timeout >> >> >> org.apache.bookkeeper.proto.PerChannelBookieClient: >> >> 18-03-07-10-06-44 Unexpected exception caught by bookie client >> channel handler >> 18-03-07-10-06-44 io.netty.handler.codec.DecoderException: >> java.lang.IllegalStateException: Received unknown response : op code = 6 >> io.netty.handler.codec.DecoderException: java.lang.IllegalStateException: >> Received unknown response : op code = 6 >> at io.netty.handler.codec.MessageToMessageDecoder.channelRead(M >> essageToMessageDecoder.java:98) >> at io.netty.channel.AbstractChannelHandlerContext.invokeChannel >> Read(AbstractChannelHandlerContext.java:362) >> at io.netty.channel.AbstractChannelHandlerContext.invokeChannel >> Read(AbstractChannelHandlerContext.java:348) >> at io.netty.channel.AbstractChannelHandlerContext.fireChannelRe >> ad(AbstractChannelHandlerContext.java:340) >> at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead( >> ByteToMessageDecoder.java:310) >> at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead( >> ByteToMessageDecoder.java:297) >> at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteT >> oMessageDecoder.java:413) >> at io.netty.handler.codec.ByteToMessageDecoder.channelRead(Byte >> ToMessageDecoder.java:265) >> at io.netty.channel.AbstractChannelHandlerContext.invokeChannel >> Read(AbstractChannelHandlerContext.java:362) >> at io.netty.channel.AbstractChannelHandlerContext.invokeChannel >> Read(AbstractChannelHandlerContext.java:348) >> at io.netty.channel.AbstractChannelHandlerContext.fireChannelRe >> ad(AbstractChannelHandlerContext.java:340) >> at io.netty.channel.DefaultChannelPipeline$HeadContext. >> channelRead(DefaultChannelPipeline.java:1414) >> at io.netty.channel.AbstractChannelHandlerContext.invokeChannel >> Read(AbstractChannelHandlerContext.java:362) >> at io.netty.channel.AbstractChannelHandlerContext.invokeChannel >> Read(AbstractChannelHandlerContext.java:348) >> at io.netty.channel.DefaultChannelPipeline.fireChannelRead(Defa >> ultChannelPipeline.java:945) >> at io.netty.channel.epoll.AbstractEpollStreamChannel$EpollStrea >> mUnsafe.epollInReady(AbstractEpollStreamChannel.java:800) >> at io.netty.channel.epoll.EpollEventLoop.processReady(EpollEven >> tLoop.java:404) >> at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop. >> java:304) >> at io.netty.util.concurrent.SingleThreadEventExecutor$5.run( >> SingleThreadEventExecutor.java:886) >> at java.base/java.lang.Thread.run(Thread.java:844) >> > > > > > >> Caused by: java.lang.IllegalStateException: Received unknown response : >> op code = 6 >> at org.apache.bookkeeper.proto.BookieProtoEncoding$ResponseEnDe >> CoderPreV3.decode(BookieProtoEncoding.java:294) >> at org.apache.bookkeeper.proto.BookieProtoEncoding$ResponseDeco >> der.decode(BookieProtoEncoding.java:478) >> at io.netty.handler.codec.MessageToMessageDecoder.channelRead(M >> essageToMessageDecoder.java:88) >> >> >> ... 19 more >> > > > > > > > >> >> >> Errors on Bookie side, with log level = INFO) >> >> >> io.netty.channel.DefaultChannelPipeline: >> 8-03-07-09-51-38 An exceptionCaught() event was fired, and it >> reached at the tail of the pipeline. It usually means the last handler in >> the pipeline did not handle the exception. >> 18-03-07-09-51-38 io.netty.util.IllegalReferenceCountException: >> refCnt: 0, increment: 1 >> io.netty.util.IllegalReferenceCountException: refCnt: 0, increment: 1 >> at io.netty.buffer.AbstractReferenceCountedByteBuf.release0(Abs >> tractReferenceCountedByteBuf.java:100) >> at io.netty.buffer.AbstractReferenceCountedByteBuf.release(Abst >> ractReferenceCountedByteBuf.java:84) >> at io.netty.util.ReferenceCountUtil.release(ReferenceCountUtil. >> java:88) >> at io.netty.channel.DefaultChannelPipeline.onUnhandledInboundMe >> ssage(DefaultChannelPipeline.java:1202) >> at io.netty.channel.DefaultChannelPipeline$TailContext. >> channelRead(DefaultChannelPipeline.java:1299) >> at io.netty.channel.AbstractChannelHandlerContext.invokeChannel >> Read(AbstractChannelHandlerContext.java:362) >> at io.netty.channel.AbstractChannelHandlerContext.invokeChannel >> Read(AbstractChannelHandlerContext.java:348) >> at io.netty.channel.AbstractChannelHandlerContext.fireChannelRe >> ad(AbstractChannelHandlerContext.java:340) >> at io.netty.channel.ChannelInboundHandlerAdapter.channelRead(Ch >> annelInboundHandlerAdapter.java:86) >> at io.netty.channel.AbstractChannelHandlerContext.invokeChannel >> Read(AbstractChannelHandlerContext.java:362) >> at io.netty.channel.AbstractChannelHandlerContext.invokeChannel >> Read(AbstractChannelHandlerContext.java:348) >> at io.netty.channel.AbstractChannelHandlerContext.fireChannelRe >> ad(AbstractChannelHandlerContext.java:340) >> at org.apache.bookkeeper.proto.BookieRequestHandler.channelRead >> (BookieRequestHandler.java:78) >> at io.netty.channel.AbstractChannelHandlerContext.invokeChannel >> Read(AbstractChannelHandlerContext.java:362) >> at io.netty.channel.AbstractChannelHandlerContext.invokeChannel >> Read(AbstractChannelHandlerContext.java:348) >> at io.netty.channel.AbstractChannelHandlerContext.fireChannelRe >> ad(AbstractChannelHandlerContext.java:340) >> at io.netty.channel.ChannelInboundHandlerAdapter.channelRead(Ch >> annelInboundHandlerAdapter.java:86) >> at org.apache.bookkeeper.proto.AuthHandler$ServerSideHandler.ch >> annelRead(AuthHandler.java:88) >> at io.netty.channel.AbstractChannelHandlerContext.invokeChannel >> Read(AbstractChannelHandlerContext.java:362) >> at io.netty.channel.AbstractChannelHandlerContext.invokeChannel >> Read(AbstractChannelHandlerContext.java:348) >> at io.netty.channel.AbstractChannelHandlerContext.fireChannelRe >> ad(AbstractChannelHandlerContext.java:340) >> at io.netty.handler.codec.MessageToMessageDecoder.channelRead(M >> essageToMessageDecoder.java:102) >> at io.netty.channel.AbstractChannelHandlerContext.invokeChannel >> Read(AbstractChannelHandlerContext.java:362) >> at io.netty.channel.AbstractChannelHandlerContext.invokeChannel >> Read(AbstractChannelHandlerContext.java:348) >> at io.netty.channel.AbstractChannelHandlerContext.fireChannelRe >> ad(AbstractChannelHandlerContext.java:340) >> at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead( >> ByteToMessageDecoder.java:310) >> at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead( >> ByteToMessageDecoder.java:297) >> at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteT >> oMessageDecoder.java:413) >> at io.netty.handler.codec.ByteToMessageDecoder.channelRead(Byte >> ToMessageDecoder.java:265) >> at io.netty.channel.AbstractChannelHandlerContext.invokeChannel >> Read(AbstractChannelHandlerContext.java:362) >> at io.netty.channel.AbstractChannelHandlerContext.invokeChannel >> Read(AbstractChannelHandlerContext.java:348) >> at io.netty.channel.AbstractChannelHandlerContext.fireChannelRe >> ad(AbstractChannelHandlerContext.java:340) >> at io.netty.channel.DefaultChannelPipeline$HeadContext. >> channelRead(DefaultChannelPipeline.java:1414) >> at io.netty.channel.AbstractChannelHandlerContext.invokeChannel >> Read(AbstractChannelHandlerContext.java:362) >> at io.netty.channel.AbstractChannelHandlerContext.invokeChannel >> Read(AbstractChannelHandlerContext.java:348) >> at io.netty.channel.DefaultChannelPipeline.fireChannelRead(Defa >> ultChannelPipeline.java:945) >> at io.netty.channel.epoll.AbstractEpollStreamChannel$EpollStrea >> mUnsafe.epollInReady(AbstractEpollStreamChannel.java:800) >> at io.netty.channel.epoll.EpollEventLoop.processReady(EpollEven >> tLoop.java:404) >> at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop. >> java:304) >> at io.netty.util.concurrent.SingleThreadEventExecutor$5.run( >> SingleThreadEventExecutor.java:886) >> at java.base/java.lang.Thread.run(Thread.java:844) >> >> >> >
