I will send a report soon. With new debug I have some finding, I am looking into problems during restarts of bookies. Maybe there is some problem in error handling in PCBC.
Thank you Enrico 2018-03-12 10:58 GMT+01:00 Ivan Kelly <iv...@apache.org>: > Enrico, could you summarize what the state of things is now? What are > you running, what problems are you seeing and how are the problems > manifesting themselves. > > Regards, > Ivan > > On Mon, Mar 12, 2018 at 10:15 AM, Enrico Olivelli <eolive...@gmail.com> > wrote: > > Applyed Sijie's fixes and added some debug: > > > > Problem is triggered when you restart a bookie (I have a cluster of 3 > > bookies, WQ = 2 and AQ = 2) > > > > Below a new error on client side ("tailing" reader) > > > > Enrico > > > > this is a new error on client side: > > 18-03-12-09-11-45 Unexpected exception caught by bookie client channel > > handler > > 18-03-12-09-11-45 io.netty.handler.codec.DecoderException: > > com.google.protobuf.InvalidProtocolBufferException: Protocol message > > contained an invalid tag (zero). > > io.netty.handler.codec.DecoderException: > > com.google.protobuf.InvalidProtocolBufferException: Protocol message > > contained an invalid tag (zero). > > at > > io.netty.handler.codec.MessageToMessageDecoder.channelRead( > MessageToMessageDecoder.java:98) > > at > > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead( > AbstractChannelHandlerContext.java:362) > > at > > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead( > AbstractChannelHandlerContext.java:348) > > at > > io.netty.channel.AbstractChannelHandlerContext.fireChannelRead( > AbstractChannelHandlerContext.java:340) > > at > > io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead( > ByteToMessageDecoder.java:310) > > at > > io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead( > ByteToMessageDecoder.java:297) > > at > > io.netty.handler.codec.ByteToMessageDecoder.callDecode( > ByteToMessageDecoder.java:413) > > at > > io.netty.handler.codec.ByteToMessageDecoder.channelRead( > ByteToMessageDecoder.java:265) > > at > > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead( > AbstractChannelHandlerContext.java:362) > > at > > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead( > AbstractChannelHandlerContext.java:348) > > at > > io.netty.channel.AbstractChannelHandlerContext.fireChannelRead( > AbstractChannelHandlerContext.java:340) > > at > > io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead( > DefaultChannelPipeline.java:1334) > > at > > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead( > AbstractChannelHandlerContext.java:362) > > at > > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead( > AbstractChannelHandlerContext.java:348) > > at > > io.netty.channel.DefaultChannelPipeline.fireChannelRead( > DefaultChannelPipeline.java:926) > > at > > io.netty.channel.epoll.AbstractEpollStreamChannel$EpollStreamUnsafe. > epollInReady(AbstractEpollStreamChannel.java:979) > > at > > io.netty.channel.epoll.EpollEventLoop.processReady( > EpollEventLoop.java:401) > > at io.netty.channel.epoll.EpollEventLoop.run( > EpollEventLoop.java:306) > > at > > io.netty.util.concurrent.SingleThreadEventExecutor$5. > run(SingleThreadEventExecutor.java:858) > > at java.base/java.lang.Thread.run(Thread.java:844) > > Caused by: com.google.protobuf.InvalidProtocolBufferException: Protocol > > message contained an invalid tag (zero). > > at > > com.google.protobuf.InvalidProtocolBufferException.invalidTag( > InvalidProtocolBufferException.java:105) > > at > > com.google.protobuf.CodedInputStream$StreamDecoder.readTag( > CodedInputStream.java:2060) > > at > > org.apache.bookkeeper.proto.BookkeeperProtocol$ReadResponse.<init>( > BookkeeperProtocol.java:10009) > > at > > org.apache.bookkeeper.proto.BookkeeperProtocol$ReadResponse.<init>( > BookkeeperProtocol.java:9975) > > at > > org.apache.bookkeeper.proto.BookkeeperProtocol$ReadResponse$1. > parsePartialFrom(BookkeeperProtocol.java:10869) > > at > > org.apache.bookkeeper.proto.BookkeeperProtocol$ReadResponse$1. > parsePartialFrom(BookkeeperProtocol.java:10864) > > at > > com.google.protobuf.CodedInputStream$StreamDecoder.readMessage( > CodedInputStream.java:2362) > > at > > org.apache.bookkeeper.proto.BookkeeperProtocol$Response.< > init>(BookkeeperProtocol.java:7852) > > at > > org.apache.bookkeeper.proto.BookkeeperProtocol$Response.< > init>(BookkeeperProtocol.java:7782) > > at > > org.apache.bookkeeper.proto.BookkeeperProtocol$Response$1. > parsePartialFrom(BookkeeperProtocol.java:9887) > > at > > org.apache.bookkeeper.proto.BookkeeperProtocol$Response$1. > parsePartialFrom(BookkeeperProtocol.java:9882) > > at > > com.google.protobuf.AbstractParser.parsePartialFrom( > AbstractParser.java:221) > > at com.google.protobuf.AbstractParser.parseFrom( > AbstractParser.java:239) > > at com.google.protobuf.AbstractParser.parseFrom( > AbstractParser.java:49) > > at > > com.google.protobuf.GeneratedMessageV3.parseWithIOException( > GeneratedMessageV3.java:320) > > at > > org.apache.bookkeeper.proto.BookkeeperProtocol$Response. > parseFrom(BookkeeperProtocol.java:8454) > > at > > org.apache.bookkeeper.proto.BookieProtoEncoding$ > ResponseEnDecoderV3.decode(BookieProtoEncoding.java:329) > > at > > org.apache.bookkeeper.proto.BookieProtoEncoding$ResponseDecoder.decode( > BookieProtoEncoding.java:470) > > at > > io.netty.handler.codec.MessageToMessageDecoder.channelRead( > MessageToMessageDecoder.java:88) > > ... 19 more > > > > > > 2018-03-09 22:37 GMT+01:00 Sijie Guo <guosi...@gmail.com>: > > > >> Enrico, > >> > >> I would suggest you applied my fixes and then debug from there. In this > >> way, you will have a better sense where the first corruption is from. > >> > >> Sijie > >> > >> On Fri, Mar 9, 2018 at 11:48 AM Enrico Olivelli <eolive...@gmail.com> > >> wrote: > >> > >> > Il ven 9 mar 2018, 19:30 Enrico Olivelli <eolive...@gmail.com> ha > >> scritto: > >> > > >> > > Thank you Ivan! > >> > > I hope I did not mess up the dump and added ZK ports. We are not > using > >> > > standard ports and in that 3 machines there is also the 3 nodes zk > >> > > ensemble which is supporting BK and all the other parts of the > >> > application > >> > > > >> > > So one explanation would be that something is connecting to the > bookie > >> > and > >> > > this makes the bookie switch in a corrupted state by double > releasing a > >> > > bytebuf? > >> > > > >> > > >> > I did some experiments and it is easy to reproduce the bookie side > error > >> > and the double release with a forged sequence of bytes (just using nc > >> from > >> > the shell) > >> > But this seems not enough to break the bookie. > >> > I guess there is some corruption on client side and the error on the > >> bookie > >> > is only and effect, as Ivan is saying. > >> > My colleagues left the system running with a deep level of debug > during > >> > next weekend, hopefully we will get some other stacktrace > >> > > >> > Enrico > >> > > >> > > >> > > Enrico > >> > > > >> > > > >> > > Il ven 9 mar 2018, 18:23 Ivan Kelly <iv...@apache.org> ha scritto: > >> > > > >> > >> I need to sign off for the day. I've done some analysis of a > tcpdump > >> > >> enrico sent to me out of band (may contain sensitive info so best > not > >> > >> to post on public forum). > >> > >> > >> > >> I've attached a dump of just first bit of the header. Format is > >> > >> <sequence in dump> <whether a request or response>(<remote port>) > >> > >> <hexdump of payload> > >> > >> > >> > >> There are definitely corrupt packets coming from somewhere. Search > for > >> > >> lines with CORRUPT. > >> > >> > >> > >> <snip> > >> > >> 0247 - req (049546) - 00:00:00:08:ff:ff:ff:fe:00:00:00:0b > CORRUPT > >> > >> </snip> > >> > >> > >> > >> It's not clear whether these are originating at a valid client or > not. > >> > >> These trigger corrupt responses from the server, which I guess is > the > >> > >> double free manifesting itself. Strangely the > >> > >> corrupt message seems to have a lot of data in common with what > seems > >> > >> like an ok message (it's clearer on fixed width font). > >> > >> > >> > >> <snip> > >> > >> 0248 - resp(049720) - > >> > >> > >> > >> > >> > 00:00:00:54:00:03:00:89:00:00:02:86:00:07:e2:b1:00:00:00:00: > >> 00:00:02:86:00:05:e9:76:00:00 > >> > >> 0249 - resp(049546) - > >> > >> 00:00:00:10:ff:ff:ff:fe:00:00:02:86:00:07:e2:b1:00:00:00:00 > >> CORRUPT > >> > >> </snip> > >> > >> > >> > >> There's also some other weird traffic. Correct BK protobuf traffic > >> > >> should be <4 bytes len>:00:03:.... > >> > >> There seems to be other traffic which is being accepted at the same > >> > >> port, but looks like ZK traffic. > >> > >> > >> > >> Anyhow, I'll dig more on monday. > >> > >> > >> > >> -Ivan > >> > >> > >> > >> > >> > >> On Fri, Mar 9, 2018 at 3:27 PM, Ivan Kelly <iv...@apache.org> > wrote: > >> > >> > On Fri, Mar 9, 2018 at 3:20 PM, Enrico Olivelli < > >> eolive...@gmail.com> > >> > >> wrote: > >> > >> >> Bookies > >> > >> >> 10.168.10.117:1822 -> bad bookie with 4.1.21 > >> > >> >> 10.168.10.116:1822 -> bookie with 4.1.12 > >> > >> >> 10.168.10.118:1281 -> bookie with 4.1.12 > >> > >> >> > >> > >> >> 10.168.10.117 client machine on which I have 4.1.21 client > >> (different > >> > >> >> process than the bookie one) > >> > >> > Oh. This dump won't have the stream we need then, as that will > be on > >> > >> > loopback. Try adding "-i any" to the tcpdump. Sorry, I didn't > >> realize > >> > >> > your clients and servers are colocated. > >> > >> > > >> > >> > -Ivan > >> > >> > >> > > -- > >> > > > >> > > > >> > > -- Enrico Olivelli > >> > > > >> > -- > >> > > >> > > >> > -- Enrico Olivelli > >> > > >> >