RaulGracia opened a new issue #2497:
URL: https://github.com/apache/bookkeeper/issues/2497


   **BUG REPORT**
   
   ***Describe the bug***
   
   We have been consistently observing (last months) that, under high load 
situations, Bookies (v4.9.2) may permanently crash. At first glance, the main 
reason for the problem seems that at some point, messages from Bookkeeper 
client (v4.9.2) could be malformed/corrupted, leading the Bookie to crash due 
to unknown op codes (`IllegalStateException`) and Netty frames being too long 
(`TooLongFrameException`).
   
   ***To Reproduce***
   
   Whenever we sustain a load close to the max Bookie capacity for a sufficient 
amount of time, we often find this problem. We have seen it in different 
environments and with diverse types of drives. Note that the problems is not 
deterministic, so it may take some time (minutes?) to appear (e.g., depends if 
the Bookie is also busy with other tasks, like garbage collection), but 
eventually it appears.
   
   ***Expected behavior***
   
   We expect 3 main behaviors from Bookkeeper:
   - If the problem is related to some kind of data malformation/corruption, 
then it needs to be fixed.
   - A back-pressure mechanism to prevent crashes and adjust the client's write 
throughput to what the server can absorb.
   - In case of a malformed message (i.e., `TooLongFrameException`), we would 
expect the server to just discard the message and continue working as usual.
   
   ***Screenshots***
   n/a
   
   ***Additional context***
   This problem normally is seen in our benchmark experiments when we induce 
loads to Bookkeeper close to its maximum capacity. In this scenario, instead of 
inducing enough back-pressure to the client, what we observe is a problem as 
follows:
   ```
   Nov 24 12:18:30 ip-10-0-0-119.us-west-2.compute.internal bookkeeper[17271]: 
2020-11-24 12:18:30,193 - ERROR - [bookie-io-1-2:BookieRequestHandler@70] - 
Unhandled exception occurred in I/O thread or handler
   Nov 24 12:18:30 ip-10-0-0-119.us-west-2.compute.internal bookkeeper[17271]: 
java.lang.IllegalStateException: Received unknown request op code = 84
   Nov 24 12:18:30 ip-10-0-0-119.us-west-2.compute.internal bookkeeper[17271]: 
at 
org.apache.bookkeeper.proto.BookieProtoEncoding$RequestEnDeCoderPreV3.decode(BookieProtoEncoding.java:194)
   Nov 24 12:18:30 ip-10-0-0-119.us-west-2.compute.internal bookkeeper[17271]: 
at 
org.apache.bookkeeper.proto.BookieProtoEncoding$RequestDecoder.channelRead(BookieProtoEncoding.java:436)
   ...
   Nov 24 12:18:30 ip-10-0-0-119.us-west-2.compute.internal bookkeeper[17271]: 
2020-11-24 12:18:30,196 - ERROR - [bookie-io-1-2:BookieRequestHandler@70] - 
Unhandled exception occurred in I/O thread or handler
   Nov 24 12:18:30 ip-10-0-0-119.us-west-2.compute.internal bookkeeper[17271]: 
io.netty.handler.codec.TooLongFrameException: Adjusted frame length exceeds 
5242880: 1556335546 - discarded
   Nov 24 12:18:30 ip-10-0-0-119.us-west-2.compute.internal bookkeeper[17271]: 
at 
io.netty.handler.codec.LengthFieldBasedFrameDecoder.fail(LengthFieldBasedFrameDecoder.java:522)
   Nov 24 12:18:30 ip-10-0-0-119.us-west-2.compute.internal bookkeeper[17271]: 
at 
io.netty.handler.codec.LengthFieldBasedFrameDecoder.failIfNecessary(LengthFieldBasedFrameDecoder.java:500)
   ```
   From this error, we extract the following observations:
   - The `TooLongFrameException` is indicating that we are sending a message 
that is too large compared to the `nettyMaxFrameSize` configured in the server. 
However, there is no way in which we are sending a message of 1.5GB in size. 
The events we are writing to the Bookkeeper client are strictly smaller than 
1MB.
   - The problem seems related to a corruption of the messaged already sent to 
the Bookkeeper client, given that we also see a problem like 
`java.lang.IllegalStateException: Received unknown request op code = 84`. This 
means that the header itself of the message has been corrupted, and I don't 
know how a client application could modify that.
   - Another aspect that points to a problem in the Bookkeeper client (or maybe 
its interactions with Netty) is that the 3 Bookies used in this experiment got 
the same exception at the same time, so in my view this discards that the 
problem is server related.
   
   I'm attaching the Bookkeeper logs of this reproduction of the problem.
   
[bookkeeper-10.0.0.119.log.gz](https://github.com/apache/bookkeeper/files/5595823/bookkeeper-10.0.0.119.log.gz)
   
[bookkeeper-10.0.0.240.log.gz](https://github.com/apache/bookkeeper/files/5595824/bookkeeper-10.0.0.240.log.gz)
   
[bookkeeper-10.0.0.251.log.gz](https://github.com/apache/bookkeeper/files/5595825/bookkeeper-10.0.0.251.log.gz)
   
   
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]


Reply via email to