horizonzy opened a new issue, #3527: URL: https://github.com/apache/bookkeeper/issues/3527
**BUG REPORT** Recently, in our customer product, we found that the bookie direct memory increases slowly over time, finally it throws netty allocate direct memory failed exception, then bookie restart. **Dump info** So we dump the heap and found that there are some unreachable `ReadResponse`, and the data property `ByteBuf` refCnt is 2, and the `ByteBuf` is also unreachable, so there is no chance to release the `ByteBuf`, it will be deleted after JVM GC. <img width="1563" alt="image" src="https://user-images.githubusercontent.com/22524871/195585954-608d8da5-de32-480e-91a5-d5b535eac6de.png"> See the picture, we can think the memory leak already happen. **Log info** In the bookie log, we found there are many disconnect info like: ``` "Oct 12, 2022 @ 07:47:28.673","","2022-10-12T07:47:28,672+0000 [bookie-io-1-6] INFO org.apache.bookkeeper.proto.BookieRequestHandler - Channels disconnected: [id: 0x111df957, L:/xx.xx.xx.xx:3181 ! R:/xx.xx.xx.xx:44396]" ``` **The Root Case** The memory leak occurred in an edge scene. In the normal case, we get the data from BookieImpl and wrap it in ReadResponse, and then write the [ReadResponse](https://github.com/apache/bookkeeper/blob/62155369d199f36720938e913541b223dab8b047/bookkeeper-server/src/main/java/org/apache/bookkeeper/proto/BookieProtocol.java#L448) to the netty channel, there is our netty handler to handle the ReadResponse. * [ResponseEncoder](https://github.com/apache/bookkeeper/blob/62155369d199f36720938e913541b223dab8b047/bookkeeper-server/src/main/java/org/apache/bookkeeper/proto/BookieProtoEncoding.java#L458) It transfer the ReadResponse to ByteBufList, see [encode](https://github.com/apache/bookkeeper/blob/62155369d199f36720938e913541b223dab8b047/bookkeeper-server/src/main/java/org/apache/bookkeeper/proto/BookieProtoEncoding.java#L251) The [ByteBufList](https://github.com/apache/bookkeeper/blob/62155369d199f36720938e913541b223dab8b047/bookkeeper-server/src/main/java/org/apache/bookkeeper/util/ByteBufList.java#L63) is implemented interce `ReferenceCounted`, so after netty flush it, netty will release it by invoke `io.netty.util.ReferenceCountUtil#release(java.lang.Object)` ``` public static boolean release(Object msg, int decrement) { ObjectUtil.checkPositive(decrement, "decrement"); if (msg instanceof ReferenceCounted) { return ((ReferenceCounted) msg).release(decrement); } return false; } ``` In the `io.netty.util.ReferenceCountUtil#release(java.lang.Object)`, it will check if the msg is implemented ReferenceCounted. If not, do nothing. There is an important netty mechanism you should know, after a connection disconnect, there are two things that happen. 1. all the handlers will be removed from the ChannelPipeline after channelUnregistered. https://github.com/netty/netty/blob/7971075bbe9f5509c8b20c0e702ec2affb37d76e/transport/src/main/java/io/netty/channel/DefaultChannelPipeline.java#L1387-L1394 At line_1392, it will remove all the handlers, only leaving two netty default contextHandler in the ChannelPipeline, [HeadContext](https://github.com/netty/netty/blob/7971075bbe9f5509c8b20c0e702ec2affb37d76e/transport/src/main/java/io/netty/channel/DefaultChannelPipeline.java#L1305) and [TailContext](https://github.com/netty/netty/blob/7971075bbe9f5509c8b20c0e702ec2affb37d76e/transport/src/main/java/io/netty/channel/DefaultChannelPipeline.java#L1245) 2. Make AbstractChannel.AbstractUnsafe#outboundBuffer = null. https://github.com/netty/netty/blob/7971075bbe9f5509c8b20c0e702ec2affb37d76e/transport/src/main/java/io/netty/channel/AbstractChannel.java#L703 So we know, after the connection disconnect, the channel only left HeadContext and TailContext, and the AbstractChannel.AbstractUnsafe#outboundBuffer is null. If the bookie still write ReadResponse to netty channel, the `ResponseEncoder` didn't work(already removed), the HeadContext will write ReadResponse to AbstractUnsafe#outboundBuffer directly. https://github.com/netty/netty/blob/7971075bbe9f5509c8b20c0e702ec2affb37d76e/transport/src/main/java/io/netty/channel/AbstractChannel.java#L847-L864 see line_851, the outboundBuffer is null, it will release ReadResponse, but the ReadResponse didn't implement ReferenceCounted, so it didn't release the ReadResponse.data. **Memory leak**!!! ***To Reproduce*** Steps to reproduce the behavior: 1. Start bookies. 2. Start a bookkeeper client using V2 protocol. 3. use the bookkeeper client to read entries. 4. In the reading process, kill -9 bookkeeper client programmer. 5. Bookie server will memory leak. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
