[ https://issues.apache.org/jira/browse/ARTEMIS-2696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17133075#comment-17133075 ]

Bas commented on ARTEMIS-2696:
------------------------------

Had some success in avoiding the crashes and the Netty leak reports. Started a server 
with a patched version, which has now run stably for 20 hours. This is the longest we 
have seen it run so far. Will check during the day whether it stays that way.

The fix adds a ByteBuf release to ProtonHandler, the class mentioned in the Netty 
LEAK report.

I don't understand why this suddenly became an issue going from 2.10.1 to 2.11/2.13, 
since I see no changes directly in that class; it could be due to changes in 
AMQPConnectionContext. Currently running with this patch:

 

diff --git a/artemis-protocols/artemis-amqp-protocol/src/main/java/org/apache/activemq/artemis/protocol/amqp/proton/handler/ProtonHandler.java b/artemis-protocols/artemis-amqp-protocol/src/main/java/org/apache/activemq/artemis/protocol/amqp/proton/handler/ProtonHandler.java
index 8bfc3ea..d14e15f 100644
--- a/artemis-protocols/artemis-amqp-protocol/src/main/java/org/apache/activemq/artemis/protocol/amqp/proton/handler/ProtonHandler.java
+++ b/artemis-protocols/artemis-amqp-protocol/src/main/java/org/apache/activemq/artemis/protocol/amqp/proton/handler/ProtonHandler.java
@@ -276,12 +276,17 @@
                break;
             }
 
-            // We allocated a Pooled Direct Buffer, that will be sent down the stream
             ByteBuf buffer = PooledByteBufAllocator.DEFAULT.directBuffer(pending);
-            buffer.writeBytes(head);
+            try {
+               // We allocated a Pooled Direct Buffer, that will be sent down the stream
+               buffer.writeBytes(head);
 
-            for (EventHandler handler : handlers) {
-               handler.pushBytes(buffer);
+               for (EventHandler handler : handlers) {
+                  handler.pushBytes(buffer);
+               }
+            } finally {
+               // We need to release the buffer when it has been sent downstream
+               buffer.release();
             }
 
             transport.pop(pending);
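
As a side note on why the release sits in a finally block: pooled direct buffers in Netty are 
reference counted, and whoever ends up owning the buffer has to bring the count back to zero or 
the memory stays reserved. A minimal standalone sketch of that pattern (illustrative only, not 
Artemis code; assumes netty-buffer on the classpath):

import io.netty.buffer.ByteBuf;
import io.netty.buffer.PooledByteBufAllocator;

public class PooledBufferSketch {
   public static void main(String[] args) {
      // Allocate a pooled direct buffer; it counts against Netty's direct memory limit until released.
      ByteBuf buffer = PooledByteBufAllocator.DEFAULT.directBuffer(64);
      try {
         buffer.writeBytes(new byte[]{1, 2, 3});
         consume(buffer);   // a consumer that keeps the buffer beyond this call must retain() it
      } finally {
         buffer.release();  // refCnt() drops to 0 and the memory returns to the pool; skipping this is the leak
      }
   }

   private static void consume(ByteBuf buffer) {
      // a consumer that only reads within this call does not touch the reference count
      System.out.println("readable bytes: " + buffer.readableBytes());
   }
}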

 

> Netty DirectBufferLeak noticed in the log and server not processing messages 
> with out of direct memory errors after running for a day
> -------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: ARTEMIS-2696
>                 URL: https://issues.apache.org/jira/browse/ARTEMIS-2696
>             Project: ActiveMQ Artemis
>          Issue Type: Bug
>          Components: Broker
>    Affects Versions: 2.11.0, 2.13.0
>         Environment: Ubuntu 18.04.1 LTS
> Netty 4.1.45
> Qpid 0.48
>  
>  
>            Reporter: Bas
>            Priority: Major
>
> Our servers started to crash daily once we were on the 2.11.0 version.
> The crashes did not kill the entire Artemis server, but caused endless log 
> statements of the following:
> Transport failed: io.netty.util.internal.OutOfDirectMemoryError: failed to 
> allocate 16777216 byte(s) of direct memory (used: 637534215, max: 652738560)
> We launched an investigation and activated Netty's advanced leak detection: 
> -Dio.netty.leakDetectionLevel=advanced
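> 
> As a side note, the same level could in principle also be forced programmatically; a minimal 
> sketch (illustrative only, we simply used the system property above):
> 
>     // equivalent of -Dio.netty.leakDetectionLevel=advanced, ideally set early during startup
>     io.netty.util.ResourceLeakDetector.setLevel(io.netty.util.ResourceLeakDetector.Level.ADVANCED);
> 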
> With this enabled, we indeed noticed leak detection log entries with the 
> following information:
> #1:
> io.netty.buffer.AdvancedLeakAwareByteBuf.readBytes(AdvancedLeakAwareByteBuf.java:484)
> org.apache.activemq.artemis.core.server.protocol.websocket.WebSocketFrameEncoder.writeContinuationFrame(WebSocketFrameEncoder.java:56)
> org.apache.activemq.artemis.core.server.protocol.websocket.WebSocketFrameEncoder.write(WebSocketFrameEncoder.java:45)
> io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(AbstractChannelHandlerContext.java:715)
> io.netty.channel.AbstractChannelHandlerContext.invokeWriteAndFlush(AbstractChannelHandlerContext.java:762)
> io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:788)
> io.netty.channel.AbstractChannelHandlerContext.writeAndFlush(AbstractChannelHandlerContext.java:756)
> io.netty.channel.DefaultChannelPipeline.writeAndFlush(DefaultChannelPipeline.java:1020)
> io.netty.channel.AbstractChannel.writeAndFlush(AbstractChannel.java:299)
> org.apache.activemq.artemis.core.remoting.impl.netty.NettyConnection.write(NettyConnection.java:382)
> org.apache.activemq.artemis.core.remoting.impl.netty.NettyConnection.write(NettyConnection.java:286)
> org.apache.activemq.artemis.core.remoting.impl.netty.NettyConnection.write(NettyConnection.java:281)
> org.apache.activemq.artemis.protocol.amqp.broker.AMQPConnectionCallback.onTransport(AMQPConnectionCallback.java:203)
> org.apache.activemq.artemis.protocol.amqp.proton.AMQPConnectionContext.pushBytes(AMQPConnectionContext.java:384)
> org.apache.activemq.artemis.protocol.amqp.proton.handler.ProtonHandler.actualFlush(ProtonHandler.java:210)
> io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164)
> io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472)
> io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:500)
> io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
> io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
> org.apache.activemq.artemis.utils.ActiveMQThreadFactory$1.run(ActiveMQThreadFactory.java:118)
> 
> #2: Hint: 'websocket-frame-encoder' will handle the message from this point.
> io.netty.channel.DefaultChannelPipeline.touch(DefaultChannelPipeline.java:116)
> io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:784)
> io.netty.channel.AbstractChannelHandlerContext.writeAndFlush(AbstractChannelHandlerContext.java:756)
> io.netty.channel.DefaultChannelPipeline.writeAndFlush(DefaultChannelPipeline.java:1020)
> io.netty.channel.AbstractChannel.writeAndFlush(AbstractChannel.java:299)
> org.apache.activemq.artemis.core.remoting.impl.netty.NettyConnection.write(NettyConnection.java:382)
> org.apache.activemq.artemis.core.remoting.impl.netty.NettyConnection.write(NettyConnection.java:286)
> org.apache.activemq.artemis.core.remoting.impl.netty.NettyConnection.write(NettyConnection.java:281)
> org.apache.activemq.artemis.protocol.amqp.broker.AMQPConnectionCallback.onTransport(AMQPConnectionCallback.java:203)
> org.apache.activemq.artemis.protocol.amqp.proton.AMQPConnectionContext.pushBytes(AMQPConnectionContext.java:384)
> org.apache.activemq.artemis.protocol.amqp.proton.handler.ProtonHandler.actualFlush(ProtonHandler.java:210)
> io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164)
> io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472)
> io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:500)
> io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
> io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
> org.apache.activemq.artemis.utils.ActiveMQThreadFactory$1.run(ActiveMQThreadFactory.java:118)
> 
> #3:
> io.netty.buffer.AdvancedLeakAwareByteBuf.writeBytes(AdvancedLeakAwareByteBuf.java:622)
> org.apache.activemq.artemis.protocol.amqp.proton.handler.ProtonHandler.actualFlush(ProtonHandler.java:207)
> io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164)
> io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472)
> io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:500)
> io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
> io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
> org.apache.activemq.artemis.utils.ActiveMQThreadFactory$1.run(ActiveMQThreadFactory.java:118)
> 
> Created at:
> io.netty.buffer.PooledByteBufAllocator.newDirectBuffer(PooledByteBufAllocator.java:349)
> io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:187)
> io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:178)
> org.apache.activemq.artemis.protocol.amqp.proton.handler.ProtonHandler.actualFlush(ProtonHandler.java:206)
> io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164)
> io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472)
> io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:500)
> io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
> io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
> org.apache.activemq.artemis.utils.ActiveMQThreadFactory$1.run(ActiveMQThreadFactory.java:118)
>  
> We looked into the source code of some classes in the above stack traces for 
> changes where a directBuffer.release() might be missing, but did not find any. 
> It is also tough to grasp what is happening in these classes, because I would 
> really need to study the underlying concepts to understand what is happening 
> and what needs to happen, which will take a lot of time. But something did 
> catch my attention, although I might be totally wrong in estimating that it is 
> related to the issue.
>  
> The only thing I have on this is that the following class has a direct 
> relation to a class in the leak report from Netty: 
> [https://github.com/apache/activemq-artemis/blob/master/artemis-server/src/main/java/org/apache/activemq/artemis/core/server/protocol/websocket/WebSocketFrameEncoder.java]
> This class was added in the following commit and does something with a 
> websocket continuation frame (a rough sketch of that kind of handler follows 
> below):
> [https://github.com/apache/activemq-artemis/commit/9fac4b866cf9cefbb6f7c13b820e07455b6649f5]
>  
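> To make explicit the ownership rule this hinges on: in Netty, a handler that consumes a ByteBuf 
> and writes derived messages further down the pipeline is responsible for releasing the original, 
> while buffers handed to ctx.write() are released by the pipeline once they have been written. A 
> rough, hypothetical sketch of that kind of fragmenting encoder (illustrative only, not the actual 
> WebSocketFrameEncoder code):
> 
> import io.netty.buffer.ByteBuf;
> import io.netty.channel.ChannelHandlerContext;
> import io.netty.channel.ChannelOutboundHandlerAdapter;
> import io.netty.channel.ChannelPromise;
> 
> public class FragmentingEncoderSketch extends ChannelOutboundHandlerAdapter {
>    @Override
>    public void write(ChannelHandlerContext ctx, Object msg, ChannelPromise promise) {
>       ByteBuf in = (ByteBuf) msg;
>       try {
>          while (in.isReadable()) {
>             // readRetainedSlice hands out fragments that carry their own reference count
>             ByteBuf fragment = in.readRetainedSlice(Math.min(1024, in.readableBytes()));
>             ctx.write(fragment, ctx.newPromise());   // the pipeline releases each fragment after writing it
>          }
>          // simplified: real code would complete the promise only when the last fragment is written
>          promise.setSuccess();
>       } finally {
>          in.release();   // without this, the original pooled buffer leaks, as in the report above
>       }
>    }
> }
> 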
> This commit (9fac4b8) is not in 2.10:
> git tag --contains 9fac4b8
> 2.11.0
> So we wanted to test whether rolling back to 2.10 would fix the issue. The 
> 2.10 version has now been running for 3 days without issues, while on 2.11 it 
> would have crashed with out-of-memory errors 3 times already.
>  
> The issue could be in a different commit/location than what is analysed 
> above, but 2.10 seems to work without issues.
> We also wiped our entire journal before going to 2.10, so I want to test 2.11 
> again with a clean journal and see whether wiping the data store could be what 
> made 2.10 function better. More information will follow.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
