[ 
https://issues.apache.org/jira/browse/HBASE-26092?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17381512#comment-17381512
 ] 

Michael Stack commented on HBASE-26092:
---------------------------------------

With replication enabled on a ~700-node cluster, we'd lose an RS every day or so 
to crashes that were variants of the one below (here while building the cellblock):
{code:java}
Stack: [0x00007edc2b215000,0x00007edc2b316000],  sp=0x00007edc2b314480,  free space=1021k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
J 12332 C2 org.apache.hadoop.hbase.codec.KeyValueCodecWithTags$KeyValueEncoder.write(Lorg/apache/hadoop/hbase/Cell;)V (27 bytes) @ 0x00007f065ada3047 [0x00007f065ada2c40+0x407]
J 16249 C2 org.apache.hadoop.hbase.ipc.CellBlockBuilder.encodeCellsTo(Ljava/io/OutputStream;Lorg/apache/hadoop/hbase/CellScanner;Lorg/apache/hadoop/hbase/codec/Codec;Lorg/apache/hadoop/io/compress/CompressionCodec;)V (138 bytes) @ 0x00007f065b716550 [0x00007f065b716380+0x1d0]
J 6822 C2 org.apache.hadoop.hbase.ipc.CellBlockBuilder.buildCellBlock(Lorg/apache/hadoop/hbase/codec/Codec;Lorg/apache/hadoop/io/compress/CompressionCodec;Lorg/apache/hadoop/hbase/CellScanner;Lorg/apache/hadoop/hbase/ipc/CellBlockBuilder$OutputStreamSupplier;)Z (113 bytes) @ 0x00007f0659917424 [0x00007f0659916fc0+0x464]
J 6824 C2 org.apache.hadoop.hbase.ipc.NettyRpcDuplexHandler.writeRequest(Lorg/apache/hbase/thirdparty/io/netty/channel/ChannelHandlerContext;Lorg/apache/hadoop/hbase/ipc/Call;Lorg/apache/hbase/thirdparty/io/netty/channel/ChannelPromise;)V (370 bytes) @ 0x00007f065a4041f4 [0x00007f065a403fc0+0x234]
J 6823 C2 org.apache.hadoop.hbase.ipc.NettyRpcDuplexHandler.write(Lorg/apache/hbase/thirdparty/io/netty/channel/ChannelHandlerContext;Ljava/lang/Object;Lorg/apache/hbase/thirdparty/io/netty/channel/ChannelPromise;)V (30 bytes) @ 0x00007f065962d414 [0x00007f065962d3e0+0x34]
J 5492 C2 org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.write(Ljava/lang/Object;ZLorg/apache/hbase/thirdparty/io/netty/channel/ChannelPromise;)V (149 bytes) @ 0x00007f0659f04f48 [0x00007f0659f04c60+0x2e8]
J 6996 C2 org.apache.hadoop.hbase.ipc.NettyRpcConnection$6$1.run()V (22 bytes) @ 0x00007f06599d4eec [0x00007f06599d4c80+0x26c]
J 27396 C2 org.apache.hbase.thirdparty.io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(J)Z (106 bytes) @ 0x00007f065c15e660 [0x00007f065c15e400+0x260]
J 21998% C2 org.apache.hbase.thirdparty.io.netty.channel.epoll.EpollEventLoop.run()V (461 bytes) @ 0x00007f0659de9570 [0x00007f0659de9000+0x570]
j  org.apache.hbase.thirdparty.io.netty.util.concurrent.SingleThreadEventExecutor$4.run()V+44
j  org.apache.hbase.thirdparty.io.netty.util.internal.ThreadExecutorMap$2.run()V+11
j  org.apache.hbase.thirdparty.io.netty.util.concurrent.FastThreadLocalRunnable.run()V+4
 {code}

> JVM core dump in the replication path
> -------------------------------------
>
>                 Key: HBASE-26092
>                 URL: https://issues.apache.org/jira/browse/HBASE-26092
>             Project: HBase
>          Issue Type: Bug
>          Components: Replication
>    Affects Versions: 2.3.5
>            Reporter: Huaxiang Sun
>            Priority: Critical
>
> When replication is turned on, we found the following core dump in the region 
> server. 
> I checked the core dump for replication and I think I have an idea of the 
> cause. For replication, when an RS receives walEdits from the remote cluster, 
> it needs to send them on to the final RS. In this case NettyRpcConnection is 
> used: calls are queued while they still refer to a ByteBuffer in the context 
> of replicationHandler, and that buffer is returned to the pool once the 
> handler returns. The core dump happens because the ByteBuffer has been reused 
> by the time the queued call is written. This asynchronous processing needs a 
> ref count.
>  
> Feel free to take it, otherwise, I will try to work on a patch later.
>  
>  
> {code:java}
> Stack: [0x00007fb1bf039000,0x00007fb1bf13a000],  sp=0x00007fb1bf138560,  free space=1021k
> Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
> J 28175 C2 org.apache.hadoop.hbase.ByteBufferKeyValue.write(Ljava/io/OutputStream;Z)I (21 bytes) @ 0x00007fdbbbb2663c [0x00007fdbbbb263c0+0x27c]
> J 14912 C2 org.apache.hadoop.hbase.ipc.NettyRpcDuplexHandler.writeRequest(Lorg/apache/hbase/thirdparty/io/netty/channel/ChannelHandlerContext;Lorg/apache/hadoop/hbase/ipc/Call;Lorg/apache/hbase/thirdparty/io/netty/channel/ChannelPromise;)V (370 bytes) @ 0x00007fdbbb94b590 [0x00007fdbbb949c00+0x1990]
> J 14911 C2 org.apache.hadoop.hbase.ipc.NettyRpcDuplexHandler.write(Lorg/apache/hbase/thirdparty/io/netty/channel/ChannelHandlerContext;Ljava/lang/Object;Lorg/apache/hbase/thirdparty/io/netty/channel/ChannelPromise;)V (30 bytes) @ 0x00007fdbb972d1d4 [0x00007fdbb972d1a0+0x34]
> J 30476 C2 org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.write(Ljava/lang/Object;ZLorg/apache/hbase/thirdparty/io/netty/channel/ChannelPromise;)V (149 bytes) @ 0x00007fdbbd4e7084 [0x00007fdbbd4e6900+0x784]
> J 14914 C2 org.apache.hadoop.hbase.ipc.NettyRpcConnection$6$1.run()V (22 bytes) @ 0x00007fdbbb9344ec [0x00007fdbbb934280+0x26c]
> J 23528 C2 org.apache.hbase.thirdparty.io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(J)Z (106 bytes) @ 0x00007fdbbcbb0efc [0x00007fdbbcbb0c40+0x2bc]
> J 15987% C2 org.apache.hbase.thirdparty.io.netty.channel.epoll.EpollEventLoop.run()V (461 bytes) @ 0x00007fdbbbaf1580 [0x00007fdbbbaf1360+0x220]
> j  org.apache.hbase.thirdparty.io.netty.util.concurrent.SingleThreadEventExecutor$4.run()V+44
> j  org.apache.hbase.thirdparty.io.netty.util.internal.ThreadExecutorMap$2.run()V+11
> j  org.apache.hbase.thirdparty.io.netty.util.concurrent.FastThreadLocalRunnable.run()V+4
> {code}
>  
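
The ref-counting idea in the description above can be illustrated with a minimal, self-contained Java sketch. This is not the HBase patch: BufferPool, PooledBuffer and handOff are hypothetical names made up for illustration, and the Netty event loop is simulated with a plain queue. The point it tries to show is that the handler takes an extra reference before queuing the asynchronous call, so the buffer goes back to the pool only after the queued call has finished reading it.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayDeque;
import java.util.concurrent.atomic.AtomicInteger;

final class RefCountedBufferSketch {

  /** Hypothetical pool: hands out fixed-size buffers and takes them back for reuse. */
  static final class BufferPool {
    private final ArrayDeque<ByteBuffer> free = new ArrayDeque<>();
    private final int bufferSize;

    BufferPool(int bufferSize) { this.bufferSize = bufferSize; }

    synchronized PooledBuffer allocate() {
      ByteBuffer b = free.poll();
      if (b == null) b = ByteBuffer.allocate(bufferSize);
      b.clear();
      return new PooledBuffer(this, b);
    }

    synchronized void recycle(ByteBuffer b) { free.push(b); }

    synchronized int freeCount() { return free.size(); }
  }

  /** Hypothetical buffer wrapper: returns to its pool only when the last holder releases it. */
  static final class PooledBuffer {
    private final BufferPool pool;
    private final ByteBuffer buf;
    private final AtomicInteger refCnt = new AtomicInteger(1); // the allocator holds one reference

    PooledBuffer(BufferPool pool, ByteBuffer buf) { this.pool = pool; this.buf = buf; }

    ByteBuffer nio() {
      if (refCnt.get() <= 0) throw new IllegalStateException("buffer already released");
      return buf;
    }

    PooledBuffer retain() {
      int c;
      do {
        c = refCnt.get();
        // Never resurrect a buffer that has already gone back to the pool.
        if (c <= 0) throw new IllegalStateException("buffer already released");
      } while (!refCnt.compareAndSet(c, c + 1));
      return this;
    }

    void release() {
      int c = refCnt.decrementAndGet();
      if (c == 0) pool.recycle(buf);
      else if (c < 0) throw new IllegalStateException("released too many times");
    }
  }

  /** Simulates the handler to event-loop hand-off; returns what the queued call read. */
  static String handOff(BufferPool pool) {
    ArrayDeque<Runnable> eventLoop = new ArrayDeque<>(); // stands in for the Netty event loop
    String[] seen = new String[1];

    // "Handler": fill a pooled buffer and queue an asynchronous write that refers to it.
    PooledBuffer cells = pool.allocate();
    byte[] payload = "cell-1".getBytes(StandardCharsets.UTF_8);
    cells.nio().put(payload);
    cells.retain(); // reference owned by the queued call
    eventLoop.add(() -> {
      try {
        byte[] out = new byte[payload.length];
        for (int i = 0; i < out.length; i++) out[i] = cells.nio().get(i);
        seen[0] = new String(out, StandardCharsets.UTF_8);
      } finally {
        cells.release(); // last reference dropped: the buffer is recycled only now
      }
    });
    cells.release(); // handler returns; without retain() the buffer would be recycled here

    // Another request is served before the event loop runs and writes into its own buffer.
    PooledBuffer other = pool.allocate();
    other.nio().put("XXXXXX".getBytes(StandardCharsets.UTF_8));
    other.release();

    eventLoop.forEach(Runnable::run);
    return seen[0];
  }

  public static void main(String[] args) {
    System.out.println(handOff(new BufferPool(64))); // prints "cell-1"
  }
}
```

If the retain() call is removed from the sketch, the handler's release() recycles the buffer immediately, the second allocate() hands the same memory to the other request, and the queued call then fails on the released-buffer check in nio(). With a raw pooled ByteBuffer there is no such check, which is the reuse-while-still-referenced situation the description blames for the crash.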



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
