[ https://issues.apache.org/jira/browse/HBASE-28584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17849365#comment-17849365 ]
Andrew Kyle Purtell commented on HBASE-28584:
---------------------------------------------

In our crashes we have observed that the frame triggering the SIGSEGV or SIGBUS is native memcmp or memcpy, one of the stubs our use of Unsafe leads to, so I have been focused, perhaps unnecessarily, on adding assertions to bring back bounds checking for debugging in test environments. (Our use of Unsafe deliberately bypasses the usual Java language and memory model safeties for performance.) However, the frame in this report has no use of Unsafe:

{noformat}
# Problematic frame:
# J 24625 c2 org.apache.hadoop.hbase.util.ByteBufferUtils.copyBufferToStream(Ljava/io/OutputStream;Ljava/nio/ByteBuffer;II)V (75 bytes) @ 0x00007f7546873b69 [0x00007f7546873960+0x0000000000000209]
{noformat}

copyBufferToStream is straight Java code using standard ByteBuffer APIs. Perhaps the issue is entirely due to Netty's use of direct buffers, and Netty's requirement that we get reference counting correct in all circumstances. In that regard, I have been wondering whether, under load and allocation pressure, we are hitting race conditions we have always had but that do not manifest except under a lot of pressure. One potential case: we accidentally release a buffer that is still in use. In this scenario the reader of the released buffer would normally have enough time to complete, but under high load and allocation pressure the buffer would be quickly recycled, and the reader would end up racing with either the reclamation or a new writer. So we should still debug with assertions (Preconditions) to try to catch this, but they should focus on asserting the validity, shape, and ownership of the direct buffer somehow.
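To make the assertion idea above concrete, here is a minimal, self-contained sketch in plain JDK code, not the actual Netty/HBase types; the names RefCountedBuffer and checkAccessible are hypothetical. It shows the kind of validity check a Preconditions-style guard would enforce before every read of a reference-counted direct buffer:

```java
import java.nio.ByteBuffer;
import java.util.concurrent.atomic.AtomicInteger;

// Simplified stand-in for a Netty-style reference-counted direct buffer.
// Illustrates failing fast on use-after-release instead of letting the
// read race with reclamation or a new writer.
class RefCountedBuffer {
    private final ByteBuffer buf;
    private final AtomicInteger refCnt = new AtomicInteger(1);

    RefCountedBuffer(int capacity) {
        this.buf = ByteBuffer.allocateDirect(capacity);
    }

    // The proposed assertion: a released buffer (refCnt == 0) may already
    // have been recycled and be owned by another writer, so reject the read.
    private void checkAccessible() {
        int cnt = refCnt.get();
        if (cnt <= 0) {
            throw new IllegalStateException("refCnt: " + cnt + " (buffer already released)");
        }
    }

    void retain() {
        if (refCnt.getAndIncrement() <= 0) {
            refCnt.getAndDecrement();
            throw new IllegalStateException("retain on a released buffer");
        }
    }

    /** @return true when this release dropped the last reference. */
    boolean release() {
        return refCnt.decrementAndGet() == 0;
    }

    byte get(int index) {
        checkAccessible();  // deterministic failure instead of a SIGSEGV
        return buf.get(index);
    }

    int refCnt() {
        return refCnt.get();
    }
}
```

One limitation worth noting: once a released buffer has been recycled and re-retained by a new owner, its refCnt is positive again, so a check like this only catches reads in the window before reuse; it cannot detect the race after recycling.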
> RS SIGSEGV under heavy replication load
> ---------------------------------------
>
>                 Key: HBASE-28584
>                 URL: https://issues.apache.org/jira/browse/HBASE-28584
>             Project: HBase
>          Issue Type: Bug
>          Components: regionserver
>    Affects Versions: 2.5.6
>         Environment: RHEL 7.9
> JDK 11.0.23
> Hadoop 3.2.4
> HBase 2.5.6
>            Reporter: Whitney Jackson
>            Priority: Major
>
> I'm observing RS crashes under heavy replication load:
>
> {code:java}
> #
> # A fatal error has been detected by the Java Runtime Environment:
> #
> # SIGSEGV (0xb) at pc=0x00007f7546873b69, pid=29890, tid=36828
> #
> # JRE version: Java(TM) SE Runtime Environment 18.9 (11.0.23+7) (build 11.0.23+7-LTS-222)
> # Java VM: Java HotSpot(TM) 64-Bit Server VM 18.9 (11.0.23+7-LTS-222, mixed mode, tiered, compressed oops, g1 gc, linux-amd64)
> # Problematic frame:
> # J 24625 c2 org.apache.hadoop.hbase.util.ByteBufferUtils.copyBufferToStream(Ljava/io/OutputStream;Ljava/nio/ByteBuffer;II)V (75 bytes) @ 0x00007f7546873b69 [0x00007f7546873960+0x0000000000000209]
> {code}
>
> The heavier load comes when a replication peer has been disabled for several hours for patching etc. When the peer is re-enabled the replication load is high until the peer is all caught up. The crashes happen on the cluster receiving the replication edits.
>
> I believe this problem started after upgrading from 2.4.x to 2.5.x.
>
> One possibly relevant non-standard config I run with:
>
> {code:java}
> <property>
>   <name>hbase.region.store.parallel.put.limit</name>
>   <!-- Default: 10 -->
>   <value>100</value>
>   <description>Added after seeing "failed to accept edits" replication errors in the destination region servers indicating this limit was being exceeded while trying to process replication edits.</description>
> </property>
> {code}
>
> I understand from other Jiras that the problem is likely around direct memory usage by Netty. I haven't yet tried switching the Netty allocator to {{unpooled}} or {{heap}}.
> I also haven't yet tried any of the {{io.netty.allocator.*}} options.
>
> {{MaxDirectMemorySize}} is set to 26g.
>
> Here's the full stack for the relevant thread:
>
> {code:java}
> Stack: [0x00007f72e2e5f000,0x00007f72e2f60000], sp=0x00007f72e2f5e450, free space=1021k
> Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
> J 24625 c2 org.apache.hadoop.hbase.util.ByteBufferUtils.copyBufferToStream(Ljava/io/OutputStream;Ljava/nio/ByteBuffer;II)V (75 bytes) @ 0x00007f7546873b69 [0x00007f7546873960+0x0000000000000209]
> J 26253 c2 org.apache.hadoop.hbase.ByteBufferKeyValue.write(Ljava/io/OutputStream;Z)I (21 bytes) @ 0x00007f7545af2d84 [0x00007f7545af2d20+0x0000000000000064]
> J 22971 c2 org.apache.hadoop.hbase.codec.KeyValueCodecWithTags$KeyValueEncoder.write(Lorg/apache/hadoop/hbase/Cell;)V (27 bytes) @ 0x00007f754663f700 [0x00007f754663f4c0+0x0000000000000240]
> J 25251 c2 org.apache.hadoop.hbase.ipc.NettyRpcDuplexHandler.write(Lorg/apache/hbase/thirdparty/io/netty/channel/ChannelHandlerContext;Ljava/lang/Object;Lorg/apache/hbase/thirdparty/io/netty/channel/ChannelPromise;)V (90 bytes) @ 0x00007f7546a53038 [0x00007f7546a50e60+0x00000000000021d8]
> J 21182 c2 org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(Ljava/lang/Object;Lorg/apache/hbase/thirdparty/io/netty/channel/ChannelPromise;)V (73 bytes) @ 0x00007f7545f4d90c [0x00007f7545f4d3a0+0x000000000000056c]
> J 21181 c2 org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.write(Ljava/lang/Object;ZLorg/apache/hbase/thirdparty/io/netty/channel/ChannelPromise;)V (149 bytes) @ 0x00007f7545fd680c [0x00007f7545fd65e0+0x000000000000022c]
> J 25389 c2 org.apache.hadoop.hbase.ipc.NettyRpcConnection$$Lambda$247.run()V (16 bytes) @ 0x00007f7546ade660 [0x00007f7546ade140+0x0000000000000520]
> J 24098 c2 org.apache.hbase.thirdparty.io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(J)Z (109 bytes) @ 0x00007f754678fbb8 [0x00007f754678f8e0+0x00000000000002d8]
> J 27297% c2 org.apache.hbase.thirdparty.io.netty.channel.epoll.EpollEventLoop.run()V (603 bytes) @ 0x00007f75466c4d48 [0x00007f75466c4c80+0x00000000000000c8]
> j org.apache.hbase.thirdparty.io.netty.util.concurrent.SingleThreadEventExecutor$4.run()V+44
> j org.apache.hbase.thirdparty.io.netty.util.internal.ThreadExecutorMap$2.run()V+11
> j org.apache.hbase.thirdparty.io.netty.util.concurrent.FastThreadLocalRunnable.run()V+4
> J 12278 c1 java.lang.Thread.run()V java.base@11.0.23 (17 bytes) @ 0x00007f753e11f084 [0x00007f753e11ef40+0x0000000000000144]
> v ~StubRoutines::call_stub
> V [libjvm.so+0x85574a] JavaCalls::call_helper(JavaValue*, methodHandle const&, JavaCallArguments*, Thread*)+0x27a
> V [libjvm.so+0x853d2e] JavaCalls::call_virtual(JavaValue*, Handle, Klass*, Symbol*, Symbol*, Thread*)+0x19e
> V [libjvm.so+0x8ffddf] thread_entry(JavaThread*, Thread*)+0x9f
> V [libjvm.so+0xdb68d1] JavaThread::thread_main_inner()+0x131
> V [libjvm.so+0xdb2c4c] Thread::call_run()+0x13c
> V [libjvm.so+0xc1f2e6] thread_native_entry(Thread*)+0xe6
> {code}
>
> --
> This message was sent by Atlassian Jira
> (v8.20.10#820010)
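A note for anyone wanting to experiment with the allocator settings mentioned in the report: they are typically passed as RegionServer JVM system properties in hbase-env.sh. The sketch below rests on assumptions to verify against your build: the org.apache.hbase.thirdparty prefix assumes HBase's shaded Netty relocates the stock io.netty.* property names, and the specific tunables shown (numDirectArenas, noPreferDirect) are upstream Netty properties.

```shell
# Sketch only -- verify property names against your HBase/Netty versions.
# Cap direct memory as in the report:
export HBASE_REGIONSERVER_OPTS="$HBASE_REGIONSERVER_OPTS -XX:MaxDirectMemorySize=26g"

# Upstream Netty allocator tunables; under HBase's shaded Netty the names
# are assumed to carry the thirdparty relocation prefix:
export HBASE_REGIONSERVER_OPTS="$HBASE_REGIONSERVER_OPTS \
  -Dorg.apache.hbase.thirdparty.io.netty.allocator.numDirectArenas=0 \
  -Dorg.apache.hbase.thirdparty.io.netty.noPreferDirect=true"
```

Setting numDirectArenas=0 steers the pooled allocator away from pooled direct arenas, and noPreferDirect=true makes allocators prefer heap buffers, which approximates the {{unpooled}}/{{heap}} experiments described above.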