[jira] [Resolved] (HBASE-28448) CompressionTest hangs when run over an Ozone ofs path

2024-05-09 Thread Wei-Chiu Chuang (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-28448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei-Chiu Chuang resolved HBASE-28448.
-
Resolution: Fixed

> CompressionTest hangs when run over an Ozone ofs path
> 
>
> Key: HBASE-28448
> URL: https://issues.apache.org/jira/browse/HBASE-28448
> Project: HBase
>  Issue Type: Bug
>Reporter: Pratyush Bhatt
>Assignee: Wei-Chiu Chuang
>Priority: Major
>  Labels: ozone, pull-request-available
> Fix For: 4.0.0-alpha-1, 2.7.0, 3.0.0-beta-2, 2.6.1
>
> Attachments: hbase_ozone_compression.jstack
>
>
> If we run the CompressionTest over an HDFS path, it works fine:
> {code:java}
> hbase org.apache.hadoop.hbase.util.CompressionTest 
> hdfs://ns1/tmp/dir1/dir2/test_file.txt snappy
> 24/03/20 06:08:43 WARN impl.MetricsConfig: Cannot locate configuration: tried 
> hadoop-metrics2-hbase.properties,hadoop-metrics2.properties
> 24/03/20 06:08:43 INFO impl.MetricsSystemImpl: Scheduled Metric snapshot 
> period at 10 second(s).
> 24/03/20 06:08:43 INFO impl.MetricsSystemImpl: HBase metrics system started
> 24/03/20 06:08:43 INFO metrics.MetricRegistries: Loaded MetricRegistries 
> class org.apache.hadoop.hbase.metrics.impl.MetricRegistriesImpl
> 24/03/20 06:08:43 INFO compress.CodecPool: Got brand-new compressor [.snappy]
> 24/03/20 06:08:43 INFO compress.CodecPool: Got brand-new compressor [.snappy]
> 24/03/20 06:08:44 INFO compress.CodecPool: Got brand-new decompressor 
> [.snappy]
> SUCCESS {code}
> The command exits. But when the same test is run over an ofs path, the command 
> hangs.
> {code:java}
> hbase org.apache.hadoop.hbase.util.CompressionTest 
> ofs://ozone1710862004/test-222compression-vol/compression-buck2/test_file.txt 
> snappy
> 24/03/20 06:05:19 INFO protocolPB.OmTransportFactory: Loading OM transport 
> implementation 
> org.apache.hadoop.ozone.om.protocolPB.Hadoop3OmTransportFactory as specified 
> by configuration.
> 24/03/20 06:05:20 INFO client.ClientTrustManager: Loading certificates for 
> client.
> 24/03/20 06:05:20 WARN impl.MetricsConfig: Cannot locate configuration: tried 
> hadoop-metrics2-hbase.properties,hadoop-metrics2.properties
> 24/03/20 06:05:20 INFO impl.MetricsSystemImpl: Scheduled Metric snapshot 
> period at 10 second(s).
> 24/03/20 06:05:20 INFO impl.MetricsSystemImpl: HBase metrics system started
> 24/03/20 06:05:20 INFO metrics.MetricRegistries: Loaded MetricRegistries 
> class org.apache.hadoop.hbase.metrics.impl.MetricRegistriesImpl
> 24/03/20 06:05:20 INFO rpc.RpcClient: Creating Volume: 
> test-222compression-vol, with om as owner and space quota set to -1 bytes, 
> counts quota set to -1
> 24/03/20 06:05:20 INFO rpc.RpcClient: Creating Bucket: 
> test-222compression-vol/compression-buck2, with bucket layout 
> FILE_SYSTEM_OPTIMIZED, om as owner, Versioning false, Storage Type set to 
> DISK and Encryption set to false, Replication Type set to server-side default 
> replication type, Namespace Quota set to -1, Space Quota set to -1
> 24/03/20 06:05:21 INFO compress.CodecPool: Got brand-new compressor [.snappy]
> 24/03/20 06:05:21 INFO compress.CodecPool: Got brand-new compressor [.snappy]
> 24/03/20 06:05:21 WARN impl.MetricsSystemImpl: HBase metrics system already 
> initialized!
> 24/03/20 06:05:21 INFO metrics.MetricRegistries: Loaded MetricRegistries 
> class org.apache.ratis.metrics.dropwizard3.Dm3MetricRegistriesImpl
> 24/03/20 06:05:22 INFO compress.CodecPool: Got brand-new decompressor 
> [.snappy]
> SUCCESS 
> .
> .
> .{code}
> The command doesn't exit.
> Attaching the jstack of the process below:
> [^hbase_ozone_compression.jstack]
> cc: [~weichiu] 
>  
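The log above shows SUCCESS being printed before the hang, so the compression round trip itself completes. Below is a minimal sketch of how a command-line tool can guarantee it exits in that situation, assuming the hang is caused by non-daemon client threads (for example, Ozone client threads) that are still running when main() returns; it is illustrative only and not the actual HBASE-28448 patch.

{code:java}
// Illustrative sketch only; not the real org.apache.hadoop.hbase.util.CompressionTest.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public final class CompressionTestExitSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path path = new Path(args[0]);
    // try-with-resources guarantees FileSystem.close(), which gives the
    // underlying client (HDFS, Ozone, ...) a chance to stop its threads.
    try (FileSystem fs = path.getFileSystem(conf)) {
      // ... perform the compress/decompress round trip here, as the real
      // CompressionTest does ...
    }
    System.out.println("SUCCESS");
    // If a library still leaves non-daemon threads behind, an explicit exit
    // is the only way to guarantee the command returns to the shell.
    System.exit(0);
  }
}
{code}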



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HBASE-28584) RS SIGSEGV under heavy replication load

2024-05-09 Thread Whitney Jackson (Jira)
Whitney Jackson created HBASE-28584:
---

 Summary: RS SIGSEGV under heavy replication load
 Key: HBASE-28584
 URL: https://issues.apache.org/jira/browse/HBASE-28584
 Project: HBase
  Issue Type: Bug
  Components: regionserver
Affects Versions: 2.5.6
 Environment: RHEL 7.9
JDK 11.0.23
Hadoop 3.2.4
HBase 2.5.6
Reporter: Whitney Jackson


I'm observing RS crashes under heavy replication load:

 
{code:java}
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x7f7546873b69, pid=29890, tid=36828
#
# JRE version: Java(TM) SE Runtime Environment 18.9 (11.0.23+7) (build 
11.0.23+7-LTS-222)
# Java VM: Java HotSpot(TM) 64-Bit Server VM 18.9 (11.0.23+7-LTS-222, mixed 
mode, tiered, compressed oops, g1 gc, linux-amd64)
# Problematic frame:
# J 24625 c2 
org.apache.hadoop.hbase.util.ByteBufferUtils.copyBufferToStream(Ljava/io/OutputStream;Ljava/nio/ByteBuffer;II)V
 (75 bytes) @ 0x7f7546873b69 [0x7f7546873960+0x0209]
{code}
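For context, the problematic frame copies the bytes of a ByteBuffer to an OutputStream. The sketch below is a simplified illustration of that pattern, not the actual org.apache.hadoop.hbase.util.ByteBufferUtils source: if the buffer is a pooled direct buffer that was already released back to the allocator, the copy reads freed native memory and can fault (SIGSEGV) instead of throwing a Java exception, which would be consistent with the crash signature above.

{code:java}
// Simplified, illustrative version of a copyBufferToStream-style helper.
static void copyBufferToStream(java.io.OutputStream out, java.nio.ByteBuffer buf,
    int offset, int length) throws java.io.IOException {
  if (buf.hasArray()) {
    // Heap buffer: the copy stays inside the Java heap.
    out.write(buf.array(), buf.arrayOffset() + offset, length);
  } else {
    // Direct buffer: each get() dereferences off-heap memory. If that memory
    // has already been returned to a pool (use-after-free), the JVM can crash
    // here rather than raise an exception.
    for (int i = 0; i < length; i++) {
      out.write(buf.get(offset + i));
    }
  }
}
{code}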
 

The heavier load comes when a replication peer has been disabled for several 
hours, e.g. for patching. When the peer is re-enabled, the replication load 
stays high until the peer has caught up. The crashes happen on the cluster 
receiving the replication edits.

 

I believe this problem started after upgrading from 2.4.x to 2.5.x.

 

One possibly relevant non-standard config I run with:
{code:java}
<property>
  <name>hbase.region.store.parallel.put.limit</name>
  <value>100</value>
  <description>Added after seeing "failed to accept edits" replication errors
  in the destination region servers indicating this limit was being exceeded
  while trying to process replication edits.</description>
</property>
{code}
 

I understand from other Jiras that the problem is likely around direct memory 
usage by Netty. I haven't yet tried switching the Netty allocator to 
{{unpooled}} or {{heap}}, and I also haven't yet tried any of the 
{{io.netty.allocator.*}} options.

 

{{MaxDirectMemorySize}} is set to 26g.
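For reference, here is one hedged way those experiments could be wired up in hbase-env.sh. The relocated property prefix ({{org.apache.hbase.thirdparty.io.netty.*}}) and the specific property names ({{allocator.type}}, {{noPreferDirect}}) are assumptions based on upstream Netty defaults and HBase's shaded Netty packaging, not settings confirmed in this thread; verify them against the version in use before relying on this.

{code:java}
# Hypothetical hbase-env.sh additions for the experiments described above.
# The org.apache.hbase.thirdparty.* prefix assumes the shaded Netty relocates
# the upstream io.netty.* system properties; confirm against your build.
export HBASE_REGIONSERVER_OPTS="$HBASE_REGIONSERVER_OPTS \
  -XX:MaxDirectMemorySize=26g \
  -Dorg.apache.hbase.thirdparty.io.netty.allocator.type=unpooled \
  -Dorg.apache.hbase.thirdparty.io.netty.noPreferDirect=true"
{code}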

 

Here's the full stack for the relevant thread:

 
{code:java}
Stack: [0x7f72e2e5f000,0x7f72e2f6],  sp=0x7f72e2f5e450,  free 
space=1021k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
J 24625 c2 
org.apache.hadoop.hbase.util.ByteBufferUtils.copyBufferToStream(Ljava/io/OutputStream;Ljava/nio/ByteBuffer;II)V
 (75 bytes) @ 0x7f7546873b69 [0x7f7546873960+0x0209]
J 26253 c2 
org.apache.hadoop.hbase.ByteBufferKeyValue.write(Ljava/io/OutputStream;Z)I (21 
bytes) @ 0x7f7545af2d84 [0x7f7545af2d20+0x0064]
J 22971 c2 
org.apache.hadoop.hbase.codec.KeyValueCodecWithTags$KeyValueEncoder.write(Lorg/apache/hadoop/hbase/Cell;)V
 (27 bytes) @ 0x7f754663f700 [0x7f754663f4c0+0x0240]
J 25251 c2 
org.apache.hadoop.hbase.ipc.NettyRpcDuplexHandler.write(Lorg/apache/hbase/thirdparty/io/netty/channel/ChannelHandlerContext;Ljava/lang/Object;Lorg/apache/hbase/thirdparty/io/netty/channel/ChannelPromise;)V
 (90 bytes) @ 0x7f7546a53038 [0x7f7546a50e60+0x21d8]
J 21182 c2 
org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(Ljava/lang/Object;Lorg/apache/hbase/thirdparty/io/netty/channel/ChannelPromise;)V
 (73 bytes) @ 0x7f7545f4d90c [0x7f7545f4d3a0+0x056c]
J 21181 c2 
org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.write(Ljava/lang/Object;ZLorg/apache/hbase/thirdparty/io/netty/channel/ChannelPromise;)V
 (149 bytes) @ 0x7f7545fd680c [0x7f7545fd65e0+0x022c]
J 25389 c2 org.apache.hadoop.hbase.ipc.NettyRpcConnection$$Lambda$247.run()V 
(16 bytes) @ 0x7f7546ade660 [0x7f7546ade140+0x0520]
J 24098 c2 
org.apache.hbase.thirdparty.io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(J)Z
 (109 bytes) @ 0x7f754678fbb8 [0x7f754678f8e0+0x02d8]
J 27297% c2 
org.apache.hbase.thirdparty.io.netty.channel.epoll.EpollEventLoop.run()V (603 
bytes) @ 0x7f75466c4d48 [0x7f75466c4c80+0x00c8]
j  
org.apache.hbase.thirdparty.io.netty.util.concurrent.SingleThreadEventExecutor$4.run()V+44
j  
org.apache.hbase.thirdparty.io.netty.util.internal.ThreadExecutorMap$2.run()V+11
j  
org.apache.hbase.thirdparty.io.netty.util.concurrent.FastThreadLocalRunnable.run()V+4
J 12278 c1 java.lang.Thread.run()V java.base@11.0.23 (17 bytes) @ 
0x7f753e11f084 [0x7f753e11ef40+0x0144]
v  ~StubRoutines::call_stub
V  [libjvm.so+0x85574a]  JavaCalls::call_helper(JavaValue*, methodHandle 
const&, JavaCallArguments*, Thread*)+0x27a
V  [libjvm.so+0x853d2e]  JavaCalls::call_virtual(JavaValue*, Handle, Klass*, 
Symbol*, Symbol*, Thread*)+0x19e
V  [libjvm.so+0x8ffddf]  thread_entry(JavaThread*, Thread*)+0x9f
V  [libjvm.so+0xdb68d1]  JavaThread::thread_main_inner()+0x131
V  [libjvm.so+0xdb2c4c]  Thread::call_run()+0x13c
V  [libjvm.so+0xc1f2e6]  

[jira] [Created] (HBASE-28583) Upgrade from 2.5.8 to 3.0 crash with InvalidProtocolBufferException: Message missing required fields: old_table_schema

2024-05-09 Thread Ke Han (Jira)
Ke Han created HBASE-28583:
--

 Summary: Upgrade from 2.5.8 to 3.0 crash with 
InvalidProtocolBufferException: Message missing required fields: 
old_table_schema
 Key: HBASE-28583
 URL: https://issues.apache.org/jira/browse/HBASE-28583
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 2.5.8, 3.0.0
Reporter: Ke Han
 Attachments: commands.txt, hbase--master-cc13b0df0f3a.log, 
persistent.tar.gz

When migrating data from a 2.5.8 cluster (1 HM, 2 RS, 1 HDFS) to 3.0.0 (1 HM, 2 RS, 
2 HDFS), I hit the following exception and the upgrade failed.

 
{code:java}
2024-05-09T20:16:20,638 ERROR [master/hmaster:16000:becomeActiveMaster] 
master.HMaster: Failed to become active master
org.apache.hbase.thirdparty.com.google.protobuf.InvalidProtocolBufferException: 
Message missing required fields: old_table_schema
        at 
org.apache.hbase.thirdparty.com.google.protobuf.UninitializedMessageException.asInvalidProtocolBufferException(UninitializedMessageException.java:56)
 ~[hbase-shaded-protobuf-4.1.7.jar:4.1.7]
        at 
org.apache.hbase.thirdparty.com.google.protobuf.AbstractParser.checkMessageInitialized(AbstractParser.java:45)
 ~[hbase-shaded-protobuf-4.1.7.jar:4.1.7]
        at 
org.apache.hbase.thirdparty.com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:97)
 ~[hbase-shaded-protobuf-4.1.7.jar:4.1.7]
        at 
org.apache.hbase.thirdparty.com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:102)
 ~[hbase-shaded-protobuf-4.1.7.jar:4.1.7]
        at 
org.apache.hbase.thirdparty.com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:25)
 ~[hbase-shaded-protobuf-4.1.7.jar:4.1.7]
        at 
org.apache.hbase.thirdparty.com.google.protobuf.Any.unpack(Any.java:118) 
~[hbase-shaded-protobuf-4.1.7.jar:4.1.7]
        at 
org.apache.hadoop.hbase.procedure2.ProcedureUtil$StateSerializer.deserialize(ProcedureUtil.java:125)
 ~[hbase-procedure-3.0.0-beta-2-SNAPSHOT.jar:3.0.0-beta-2-SNAPSHOT]
        at 
org.apache.hadoop.hbase.master.procedure.RestoreSnapshotProcedure.deserializeStateData(RestoreSnapshotProcedure.java:303)
 ~[hbase-server-3.0.0-beta-2-SNAPSHOT.jar:3.0.0-beta-2-SNAPSHOT]
        at 
org.apache.hadoop.hbase.procedure2.ProcedureUtil.convertToProcedure(ProcedureUtil.java:295)
 ~[hbase-procedure-3.0.0-beta-2-SNAPSHOT.jar:3.0.0-beta-2-SNAPSHOT]
        at 
org.apache.hadoop.hbase.procedure2.store.ProtoAndProcedure.getProcedure(ProtoAndProcedure.java:43)
 ~[hbase-procedure-3.0.0-beta-2-SNAPSHOT.jar:3.0.0-beta-2-SNAPSHOT]
        at 
org.apache.hadoop.hbase.procedure2.store.InMemoryProcedureIterator.next(InMemoryProcedureIterator.java:90)
 ~[hbase-procedure-3.0.0-beta-2-SNAPSHOT.jar:3.0.0-beta-2-SNAPSHOT]
        at 
org.apache.hadoop.hbase.procedure2.ProcedureExecutor.loadProcedures(ProcedureExecutor.java:517)
 ~[hbase-procedure-3.0.0-beta-2-SNAPSHOT.jar:3.0.0-beta-2-SNAPSHOT]
        at 
org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$200(ProcedureExecutor.java:80)
 ~[hbase-procedure-3.0.0-beta-2-SNAPSHOT.jar:3.0.0-beta-2-SNAPSHOT]
        at 
org.apache.hadoop.hbase.procedure2.ProcedureExecutor$1.load(ProcedureExecutor.java:344)
 ~[hbase-procedure-3.0.0-beta-2-SNAPSHOT.jar:3.0.0-beta-2-SNAPSHOT]
        at 
org.apache.hadoop.hbase.procedure2.store.region.RegionProcedureStore.load(RegionProcedureStore.java:287)
 ~[hbase-server-3.0.0-beta-2-SNAPSHOT.jar:3.0.0-beta-2-SNAPSHOT]
        at 
org.apache.hadoop.hbase.procedure2.ProcedureExecutor.load(ProcedureExecutor.java:335)
 ~[hbase-procedure-3.0.0-beta-2-SNAPSHOT.jar:3.0.0-beta-2-SNAPSHOT]
        at 
org.apache.hadoop.hbase.procedure2.ProcedureExecutor.init(ProcedureExecutor.java:666)
 ~[hbase-procedure-3.0.0-beta-2-SNAPSHOT.jar:3.0.0-beta-2-SNAPSHOT]
        at 
org.apache.hadoop.hbase.master.HMaster.createProcedureExecutor(HMaster.java:1860)
 ~[hbase-server-3.0.0-beta-2-SNAPSHOT.jar:3.0.0-beta-2-SNAPSHOT]
        at 
org.apache.hadoop.hbase.master.HMaster.finishActiveMasterInitialization(HMaster.java:1019)
 ~[hbase-server-3.0.0-beta-2-SNAPSHOT.jar:3.0.0-beta-2-SNAPSHOT]
        at 
org.apache.hadoop.hbase.master.HMaster.startActiveMasterManager(HMaster.java:2524)
 ~[hbase-server-3.0.0-beta-2-SNAPSHOT.jar:3.0.0-beta-2-SNAPSHOT]
        at 
org.apache.hadoop.hbase.master.HMaster.lambda$run$0(HMaster.java:613) 
~[hbase-server-3.0.0-beta-2-SNAPSHOT.jar:3.0.0-beta-2-SNAPSHOT]
        at 
org.apache.hadoop.hbase.trace.TraceUtil.lambda$tracedRunnable$2(TraceUtil.java:155)
 ~[hbase-common-3.0.0-beta-2-SNAPSHOT.jar:3.0.0-beta-2-SNAPSHOT]
        at java.lang.Thread.run(Thread.java:750) ~[?:1.8.0_362]
2024-05-09T20:16:20,639 ERROR [master/hmaster:16000:becomeActiveMaster] 
master.HMaster: * ABORTING master hmaster,16000,1715285771112: Unhandled 
exception. Starting shutdown. *
org.apache.hbase.thirdparty.com.google.protobuf.InvalidProtocolBufferException: 
Message missing 

[jira] [Created] (HBASE-28582) ModifyTableProcedure should not reset TRSP on region node when closing unused region replicas

2024-05-09 Thread Duo Zhang (Jira)
Duo Zhang created HBASE-28582:
-

 Summary: ModifyTableProcedure should not reset TRSP on region node 
when closing unused region replicas
 Key: HBASE-28582
 URL: https://issues.apache.org/jira/browse/HBASE-28582
 Project: HBase
  Issue Type: Bug
  Components: proc-v2
Reporter: Duo Zhang
Assignee: Duo Zhang


Found this while digging into HBASE-28522.

First, this is not safe, because MTP, unlike DTP, does not hold the exclusive 
lock all the time.
Second, even if we held the exclusive lock all the time, as shown in 
HBASE-28522, we could still hang there forever because SCP will not interrupt 
the TRSP.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)