Michael Stack created HBASE-25677:
-------------------------------------
Summary: Server+table counters on each scan #nextRaw invocation
becomes a bottleneck when heavy load
Key: HBASE-25677
URL: https://issues.apache.org/jira/browse/HBASE-25677
Project: HBase
Issue Type: Sub-task
Components: metrics
Affects Versions: 2.3.2
Reporter: Michael Stack
Assignee: Michael Stack
On a heavily loaded server mostly doing reads/scans, I saw that 90+% of handlers
were BLOCKED in this fashion in thread dumps:
{code}
"RpcServer.default.FPBQ.Fifo.handler=117,queue=17,port=16020" #161 daemon prio=5 os_prio=0 tid=0x00007f748757f000 nid=0x73e9 waiting for monitor entry [0x00007f74783e0000]
   java.lang.Thread.State: BLOCKED (on object monitor)
        at java.util.concurrent.ConcurrentHashMap.computeIfAbsent(ConcurrentHashMap.java:1674)
        - waiting to lock <0x00007f7647e3cc38> (a java.util.concurrent.ConcurrentHashMap$Node)
        at org.apache.hadoop.hbase.regionserver.MetricsTableQueryMeterImpl.getOrCreateTableMeter(MetricsTableQueryMeterImpl.java:80)
        at org.apache.hadoop.hbase.regionserver.MetricsTableQueryMeterImpl.updateTableReadQueryMeter(MetricsTableQueryMeterImpl.java:90)
        at org.apache.hadoop.hbase.regionserver.RegionServerTableMetrics.updateTableReadQueryMeter(RegionServerTableMetrics.java:89)
        at org.apache.hadoop.hbase.regionserver.MetricsRegionServer.updateReadQueryMeter(MetricsRegionServer.java:274)
        at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.nextRaw(HRegion.java:6742)
        at org.apache.hadoop.hbase.regionserver.RSRpcServices.scan(RSRpcServices.java:3319)
        - locked <0x00007f896c0165a0> (a org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl)
        at org.apache.hadoop.hbase.regionserver.RSRpcServices.scan(RSRpcServices.java:3566)
        at org.apache.hadoop.hbase.shaded.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:44858)
        at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:393)
        at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:133)
        at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:338)
        at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:318)
{code}
It kept up for good periods of time.
I saw it to a lesser extent on other servers with less load.
These RS had 400+ Regions, a good few of which were serving scan reads; the
server was doing ~1M hits a second. It was in this scenario that I saw the
above bottleneck.
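For illustration only, a minimal sketch of one common way to take pressure off a hot computeIfAbsent call: do a plain get() first (ConcurrentHashMap reads do not lock) and fall back to computeIfAbsent only on a miss. The class and method names below are hypothetical, this is not the HBase code, and it is not necessarily the change made for this issue.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Hypothetical helper, not an HBase class.
class GetFirstSketch {

    // Returns the existing value for key, creating it via factory only on a miss.
    static <K, V> V getOrCreate(ConcurrentHashMap<K, V> map, K key,
                                Function<? super K, ? extends V> factory) {
        // Fast path: get() does not take the bin lock that the handlers
        // in the thread dump are queued on.
        V existing = map.get(key);
        if (existing != null) {
            return existing;
        }
        // Slow path: only a miss (e.g. first read of a table) goes through
        // computeIfAbsent, which still guarantees a single value per key.
        return map.computeIfAbsent(key, factory);
    }

    public static void main(String[] args) {
        ConcurrentHashMap<String, StringBuilder> meters = new ConcurrentHashMap<>();
        StringBuilder first = getOrCreate(meters, "t1", k -> new StringBuilder(k));
        StringBuilder second = getOrCreate(meters, "t1", k -> new StringBuilder(k));
        System.out.println(first == second);
    }
}
```

This only shortens the critical section on the lookup; it still touches the shared map on every #nextRaw call.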
Looking at it, this came in when the parent issue's feature was added. That
feature added these read counts and also write counts. The write counts are
mostly batch-based. Let me do the same thing here for the reads: update the
central server+table count once after the scan is done rather than on every
invocation of #nextRaw.
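A rough sketch of the batching idea, assuming a per-scan-call local tally that is flushed to the shared per-table counter once at the end. All names here (BatchedReadCounterSketch, ScanCall, finish) are made up for illustration and are not the HBase classes or the actual patch.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

// Hypothetical stand-in for the server+table read counters.
class BatchedReadCounterSketch {

    // Shared per-table counters; this is the contended structure.
    static final ConcurrentHashMap<String, LongAdder> TABLE_READS = new ConcurrentHashMap<>();

    static void addTableReads(String table, long n) {
        TABLE_READS.computeIfAbsent(table, t -> new LongAdder()).add(n);
    }

    static long tableReads(String table) {
        LongAdder adder = TABLE_READS.get(table);
        return adder == null ? 0L : adder.sum();
    }

    // One scan call: counts locally, touches shared state once.
    static final class ScanCall {
        private final String table;
        private long localReads; // only touched by the handler thread running this call

        ScanCall(String table) {
            this.table = table;
        }

        // Called once per #nextRaw-style invocation; no shared state involved.
        void nextRaw() {
            localReads++;
        }

        // Called when the scan call is done; a single update to the shared counter.
        void finish() {
            if (localReads > 0) {
                addTableReads(table, localReads);
                localReads = 0;
            }
        }
    }

    public static void main(String[] args) {
        ScanCall call = new ScanCall("t1");
        for (int i = 0; i < 1000; i++) {
            call.nextRaw();
        }
        call.finish();
        System.out.println(tableReads("t1")); // prints 1000
    }
}
```

With this shape, a scan call that invokes nextRaw 1000 times hits the shared map once instead of 1000 times, at the cost of the counter lagging until the call finishes.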