Pratyush Bhatt created HDDS-10550:
-------------------------------------
Summary: [Hbase Ozone] YCSB Insert fails with
"OMKeyDeleteRequestWithFSO: Key delete failed."
Key: HDDS-10550
URL: https://issues.apache.org/jira/browse/HDDS-10550
Project: Apache Ozone
Issue Type: Bug
Components: OM
Reporter: Pratyush Bhatt
*Note: Client machine where YCSB is triggered is in UTC tz, and Cluster is in
PDT*
YCSB Insert fails after ~13 hours.
{code:java}
2024-03-18 17:31:23,529|INFO|Thread-92|machine.py:205 -
run()||GUID=a68008c1-78db-40f6-83f2-8c73a148a18f|2024-03-18 10:31:23:519 45480
sec: 840736 operations; 6.7 current ops/sec; est completion in 24 days 12 hours
[INSERT: Count=67, Max=18284543, Min=21808, Avg=1205365.73, 90=2609151,
99=5935103, 99.9=18284543, 99.99=18284543]
2024-03-18 17:31:28,966|INFO|Thread-92|machine.py:205 -
run()||GUID=a68008c1-78db-40f6-83f2-8c73a148a18f|Error inserting, not retrying
any more. number of attempts: 1Insertion Retry Limit: 0
2024-03-18 17:31:33,537|INFO|Thread-92|machine.py:205 -
run()||GUID=a68008c1-78db-40f6-83f2-8c73a148a18f|2024-03-18 10:31:33:519 45490
sec: 840788 operations; 5.2 current ops/sec; est completion in 24 days 12 hours
[CLEANUP: Count=2, Max=139, Min=10, Avg=74.5, 90=139, 99=139, 99.9=139,
99.99=139] [INSERT: Count=52, Max=5058559, Min=20768, Avg=1127204.46,
90=2635775, 99=4816895, 99.9=5058559, 99.99=5058559] [INSERT-FAILED: Count=1,
Max=38305791, Min=38273024, Avg=38289408, 90=38305791, 99=38305791,
99.9=38305791, 99.99=38305791]{code}
Checked the logs of Regionservers, almost all shows similar logs during the
failure time:
{code:java}
2024-03-18 10:31:10,199 INFO org.apache.hadoop.hbase.regionserver.HRegion:
Closed
table_ycsb1710695116,user8499,1710695130852.aa5c9ba2da2404b2eadae58e41e507e7.
2024-03-18 10:31:10,200 INFO org.apache.hadoop.hbase.regionserver.HRegion:
Closed
table_ycsb1710695116,user8399,1710695130852.bb1a196fb5cd18429103e4fcde2e1e40.
at org.apache.hadoop.hbase.ipc.IPCUtil.wrapException(IPCUtil.java:219)
at
org.apache.hadoop.hbase.ipc.AbstractRpcClient.onCallFinished(AbstractRpcClient.java:385)
at
org.apache.hadoop.hbase.ipc.AbstractRpcClient.access$100(AbstractRpcClient.java:88)
at
org.apache.hadoop.hbase.ipc.AbstractRpcClient$3.run(AbstractRpcClient.java:416)
at
org.apache.hadoop.hbase.ipc.AbstractRpcClient$3.run(AbstractRpcClient.java:412)
at org.apache.hadoop.hbase.ipc.Call.setTimeout(Call.java:107)
at
org.apache.hadoop.hbase.ipc.RpcConnection$1.run(RpcConnection.java:136)
at
org.apache.hbase.thirdparty.io.netty.util.HashedWheelTimer$HashedWheelTimeout.run(HashedWheelTimer.java:715)
at
org.apache.hbase.thirdparty.io.netty.util.concurrent.ImmediateExecutor.execute(ImmediateExecutor.java:34)
at
org.apache.hbase.thirdparty.io.netty.util.HashedWheelTimer$HashedWheelTimeout.expire(HashedWheelTimer.java:703)
at
org.apache.hbase.thirdparty.io.netty.util.HashedWheelTimer$HashedWheelBucket.expireTimeouts(HashedWheelTimer.java:790)
at
org.apache.hbase.thirdparty.io.netty.util.HashedWheelTimer$Worker.run(HashedWheelTimer.java:503)
2024-03-18 10:31:15,202 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl:
HBase metrics system stopped.
2024-03-18 10:31:15,703 INFO org.apache.hadoop.metrics2.impl.MetricsConfig:
Loaded properties from hadoop-metrics2.properties
2024-03-18 10:31:15,706 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl:
Scheduled Metric snapshot period at 10 second(s).
2024-03-18 10:31:15,706 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl:
HBase metrics system started
2024-03-18 10:31:19,676 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer: Failed report transition
server { host_name: "vc0140.example.com" port: 22101 start_code: 1710689975823
} transition { transition_code: OPENED region_info { region_id: 1710737533661
table_name { namespace: "default" qualifier: "table_ycsb1710737518" }
start_key: "user4399" end_key: "user4499" offline: false split: false
replica_id: 0 } open_seq_num: 9915 proc_id: 5757 }; retry (#0) immediately.
org.apache.hadoop.hbase.ipc.CallTimeoutException: Call to
address=vc0132.example.com/10.17.207.42:22001 failed on local exception:
org.apache.hadoop.hbase.ipc.CallTimeoutException:
Call[id=31282,methodName=ReportRegionStateTransition], waitTime=10002ms,
rpcTimeout=10000ms
at org.apache.hadoop.hbase.ipc.IPCUtil.wrapException(IPCUtil.java:219)
at
org.apache.hadoop.hbase.ipc.AbstractRpcClient.onCallFinished(AbstractRpcClient.java:385)
at
org.apache.hadoop.hbase.ipc.AbstractRpcClient.access$100(AbstractRpcClient.java:88)
at
org.apache.hadoop.hbase.ipc.AbstractRpcClient$3.run(AbstractRpcClient.java:416)
at
org.apache.hadoop.hbase.ipc.AbstractRpcClient$3.run(AbstractRpcClient.java:412)
at org.apache.hadoop.hbase.ipc.Call.setTimeout(Call.java:107)
at
org.apache.hadoop.hbase.ipc.RpcConnection$1.run(RpcConnection.java:136)
at
org.apache.hbase.thirdparty.io.netty.util.HashedWheelTimer$HashedWheelTimeout.run(HashedWheelTimer.java:715)
at
org.apache.hbase.thirdparty.io.netty.util.concurrent.ImmediateExecutor.execute(ImmediateExecutor.java:34)
at
org.apache.hbase.thirdparty.io.netty.util.HashedWheelTimer$HashedWheelTimeout.expire(HashedWheelTimer.java:703)
at
org.apache.hbase.thirdparty.io.netty.util.HashedWheelTimer$HashedWheelBucket.expireTimeouts(HashedWheelTimer.java:790)
at
org.apache.hbase.thirdparty.io.netty.util.HashedWheelTimer$Worker.run(HashedWheelTimer.java:503)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.hadoop.hbase.ipc.CallTimeoutException:
Call[id=31282,methodName=ReportRegionStateTransition], waitTime=10002ms,
rpcTimeout=10000ms
at
org.apache.hadoop.hbase.ipc.RpcConnection$1.run(RpcConnection.java:137)
... 6 more
2024-03-18 10:31:19,867 INFO
org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL: Slow sync cost: 2528
ms, current pipeline: []
2024-03-18 10:31:20,207 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer: Failed report transition
server { host_name: "vc0140.example.com" port: 22101 start_code: 1710689975823
} transition { transition_code: CLOSED region_info { region_id: 1710695130852
table_name { namespace: "default" qualifier: "table_ycsb1710695116" }
start_key: "user8399" end_key: "user8499" offline: false split: false
replica_id: 0 } proc_id: 5750 }; retry (#0) immediately.
org.apache.hadoop.hbase.ipc.CallTimeoutException: Call to
address=vc0132.example.com/10.17.207.42:22001 failed on local exception:
org.apache.hadoop.hbase.ipc.CallTimeoutException:
Call[id=31283,methodName=ReportRegionStateTransition], waitTime=10006ms,
rpcTimeout=10000ms
at org.apache.hadoop.hbase.ipc.IPCUtil.wrapException(IPCUtil.java:219)
at
org.apache.hadoop.hbase.ipc.AbstractRpcClient.onCallFinished(AbstractRpcClient.java:385)
at
org.apache.hadoop.hbase.ipc.AbstractRpcClient.access$100(AbstractRpcClient.java:88)
at
org.apache.hadoop.hbase.ipc.AbstractRpcClient$3.run(AbstractRpcClient.java:416)
at
org.apache.hadoop.hbase.ipc.AbstractRpcClient$3.run(AbstractRpcClient.java:412)
at org.apache.hadoop.hbase.ipc.Call.setTimeout(Call.java:107)
at
org.apache.hadoop.hbase.ipc.RpcConnection$1.run(RpcConnection.java:136)
at
org.apache.hbase.thirdparty.io.netty.util.HashedWheelTimer$HashedWheelTimeout.run(HashedWheelTimer.java:715)
at
org.apache.hbase.thirdparty.io.netty.util.concurrent.ImmediateExecutor.execute(ImmediateExecutor.java:34)
at
org.apache.hbase.thirdparty.io.netty.util.HashedWheelTimer$HashedWheelTimeout.expire(HashedWheelTimer.java:703)
at
org.apache.hbase.thirdparty.io.netty.util.HashedWheelTimer$HashedWheelBucket.expireTimeouts(HashedWheelTimer.java:790)
at
org.apache.hbase.thirdparty.io.netty.util.HashedWheelTimer$Worker.run(HashedWheelTimer.java:503)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.hadoop.hbase.ipc.CallTimeoutException:
Call[id=31283,methodName=ReportRegionStateTransition], waitTime=10006ms,
rpcTimeout=10000ms
at
org.apache.hadoop.hbase.ipc.RpcConnection$1.run(RpcConnection.java:137)
... 6 more {code}
Checked the OM Leader log around this time, can see the below error logs:
{code:java}
2024-03-18 10:40:28,166 ERROR [om163-OMStateMachineApplyTransactionThread -
0]-org.apache.hadoop.ozone.om.request.key.OMKeyDeleteRequestWithFSO: Key delete
failed. Volume:volhbase1, Bucket:buckethbase1,
Key:hbase/archive/data/default/table_ycsb1710695116/daa0f8300994990f671fc60c006e7bfd.
DIRECTORY_NOT_EMPTY org.apache.hadoop.ozone.om.exceptions.OMException:
Directory is not empty.
Key:hbase/archive/data/default/table_ycsb1710695116/daa0f8300994990f671fc60c006e7bfd
at
org.apache.hadoop.ozone.om.request.key.OMKeyDeleteRequestWithFSO.validateAndUpdateCache(OMKeyDeleteRequestWithFSO.java:137)
at
org.apache.hadoop.ozone.protocolPB.OzoneManagerRequestHandler.lambda$0(OzoneManagerRequestHandler.java:406)
at
org.apache.hadoop.util.MetricUtil.captureLatencyNs(MetricUtil.java:45)
at
org.apache.hadoop.ozone.protocolPB.OzoneManagerRequestHandler.handleWriteRequestImpl(OzoneManagerRequestHandler.java:404)
at
org.apache.hadoop.ozone.protocolPB.RequestHandler.handleWriteRequest(RequestHandler.java:63)
at
org.apache.hadoop.ozone.om.ratis.OzoneManagerStateMachine.runCommand(OzoneManagerStateMachine.java:525)
at
org.apache.hadoop.ozone.om.ratis.OzoneManagerStateMachine.lambda$1(OzoneManagerStateMachine.java:343)
at
java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1604)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
2024-03-18 10:40:33,261 ERROR [om163-OMStateMachineApplyTransactionThread -
0]-org.apache.hadoop.ozone.om.request.key.OMKeyDeleteRequestWithFSO: Key delete
failed. Volume:volhbase1, Bucket:buckethbase1,
Key:hbase/archive/data/default/table_ycsb1710695116.
DIRECTORY_NOT_EMPTY org.apache.hadoop.ozone.om.exceptions.OMException:
Directory is not empty. Key:hbase/archive/data/default/table_ycsb1710695116
at
org.apache.hadoop.ozone.om.request.key.OMKeyDeleteRequestWithFSO.validateAndUpdateCache(OMKeyDeleteRequestWithFSO.java:137)
at
org.apache.hadoop.ozone.protocolPB.OzoneManagerRequestHandler.lambda$0(OzoneManagerRequestHandler.java:406)
at
org.apache.hadoop.util.MetricUtil.captureLatencyNs(MetricUtil.java:45)
at
org.apache.hadoop.ozone.protocolPB.OzoneManagerRequestHandler.handleWriteRequestImpl(OzoneManagerRequestHandler.java:404)
at
org.apache.hadoop.ozone.protocolPB.RequestHandler.handleWriteRequest(RequestHandler.java:63)
at
org.apache.hadoop.ozone.om.ratis.OzoneManagerStateMachine.runCommand(OzoneManagerStateMachine.java:525)
at
org.apache.hadoop.ozone.om.ratis.OzoneManagerStateMachine.lambda$1(OzoneManagerStateMachine.java:343)
at
java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1604)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
2024-03-18 10:40:36,149 INFO [Socket Reader #1 for port
9862]-SecurityLogger.org.apache.hadoop.ipc.Server: Auth successful for
hue/[email protected] (auth:KERBEROS){code}
cc: [~weichiu]
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]