[
https://issues.apache.org/jira/browse/HDDS-10550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sammi Chen reassigned HDDS-10550:
---------------------------------
Assignee: Sammi Chen
> [Hbase Ozone] YCSB Insert fails with "OMKeyDeleteRequestWithFSO: Key delete
> failed."
> ------------------------------------------------------------------------------------
>
> Key: HDDS-10550
> URL: https://issues.apache.org/jira/browse/HDDS-10550
> Project: Apache Ozone
> Issue Type: Bug
> Components: OM
> Reporter: Pratyush Bhatt
> Assignee: Sammi Chen
> Priority: Major
>
> *Note: The client machine where YCSB is triggered is in the UTC timezone, while the cluster is in PDT.*
> The YCSB insert workload fails after ~13 hours.
> {code:java}
> 2024-03-18 17:31:23,529|INFO|Thread-92|machine.py:205 - run()||GUID=a68008c1-78db-40f6-83f2-8c73a148a18f|2024-03-18 10:31:23:519 45480 sec: 840736 operations; 6.7 current ops/sec; est completion in 24 days 12 hours [INSERT: Count=67, Max=18284543, Min=21808, Avg=1205365.73, 90=2609151, 99=5935103, 99.9=18284543, 99.99=18284543]
> 2024-03-18 17:31:28,966|INFO|Thread-92|machine.py:205 - run()||GUID=a68008c1-78db-40f6-83f2-8c73a148a18f|Error inserting, not retrying any more. number of attempts: 1Insertion Retry Limit: 0
> 2024-03-18 17:31:33,537|INFO|Thread-92|machine.py:205 - run()||GUID=a68008c1-78db-40f6-83f2-8c73a148a18f|2024-03-18 10:31:33:519 45490 sec: 840788 operations; 5.2 current ops/sec; est completion in 24 days 12 hours [CLEANUP: Count=2, Max=139, Min=10, Avg=74.5, 90=139, 99=139, 99.9=139, 99.99=139] [INSERT: Count=52, Max=5058559, Min=20768, Avg=1127204.46, 90=2635775, 99=4816895, 99.9=5058559, 99.99=5058559] [INSERT-FAILED: Count=1, Max=38305791, Min=38273024, Avg=38289408, 90=38305791, 99=38305791, 99.9=38305791, 99.99=38305791]{code}
> Checked the RegionServer logs; almost all of them show similar entries around the failure time:
> {code:java}
> 2024-03-18 10:31:10,199 INFO org.apache.hadoop.hbase.regionserver.HRegion: Closed table_ycsb1710695116,user8499,1710695130852.aa5c9ba2da2404b2eadae58e41e507e7.
> 2024-03-18 10:31:10,200 INFO org.apache.hadoop.hbase.regionserver.HRegion: Closed table_ycsb1710695116,user8399,1710695130852.bb1a196fb5cd18429103e4fcde2e1e40.
>     at org.apache.hadoop.hbase.ipc.IPCUtil.wrapException(IPCUtil.java:219)
>     at org.apache.hadoop.hbase.ipc.AbstractRpcClient.onCallFinished(AbstractRpcClient.java:385)
>     at org.apache.hadoop.hbase.ipc.AbstractRpcClient.access$100(AbstractRpcClient.java:88)
>     at org.apache.hadoop.hbase.ipc.AbstractRpcClient$3.run(AbstractRpcClient.java:416)
>     at org.apache.hadoop.hbase.ipc.AbstractRpcClient$3.run(AbstractRpcClient.java:412)
>     at org.apache.hadoop.hbase.ipc.Call.setTimeout(Call.java:107)
>     at org.apache.hadoop.hbase.ipc.RpcConnection$1.run(RpcConnection.java:136)
>     at org.apache.hbase.thirdparty.io.netty.util.HashedWheelTimer$HashedWheelTimeout.run(HashedWheelTimer.java:715)
>     at org.apache.hbase.thirdparty.io.netty.util.concurrent.ImmediateExecutor.execute(ImmediateExecutor.java:34)
>     at org.apache.hbase.thirdparty.io.netty.util.HashedWheelTimer$HashedWheelTimeout.expire(HashedWheelTimer.java:703)
>     at org.apache.hbase.thirdparty.io.netty.util.HashedWheelTimer$HashedWheelBucket.expireTimeouts(HashedWheelTimer.java:790)
>     at org.apache.hbase.thirdparty.io.netty.util.HashedWheelTimer$Worker.run(HashedWheelTimer.java:503)
> 2024-03-18 10:31:15,202 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: HBase metrics system stopped.
> 2024-03-18 10:31:15,703 INFO org.apache.hadoop.metrics2.impl.MetricsConfig: Loaded properties from hadoop-metrics2.properties
> 2024-03-18 10:31:15,706 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Scheduled Metric snapshot period at 10 second(s).
> 2024-03-18 10:31:15,706 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: HBase metrics system started
> 2024-03-18 10:31:19,676 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Failed report transition server { host_name: "vc0140.example.com" port: 22101 start_code: 1710689975823 } transition { transition_code: OPENED region_info { region_id: 1710737533661 table_name { namespace: "default" qualifier: "table_ycsb1710737518" } start_key: "user4399" end_key: "user4499" offline: false split: false replica_id: 0 } open_seq_num: 9915 proc_id: 5757 }; retry (#0) immediately.
> org.apache.hadoop.hbase.ipc.CallTimeoutException: Call to address=vc0132.example.com/10.17.207.42:22001 failed on local exception: org.apache.hadoop.hbase.ipc.CallTimeoutException: Call[id=31282,methodName=ReportRegionStateTransition], waitTime=10002ms, rpcTimeout=10000ms
>     at org.apache.hadoop.hbase.ipc.IPCUtil.wrapException(IPCUtil.java:219)
>     at org.apache.hadoop.hbase.ipc.AbstractRpcClient.onCallFinished(AbstractRpcClient.java:385)
>     at org.apache.hadoop.hbase.ipc.AbstractRpcClient.access$100(AbstractRpcClient.java:88)
>     at org.apache.hadoop.hbase.ipc.AbstractRpcClient$3.run(AbstractRpcClient.java:416)
>     at org.apache.hadoop.hbase.ipc.AbstractRpcClient$3.run(AbstractRpcClient.java:412)
>     at org.apache.hadoop.hbase.ipc.Call.setTimeout(Call.java:107)
>     at org.apache.hadoop.hbase.ipc.RpcConnection$1.run(RpcConnection.java:136)
>     at org.apache.hbase.thirdparty.io.netty.util.HashedWheelTimer$HashedWheelTimeout.run(HashedWheelTimer.java:715)
>     at org.apache.hbase.thirdparty.io.netty.util.concurrent.ImmediateExecutor.execute(ImmediateExecutor.java:34)
>     at org.apache.hbase.thirdparty.io.netty.util.HashedWheelTimer$HashedWheelTimeout.expire(HashedWheelTimer.java:703)
>     at org.apache.hbase.thirdparty.io.netty.util.HashedWheelTimer$HashedWheelBucket.expireTimeouts(HashedWheelTimer.java:790)
>     at org.apache.hbase.thirdparty.io.netty.util.HashedWheelTimer$Worker.run(HashedWheelTimer.java:503)
>     at java.lang.Thread.run(Thread.java:748)
> Caused by: org.apache.hadoop.hbase.ipc.CallTimeoutException: Call[id=31282,methodName=ReportRegionStateTransition], waitTime=10002ms, rpcTimeout=10000ms
>     at org.apache.hadoop.hbase.ipc.RpcConnection$1.run(RpcConnection.java:137)
>     ... 6 more
> 2024-03-18 10:31:19,867 INFO org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL: Slow sync cost: 2528 ms, current pipeline: []
> 2024-03-18 10:31:20,207 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Failed report transition server { host_name: "vc0140.example.com" port: 22101 start_code: 1710689975823 } transition { transition_code: CLOSED region_info { region_id: 1710695130852 table_name { namespace: "default" qualifier: "table_ycsb1710695116" } start_key: "user8399" end_key: "user8499" offline: false split: false replica_id: 0 } proc_id: 5750 }; retry (#0) immediately.
> org.apache.hadoop.hbase.ipc.CallTimeoutException: Call to address=vc0132.example.com/10.17.207.42:22001 failed on local exception: org.apache.hadoop.hbase.ipc.CallTimeoutException: Call[id=31283,methodName=ReportRegionStateTransition], waitTime=10006ms, rpcTimeout=10000ms
>     at org.apache.hadoop.hbase.ipc.IPCUtil.wrapException(IPCUtil.java:219)
>     at org.apache.hadoop.hbase.ipc.AbstractRpcClient.onCallFinished(AbstractRpcClient.java:385)
>     at org.apache.hadoop.hbase.ipc.AbstractRpcClient.access$100(AbstractRpcClient.java:88)
>     at org.apache.hadoop.hbase.ipc.AbstractRpcClient$3.run(AbstractRpcClient.java:416)
>     at org.apache.hadoop.hbase.ipc.AbstractRpcClient$3.run(AbstractRpcClient.java:412)
>     at org.apache.hadoop.hbase.ipc.Call.setTimeout(Call.java:107)
>     at org.apache.hadoop.hbase.ipc.RpcConnection$1.run(RpcConnection.java:136)
>     at org.apache.hbase.thirdparty.io.netty.util.HashedWheelTimer$HashedWheelTimeout.run(HashedWheelTimer.java:715)
>     at org.apache.hbase.thirdparty.io.netty.util.concurrent.ImmediateExecutor.execute(ImmediateExecutor.java:34)
>     at org.apache.hbase.thirdparty.io.netty.util.HashedWheelTimer$HashedWheelTimeout.expire(HashedWheelTimer.java:703)
>     at org.apache.hbase.thirdparty.io.netty.util.HashedWheelTimer$HashedWheelBucket.expireTimeouts(HashedWheelTimer.java:790)
>     at org.apache.hbase.thirdparty.io.netty.util.HashedWheelTimer$Worker.run(HashedWheelTimer.java:503)
>     at java.lang.Thread.run(Thread.java:748)
> Caused by: org.apache.hadoop.hbase.ipc.CallTimeoutException: Call[id=31283,methodName=ReportRegionStateTransition], waitTime=10006ms, rpcTimeout=10000ms
>     at org.apache.hadoop.hbase.ipc.RpcConnection$1.run(RpcConnection.java:137)
>     ... 6 more {code}
> Checked the OM leader log around the same time; the following errors appear:
> {code:java}
> 2024-03-18 10:40:28,166 ERROR [om163-OMStateMachineApplyTransactionThread - 0]-org.apache.hadoop.ozone.om.request.key.OMKeyDeleteRequestWithFSO: Key delete failed. Volume:volhbase1, Bucket:buckethbase1, Key:hbase/archive/data/default/table_ycsb1710695116/daa0f8300994990f671fc60c006e7bfd. DIRECTORY_NOT_EMPTY
> org.apache.hadoop.ozone.om.exceptions.OMException: Directory is not empty. Key:hbase/archive/data/default/table_ycsb1710695116/daa0f8300994990f671fc60c006e7bfd
>     at org.apache.hadoop.ozone.om.request.key.OMKeyDeleteRequestWithFSO.validateAndUpdateCache(OMKeyDeleteRequestWithFSO.java:137)
>     at org.apache.hadoop.ozone.protocolPB.OzoneManagerRequestHandler.lambda$0(OzoneManagerRequestHandler.java:406)
>     at org.apache.hadoop.util.MetricUtil.captureLatencyNs(MetricUtil.java:45)
>     at org.apache.hadoop.ozone.protocolPB.OzoneManagerRequestHandler.handleWriteRequestImpl(OzoneManagerRequestHandler.java:404)
>     at org.apache.hadoop.ozone.protocolPB.RequestHandler.handleWriteRequest(RequestHandler.java:63)
>     at org.apache.hadoop.ozone.om.ratis.OzoneManagerStateMachine.runCommand(OzoneManagerStateMachine.java:525)
>     at org.apache.hadoop.ozone.om.ratis.OzoneManagerStateMachine.lambda$1(OzoneManagerStateMachine.java:343)
>     at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1604)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>     at java.lang.Thread.run(Thread.java:748)
> 2024-03-18 10:40:33,261 ERROR [om163-OMStateMachineApplyTransactionThread - 0]-org.apache.hadoop.ozone.om.request.key.OMKeyDeleteRequestWithFSO: Key delete failed. Volume:volhbase1, Bucket:buckethbase1, Key:hbase/archive/data/default/table_ycsb1710695116. DIRECTORY_NOT_EMPTY
> org.apache.hadoop.ozone.om.exceptions.OMException: Directory is not empty. Key:hbase/archive/data/default/table_ycsb1710695116
>     at org.apache.hadoop.ozone.om.request.key.OMKeyDeleteRequestWithFSO.validateAndUpdateCache(OMKeyDeleteRequestWithFSO.java:137)
>     at org.apache.hadoop.ozone.protocolPB.OzoneManagerRequestHandler.lambda$0(OzoneManagerRequestHandler.java:406)
>     at org.apache.hadoop.util.MetricUtil.captureLatencyNs(MetricUtil.java:45)
>     at org.apache.hadoop.ozone.protocolPB.OzoneManagerRequestHandler.handleWriteRequestImpl(OzoneManagerRequestHandler.java:404)
>     at org.apache.hadoop.ozone.protocolPB.RequestHandler.handleWriteRequest(RequestHandler.java:63)
>     at org.apache.hadoop.ozone.om.ratis.OzoneManagerStateMachine.runCommand(OzoneManagerStateMachine.java:525)
>     at org.apache.hadoop.ozone.om.ratis.OzoneManagerStateMachine.lambda$1(OzoneManagerStateMachine.java:343)
>     at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1604)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>     at java.lang.Thread.run(Thread.java:748)
> 2024-03-18 10:40:36,149 INFO [Socket Reader #1 for port 9862]-SecurityLogger.org.apache.hadoop.ipc.Server: Auth successful for hue/[email protected] (auth:KERBEROS){code}
> cc: [~weichiu]
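For context on the OM error above, here is a minimal, hypothetical Java sketch (not Ozone source; the class, method, and key names are all illustrative) of the non-recursive delete semantics in a file-system-optimized (FSO) bucket: deleting a directory key is rejected with DIRECTORY_NOT_EMPTY while any child key still exists, so a cleaner that deletes a directory before all of its children would keep failing, as seen here with the HBase archive directories.

```java
import java.util.Set;
import java.util.TreeSet;

// Hypothetical model of FSO delete semantics (illustrative only, not Ozone code).
public class FsoDeleteSketch {
    private final Set<String> keys = new TreeSet<>();

    public void put(String key) {
        keys.add(key);
    }

    // Non-recursive delete: a directory key may only be removed once no
    // child key remains under it, mirroring the DIRECTORY_NOT_EMPTY check.
    public void delete(String key) {
        String prefix = key + "/";
        for (String k : keys) {
            if (k.startsWith(prefix)) {
                throw new IllegalStateException("DIRECTORY_NOT_EMPTY: " + key);
            }
        }
        keys.remove(key);
    }

    public boolean isEmpty() {
        return keys.isEmpty();
    }

    public static void main(String[] args) {
        FsoDeleteSketch om = new FsoDeleteSketch();
        om.put("hbase/archive/data/default/t1");             // directory key
        om.put("hbase/archive/data/default/t1/r1/file1");    // child file key
        try {
            om.delete("hbase/archive/data/default/t1");      // child still present
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());              // DIRECTORY_NOT_EMPTY: ...
        }
        om.delete("hbase/archive/data/default/t1/r1/file1"); // delete the leaf first
        om.delete("hbase/archive/data/default/t1");          // now the directory is empty
        System.out.println("all deleted: " + om.isEmpty());
    }
}
```

If the deletes issued against the archive directory are non-recursive, each attempt fails until every child file has been removed first; deleting children before the directory succeeds, as the sketch shows.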
--
This message was sent by Atlassian Jira
(v8.20.10#820010)