Pratyush Bhatt created HDDS-10536:
-------------------------------------
Summary: [Hbase Ozone] BLOCK_TOKEN_VERIFICATION failed while long
running YCSB
Key: HDDS-10536
URL: https://issues.apache.org/jira/browse/HDDS-10536
Project: Apache Ozone
Issue Type: Bug
Components: OM
Reporter: Pratyush Bhatt
YCSB fails after about 4 days (the command was triggered around _2024-03-12
17:49:28_ and failed around {_}2024-03-16 07:43:39{_}) with [INSERT-FAILED],
the underlying cause being a _BLOCK_TOKEN_VERIFICATION_ failure.
_Error on the client side:_
{code:java}
2024-03-16 07:43:29,233|INFO|Thread-36|machine.py:205 -
run()||GUID=16aa1f97-ce35-4680-82a3-b8ac527277b2|2024-03-16 00:43:29:224 309240
sec: 7052507 operations; 8.9 current ops/sec; est completion in 16 days 17
hours [INSERT: Count=89, Max=2588671, Min=17104, Avg=404562.07, 90=2418687,
99=2545663, 99.9=2588671, 99.99=2588671]
2024-03-16 07:43:33,361|INFO|Thread-36|machine.py:205 -
run()||GUID=16aa1f97-ce35-4680-82a3-b8ac527277b2|Error inserting, not retrying
any more. number of attempts: 1Insertion Retry Limit: 0
2024-03-16 07:43:39,245|INFO|Thread-36|machine.py:205 -
run()||GUID=16aa1f97-ce35-4680-82a3-b8ac527277b2|2024-03-16 00:43:39:224 309250
sec: 7052548 operations; 4.1 current ops/sec; est completion in 16 days 17
hours [CLEANUP: Count=2, Max=268, Min=35, Avg=151.5, 90=268, 99=268, 99.9=268,
99.99=268] [INSERT: Count=41, Max=28426239, Min=24736, Avg=1247846.24,
90=2533375, 99=28426239, 99.9=28426239, 99.99=28426239] [INSERT-FAILED:
Count=1, Max=38436863, Min=38404096, Avg=38420480, 90=38436863, 99=38436863,
99.9=38436863, 99.99=38436863]
2024-03-16 07:46:22,355|INFO|Thread-37|test_lrt_hbase_ozone.py:165 -
open_files_helper()|Sleeping for 5 minutes. {code}
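Note that the client timestamps above are in UTC while the HBase Master logs below are in PDT. A quick conversion check (a standalone sketch using java.time; the timestamps are taken from the logs above) confirms the two log lines refer to the same window:

```java
import java.time.ZonedDateTime;
import java.time.ZoneId;
import java.time.ZoneOffset;

// Sanity check: the UTC client-side timestamp should line up with the
// PDT timestamp seen in the HBase Master logs.
public class TzCheck {
    public static void main(String[] args) {
        // Client-side failure time from the YCSB log (QE pod, UTC)
        ZonedDateTime clientUtc =
            ZonedDateTime.of(2024, 3, 16, 7, 43, 29, 0, ZoneOffset.UTC);
        // Convert to the cluster's timezone (PDT, i.e. UTC-7 in March 2024)
        ZonedDateTime inPdt =
            clientUtc.withZoneSameInstant(ZoneId.of("America/Los_Angeles"));
        System.out.println(inPdt.toLocalDateTime()); // 2024-03-16T00:43:29
    }
}
```

This matches the 2024-03-16 00:40:53 window in the Master log below.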
Checked the HBase Master logs for the same time window (the cluster is in the
PDT timezone, while the QE pod is in UTC):
{code:java}
2024-03-16 00:40:53,855 ERROR org.apache.hadoop.hdds.scm.XceiverClientGrpc:
Failed to execute command ReadChunk on the pipeline Pipeline[ Id:
781daaaf-6183-45fd-9f1b-f0be538247be, Nodes:
0da8ff95-6d92-4956-83db-dee85e97488e(vc0121.example.com/10.17.207.31)a19eff5a-6feb-417b-b433-62a5f5af80bc(vc0127.example.com/10.17.207.37)b9d66e9e-8c2b-49ff-bf07-3fd3ea146a7f(vc0123.example.com/10.17.207.33),
ReplicationConfig: STANDALONE/THREE, State:OPEN,
leaderId:0da8ff95-6d92-4956-83db-dee85e97488e,
CreationTimestamp2024-03-12T06:12:35.741-07:00[America/Los_Angeles]].
2024-03-16 00:40:53,855 WARN
org.apache.hadoop.hdds.scm.storage.ContainerProtocolCalls: Failed to read chunk
113750153625707608_chunk_0 (len=12266) conID: 3060 locID: 113750153625707608
bcsId: 2782589 from
0da8ff95-6d92-4956-83db-dee85e97488e(vc0121.example.com/10.17.207.31); will try
another datanode.
org.apache.hadoop.hdds.scm.container.common.helpers.StorageContainerException:
BLOCK_TOKEN_VERIFICATION_FAILED for null: Expired token for user: hbase
(auth:SIMPLE)
at
org.apache.hadoop.hdds.scm.storage.ContainerProtocolCalls.validateContainerResponse(ContainerProtocolCalls.java:675)
at
org.apache.hadoop.hdds.scm.storage.ContainerProtocolCalls.lambda$createValidators$4(ContainerProtocolCalls.java:686)
at
org.apache.hadoop.hdds.scm.XceiverClientGrpc.sendCommandWithRetry(XceiverClientGrpc.java:400)
at
org.apache.hadoop.hdds.scm.XceiverClientGrpc.lambda$sendCommandWithTraceIDAndRetry$0(XceiverClientGrpc.java:340)
at
org.apache.hadoop.hdds.tracing.TracingUtil.executeInSpan(TracingUtil.java:159)
at
org.apache.hadoop.hdds.tracing.TracingUtil.executeInNewSpan(TracingUtil.java:149)
at
org.apache.hadoop.hdds.scm.XceiverClientGrpc.sendCommandWithTraceIDAndRetry(XceiverClientGrpc.java:335)
at
org.apache.hadoop.hdds.scm.XceiverClientGrpc.sendCommand(XceiverClientGrpc.java:316)
at
org.apache.hadoop.hdds.scm.storage.ContainerProtocolCalls.readChunk(ContainerProtocolCalls.java:358)
at
org.apache.hadoop.hdds.scm.storage.ContainerProtocolCalls.lambda$readChunk$2(ContainerProtocolCalls.java:345)
at
org.apache.hadoop.hdds.scm.storage.ContainerProtocolCalls.tryEachDatanode(ContainerProtocolCalls.java:147)
at
org.apache.hadoop.hdds.scm.storage.ContainerProtocolCalls.readChunk(ContainerProtocolCalls.java:344)
at
org.apache.hadoop.hdds.scm.storage.ChunkInputStream.readChunk(ChunkInputStream.java:425)
at
org.apache.hadoop.hdds.scm.storage.ChunkInputStream.readChunkDataIntoBuffers(ChunkInputStream.java:402)
at
org.apache.hadoop.hdds.scm.storage.ChunkInputStream.readChunkFromContainer(ChunkInputStream.java:387)
at
org.apache.hadoop.hdds.scm.storage.ChunkInputStream.prepareRead(ChunkInputStream.java:319)
at
org.apache.hadoop.hdds.scm.storage.ChunkInputStream.read(ChunkInputStream.java:173)
at
org.apache.hadoop.hdds.scm.storage.ByteArrayReader.readFromBlock(ByteArrayReader.java:54)
at
org.apache.hadoop.hdds.scm.storage.BlockInputStream.readWithStrategy(BlockInputStream.java:367)
at
org.apache.hadoop.hdds.scm.storage.ExtendedInputStream.read(ExtendedInputStream.java:56)
at
org.apache.hadoop.hdds.scm.storage.ByteArrayReader.readFromBlock(ByteArrayReader.java:54)
at
org.apache.hadoop.hdds.scm.storage.MultipartInputStream.readWithStrategy(MultipartInputStream.java:96)
at
org.apache.hadoop.hdds.scm.storage.ExtendedInputStream.read(ExtendedInputStream.java:56)
at
org.apache.hadoop.fs.ozone.OzoneFSInputStream.read(OzoneFSInputStream.java:64)
at org.apache.hadoop.fs.FSInputStream.read(FSInputStream.java:78)
at
org.apache.hadoop.fs.FSDataInputStream.read(FSDataInputStream.java:97)
at
org.apache.hadoop.hbase.io.util.BlockIOUtils.preadWithExtra(BlockIOUtils.java:233)
at
org.apache.hadoop.hbase.io.hfile.HFileBlock$FSReaderImpl.readAtOffset(HFileBlock.java:1494)
at
org.apache.hadoop.hbase.io.hfile.HFileBlock$FSReaderImpl.readBlockDataInternal(HFileBlock.java:1717)
at
org.apache.hadoop.hbase.io.hfile.HFileBlock$FSReaderImpl.readBlockData(HFileBlock.java:1528)
at
org.apache.hadoop.hbase.io.hfile.HFileReaderImpl.readBlock(HFileReaderImpl.java:1322)
at
org.apache.hadoop.hbase.io.hfile.HFileReaderImpl.readBlock(HFileReaderImpl.java:1242)
at
org.apache.hadoop.hbase.io.hfile.HFileBlockIndex$CellBasedKeyBlockIndexReader.loadDataBlockWithScanInfo(HFileBlockIndex.java:318)
at
org.apache.hadoop.hbase.io.hfile.HFileReaderImpl$HFileScannerImpl.seekTo(HFileReaderImpl.java:664)
at
org.apache.hadoop.hbase.io.hfile.HFileReaderImpl$HFileScannerImpl.seekTo(HFileReaderImpl.java:617)
{code}
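Conceptually, the datanode-side check that produces this error compares the block token's expiry timestamp against the current time. A minimal sketch of that logic (class and method names here are illustrative, not Ozone's actual API; the 1-day token lifetime is an assumption for illustration):

```java
import java.time.Instant;

// Hypothetical sketch of a datanode-style block token expiry check.
// Ozone's real verifier lives in the HDDS security code; names here
// are illustrative only.
public class BlockTokenCheck {
    static void verify(Instant tokenExpiry, Instant now, String user) {
        if (now.isAfter(tokenExpiry)) {
            throw new SecurityException(
                "BLOCK_TOKEN_VERIFICATION_FAILED: Expired token for user: "
                    + user);
        }
    }

    public static void main(String[] args) {
        // Pipeline creation time from the log: 2024-03-12T06:12:35-07:00,
        // i.e. 13:12:35 UTC; assume a 1-day token lifetime for illustration.
        Instant issued = Instant.parse("2024-03-12T13:12:35Z");
        Instant expiry = issued.plusSeconds(24 * 3600);
        // Time of the failed ReadChunk, from the client log (UTC)
        Instant readTime = Instant.parse("2024-03-16T07:40:53Z");
        try {
            verify(expiry, readTime, "hbase");
        } catch (SecurityException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

The key point is that a read issued days after the token was minted will fail this check regardless of whether the Kerberos ticket itself is still valid, which is what the klist output below suggests happened here.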
The Kerberos ticket itself was still valid on the host. As mentioned before, in
the cluster's krb5.conf the ticket lifetime is set to 1d and the renew lifetime
to 8d, and our background thread renews every 8 hours. Working back from the
renew-until limit, the original ticket would have been obtained at 03/20/24
10:49:28 minus 8 days, i.e. _03/12/24 10:49:28_, so _kinit -R_ appears to have
happened about 12 times:
{code:java}
[root@vc0122 ~]# klist
Ticket cache: FILE:/tmp/krb5cc_0
Default principal: [email protected]
Valid starting     Expires            Service principal
03/16/24 10:49:28  03/17/24 10:49:28  krbtgt/[email protected]
	renew until 03/20/24 10:49:28 {code}
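The "about 12 times" figure follows from the renewal arithmetic above; as a back-of-the-envelope check (timestamps taken from the klist output, renewal interval from our background thread's 8-hour schedule):

```java
import java.time.Duration;
import java.time.ZonedDateTime;
import java.time.ZoneId;

// Renew-until limit is start + 8d, so the original ticket dates to
// 03/12/24 10:49:28 PDT; the latest "Valid starting" is 03/16/24 10:49:28.
// With one kinit -R every 8 hours, that span holds 96h / 8h = 12 renewals.
public class RenewCount {
    public static void main(String[] args) {
        ZoneId pdt = ZoneId.of("America/Los_Angeles");
        ZonedDateTime obtained =
            ZonedDateTime.of(2024, 3, 12, 10, 49, 28, 0, pdt);
        ZonedDateTime latestStart =
            ZonedDateTime.of(2024, 3, 16, 10, 49, 28, 0, pdt);
        long renewals =
            Duration.between(obtained, latestStart).toHours() / 8;
        System.out.println(renewals); // 12
    }
}
```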
cc: [~weichiu]
--
This message was sent by Atlassian Jira
(v8.20.10#820010)