Andrew Purtell created HBASE-21358:
--------------------------------------
Summary: Snapshot procedure fails but SnapshotManager thinks it is
still running
Key: HBASE-21358
URL: https://issues.apache.org/jira/browse/HBASE-21358
Project: HBase
Issue Type: Bug
Components: snapshots
Affects Versions: 1.3.2
Reporter: Andrew Purtell
A snapshot procedure fails due to a chaos test action, but the snapshot manager
still thinks it is running. The test client spins needlessly, checking for
something that will never complete. We give up eventually, but we could
fail this a lot faster.
On the integration client we are checking and re-checking:
2018-10-20 01:06:11,718 DEBUG [ChaosMonkeyThread] client.HBaseAdmin: Getting
current status of snapshot from master...
2018-10-20 01:06:11,719 DEBUG [ChaosMonkeyThread] client.HBaseAdmin: (#40)
Sleeping: 8571ms while waiting for snapshot completion.
This is what it looks like on the master side each time the client checks in:
2018-10-20 01:04:54,565 DEBUG
[RpcServer.FifoWFPBQ.default.handler=29,queue=2,port=8100]
master.MasterRpcServices: Checking to see if snapshot from request:{
ss=IntegrationTestBigLinkedList-it-1539997289258
table=IntegrationTestBigLinkedList type=FLUSH } is done
2018-10-20 01:04:54,565 DEBUG
[RpcServer.FifoWFPBQ.default.handler=29,queue=2,port=8100]
snapshot.SnapshotManager: Snapshoting '{
ss=IntegrationTestBigLinkedList-it-1539997289258
table=IntegrationTestBigLinkedList type=FLUSH }' is still in progress!
There is no running procedure for the snapshot. The procedure has failed. The
snapshot manager does not take any useful action afterward but believes the
snapshot to still be in progress.
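To illustrate the behavior I would expect instead, here is a minimal sketch of a status check that looks at the procedure's terminal state before reporting "in progress". This is not the actual SnapshotManager code: SnapshotTracker, ProcState, SnapshotFailedException, and the method names are all hypothetical, and the real fix may look quite different.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch only; none of these names exist in HBase.
final class SnapshotTracker {
    enum ProcState { RUNNING, SUCCEEDED, FAILED }

    // Unchecked so that callers polling in a loop fail immediately.
    static final class SnapshotFailedException extends RuntimeException {
        SnapshotFailedException(String msg) { super(msg); }
    }

    private final Map<String, ProcState> states = new ConcurrentHashMap<>();

    // Record that a snapshot procedure has been started.
    void start(String snapshot) {
        states.put(snapshot, ProcState.RUNNING);
    }

    // Record the terminal state when the procedure ends, success or not.
    void finish(String snapshot, boolean succeeded) {
        states.put(snapshot, succeeded ? ProcState.SUCCEEDED : ProcState.FAILED);
    }

    // Returns true when done, false while running, and throws once the
    // procedure has failed or is unknown, instead of reporting
    // "still in progress" forever.
    boolean isDone(String snapshot) {
        ProcState state = states.get(snapshot);
        if (state == null) {
            throw new SnapshotFailedException("No record of snapshot " + snapshot);
        }
        if (state == ProcState.FAILED) {
            states.remove(snapshot); // drop the stale entry
            throw new SnapshotFailedException("Snapshot " + snapshot + " failed");
        }
        return state == ProcState.SUCCEEDED;
    }
}
```

With something along these lines, the client's next status check after the failure would get an exception rather than sleeping and retrying until its own retry budget runs out.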
I see related complaints from the hfile archiver task afterward: empty
directories, failures to parse protobuf in descriptor files... It seems there
was junk left over in the filesystem from the failed snapshot. The master was
soon restarted by a chaos action, and now I don't see these complaints, so the
partially complete snapshot may have been cleaned up.
This is with 1.3.2, but patched to include the multithreaded hfile archiving
improvements from later versions.