stack created HBASE-21120:
-----------------------------
Summary: MoveRegionProcedure makes no progress; goes to STUCK
Key: HBASE-21120
URL: https://issues.apache.org/jira/browse/HBASE-21120
Project: HBase
Issue Type: Bug
Components: amv2
Reporter: stack
This is tip of branch-2.0. I'm running cluster tests. I've seen this twice now
where a slowDeterministic schedules a move procedure just after startup and we
go dead-in-the-water never going beyond schedule of the unassign.We even
schedule subsequent moves but they go no where either.
{code}
...
2018-08-26 10:19:30,352 INFO [PEWorker-14] procedure2.ProcedureExecutor:
Finished pid=286, ppid=260, state=SUCCESS; MoveRegionProcedure
hri=fb79d389cf324aaf3ad4b686d3ad21d5,
source=ve0530.halxg.cloudera.com,16020,1535303391139,
destination=ve0530.halxg.cloudera.com,16020,1535303391139 in 10.8100sec
...
// MASTER KILLED
....
2018-08-26 10:21:10,585 INFO [Thread-19] assignment.RegionStateStore: Load
hbase:meta entry region=fb79d389cf324aaf3ad4b686d3ad21d5, regionState=OPEN,
lastHost=ve0530.halxg.cloudera.com,16020,1535303391139,
regionLocation=ve0530.halxg.cloudera.com,16020,1535303391139, openSeqNum=42095
...
2018-08-26 10:26:15,531 INFO
[RpcServer.default.FPBQ.Fifo.handler=46,queue=1,port=16000] master.HMaster:
Client=stack//10.17.240.20 move hri=fb79d389cf324aaf3ad4b686d3ad21d5,
source=ve0530.halxg.cloudera.com,16020,1535303391139,
destination=ve0534.halxg.cloudera.com,16020,1535303812180, running balancer
2018-08-26 10:26:16,635 INFO [PEWorker-13] procedure.MasterProcedureScheduler:
pid=413, state=RUNNABLE:MOVE_REGION_UNASSIGN; MoveRegionProcedure
hri=fb79d389cf324aaf3ad4b686d3ad21d5,
source=ve0530.halxg.cloudera.com,16020,1535303391139,
destination=ve0534.halxg.cloudera.com,16020,1535303812180 checking lock on
fb79d389cf324aaf3ad4b686d3ad21d5
2018-08-26 10:26:16,896 INFO [PEWorker-14] procedure2.ProcedureExecutor:
Finished pid=405, state=SUCCESS; ServerCrashProcedure
server=ve0540.halxg.cloudera.com,16020,1535303391120, splitWal=true, meta=true
in 25.8450sec
2018-08-26 10:26:16,896 INFO [PEWorker-13] procedure.MasterProcedureScheduler:
pid=414, ppid=413, state=RUNNABLE:REGION_TRANSITION_QUEUE; UnassignProcedure
table=IntegrationTestBigLinkedList, region=fb79d389cf324aaf3ad4b686d3ad21d5,
server=ve0530.halxg.cloudera.com,16020,1535303391139 checking lock on
fb79d389cf324aaf3ad4b686d3ad21d5
2018-08-26 10:26:29,308 INFO
[RpcServer.default.FPBQ.Fifo.handler=0,queue=0,port=16000]
master.ServerManager: Registering
regionserver=ve0540.halxg.cloudera.com,16020,1535304386779
2018-08-26
10:26:29,342 INFO [RegionServerTracker-0] master.RegionServerTracker:
RegionServer ephemeral node created, adding
[ve0540.halxg.cloudera.com,16020,1535304386779]
2018-08-26
10:27:03,820 INFO [ReadOnlyZKClient-ve0524.halxg.cloudera.com:2222@0x1da09819]
zookeeper.ZooKeeper: Session: 0x16577368c840054 closed
2018-08-26 10:27:03,820 INFO
[ReadOnlyZKClient-ve0524.halxg.cloudera.com:2222@0x1da09819-EventThread]
zookeeper.ClientCnxn: EventThread shut down for session: 0x16577368c840054
2018-08-26 10:27:16,411 INFO
[RpcServer.default.FPBQ.Fifo.handler=0,queue=0,port=16000] master.HMaster:
Client=stack//10.17.240.20 move hri=fb79d389cf324aaf3ad4b686d3ad21d5,
source=ve0530.halxg.cloudera.com,16020,1535303391139,
destination=ve0534.halxg.cloudera.com,16020,1535303812180, running balancer
2018-08-26
10:27:17,001 INFO [PEWorker-13] procedure.MasterProcedureScheduler: pid=415,
state=RUNNABLE:MOVE_REGION_UNASSIGN; MoveRegionProcedure
hri=fb79d389cf324aaf3ad4b686d3ad21d5, source=ve0530.halxg
.cloudera.com,16020,1535303391139,
destination=ve0534.halxg.cloudera.com,16020,1535303812180 checking lock on
fb79d389cf324aaf3ad4b686d3ad21d5
...
{code}
Looking in UI, it says the below which has NO mention of pid=414 ... like it
disappeared!!!
{code}
413 WAITING stack MoveRegionProcedure
hri=fb79d389cf324aaf3ad4b686d3ad21d5,
source=ve0530.halxg.cloudera.com,16020,1535303391139,
destination=ve0534.halxg.cloudera.com,16020,1535303812180 Sun Aug 26
10:26:15 PDT 2018 Sun Aug 26 10:26:16 PDT 2018 [ { state => [
'1', '2' ] }, { regionId => '1535303851257', tableName => { namespace =>
'ZGVmYXVsdA==', qualifier => 'SW50ZWdyYXRpb25UZXN0QmlnTGlua2VkTGlzdA==' },
startKey => 'mGGOysI+5tUhxwXKql8zRg==', endKey => 'nnnnnnnnnnA=', offline =>
'false', split => 'false', replicaId => '0' }, { sourceServer => { hostName =>
've0530.halxg.cloudera.com', port => '16020', startCode => '1535303391139' },
destinationServer => { hostName => 've0534.halxg.cloudera.com', port =>
'16020', startCode => '1535303812180' } } ]
415 RUNNABLE stack MoveRegionProcedure
hri=fb79d389cf324aaf3ad4b686d3ad21d5,
source=ve0530.halxg.cloudera.com,16020,1535303391139,
destination=ve0534.halxg.cloudera.com,16020,1535303812180 Sun Aug 26
10:27:16 PDT 2018 Sun Aug 26 10:27:16 PDT 2018 [ {}, { regionId
=> '1535303851257', tableName => { namespace => 'ZGVmYXVsdA==', qualifier =>
'SW50ZWdyYXRpb25UZXN0QmlnTGlua2VkTGlzdA==' }, startKey =>
'mGGOysI+5tUhxwXKql8zRg==', endKey => 'nnnnnnnnnnA=', offline => 'false', split
=> 'false', replicaId => '0' }, { sourceServer => { hostName =>
've0530.halxg.cloudera.com', port => '16020', startCode => '1535303391139' },
destinationServer => { hostName => 've0534.halxg.cloudera.com', port =>
'16020', startCode => '1535303812180' } } ]
{code}
Looking at locks in UI, it says that pid=413 has the exclusive lock on this
region...
{code}
REGION: fb79d389cf324aaf3ad4b686d3ad21d5
Lock type: EXCLUSIVE
Owner procedure: { ID => '413', PARENT_ID => '-1', STATE => 'WAITING', OWNER =>
'stack', TYPE => 'MoveRegionProcedure hri=fb79d389cf324aaf3ad4b686d3ad21d5,
source=ve0530.halxg.cloudera.com,16020,1535303391139,
destination=ve0534.halxg.cloudera.com,16020,1535303812180', START_TIME => 'Sun
Aug 26 10:26:15 PDT 2018', LAST_UPDATE => 'Sun Aug 26 10:26:16 PDT 2018',
PARAMETERS => [ { state => [ '1', '2' ] }, { regionId => '1535303851257',
tableName => { namespace => 'ZGVmYXVsdA==', qualifier =>
'SW50ZWdyYXRpb25UZXN0QmlnTGlua2VkTGlzdA==' }, startKey =>
'mGGOysI+5tUhxwXKql8zRg==', endKey => 'nnnnnnnnnnA=', offline => 'false', split
=> 'false', replicaId => '0' }, { sourceServer => { hostName =>
've0530.halxg.cloudera.com', port => '16020', startCode => '1535303391139' },
destinationServer => { hostName => 've0534.halxg.cloudera.com', port =>
'16020', startCode => '1535303812180' } } ] }
{code}
Is it that the child is not picking up the lock? Let me look at the WALs to
see what story with pid=414 is....
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)