[
https://issues.apache.org/jira/browse/HBASE-20792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16524696#comment-16524696
]
Duo Zhang commented on HBASE-20792:
-----------------------------------
I rewrote the UT a bit so that it fails every time. Just kill the master
before killing RS2 and then restart the cluster; the cluster will hang on
initializing TableNamespaceManager.
I also added a log statement in updateUserRegionLocation:
{code}
LOG.info("==============" + regionInfo + ", " + state + ", " +
regionLocation + ", " +
lastHost + ", " + openSeqNum + ", pid");
{code}
{noformat}
2018-06-27 15:06:26,169 INFO [PEWorker-9] assignment.RegionStateStore(159):
=============={ENCODED => fc56c15d5e34ba87d787b232b056c69f, NAME =>
'hbase:namespace,,1530083185741.fc56c15d5e34ba87d787b232b056c69f.', STARTKEY =>
'', ENDKEY => ''}, OPENING, zhangduo-ubuntu,36761,1530083181798, null, -1, pid
2018-06-27 15:06:26,354 INFO [PEWorker-10] assignment.RegionStateStore(159):
=============={ENCODED => fc56c15d5e34ba87d787b232b056c69f, NAME =>
'hbase:namespace,,1530083185741.fc56c15d5e34ba87d787b232b056c69f.', STARTKEY =>
'', ENDKEY => ''}, OPEN, zhangduo-ubuntu,36761,1530083181798, null, 2, pid
2018-06-27 15:06:26,888 INFO [PEWorker-14] assignment.RegionStateStore(159):
=============={ENCODED => fc56c15d5e34ba87d787b232b056c69f, NAME =>
'hbase:namespace,,1530083185741.fc56c15d5e34ba87d787b232b056c69f.', STARTKEY =>
'', ENDKEY => ''}, CLOSING, zhangduo-ubuntu,36761,1530083181798, null, -1, pid
2018-06-27 15:06:27,169 INFO [PEWorker-16] assignment.RegionStateStore(159):
=============={ENCODED => fc56c15d5e34ba87d787b232b056c69f, NAME =>
'hbase:namespace,,1530083185741.fc56c15d5e34ba87d787b232b056c69f.', STARTKEY =>
'', ENDKEY => ''}, CLOSED, null, zhangduo-ubuntu,36761,1530083181798, -1, pid
2018-06-27 15:06:27,327 INFO [PEWorker-3] assignment.RegionStateStore(159):
=============={ENCODED => fc56c15d5e34ba87d787b232b056c69f, NAME =>
'hbase:namespace,,1530083185741.fc56c15d5e34ba87d787b232b056c69f.', STARTKEY =>
'', ENDKEY => ''}, OPENING, zhangduo-ubuntu,36131,1530083181909,
zhangduo-ubuntu,36761,1530083181798, -1, pid
2018-06-27 15:06:27,498 INFO [PEWorker-4] assignment.RegionStateStore(159):
=============={ENCODED => fc56c15d5e34ba87d787b232b056c69f, NAME =>
'hbase:namespace,,1530083185741.fc56c15d5e34ba87d787b232b056c69f.', STARTKEY =>
'', ENDKEY => ''}, OPEN, zhangduo-ubuntu,36131,1530083181909,
zhangduo-ubuntu,36761,1530083181798, 10, pid
2018-06-27 15:06:27,675 INFO [PEWorker-6] assignment.RegionStateStore(159):
=============={ENCODED => fc56c15d5e34ba87d787b232b056c69f, NAME =>
'hbase:namespace,,1530083185741.fc56c15d5e34ba87d787b232b056c69f.', STARTKEY =>
'', ENDKEY => ''}, CLOSING, zhangduo-ubuntu,36131,1530083181909,
zhangduo-ubuntu,36761,1530083181798, -1, pid
2018-06-27 15:06:27,849 INFO [PEWorker-8] assignment.RegionStateStore(159):
=============={ENCODED => fc56c15d5e34ba87d787b232b056c69f, NAME =>
'hbase:namespace,,1530083185741.fc56c15d5e34ba87d787b232b056c69f.', STARTKEY =>
'', ENDKEY => ''}, CLOSED, null, zhangduo-ubuntu,36131,1530083181909, -1, pid
2018-06-27 15:06:28,007 INFO [PEWorker-11] assignment.RegionStateStore(159):
=============={ENCODED => fc56c15d5e34ba87d787b232b056c69f, NAME =>
'hbase:namespace,,1530083185741.fc56c15d5e34ba87d787b232b056c69f.', STARTKEY =>
'', ENDKEY => ''}, OPENING, zhangduo-ubuntu,36761,1530083181798,
zhangduo-ubuntu,36131,1530083181909, -1, pid
2018-06-27 15:06:28,177 INFO [PEWorker-12] assignment.RegionStateStore(159):
=============={ENCODED => fc56c15d5e34ba87d787b232b056c69f, NAME =>
'hbase:namespace,,1530083185741.fc56c15d5e34ba87d787b232b056c69f.', STARTKEY =>
'', ENDKEY => ''}, OPEN, zhangduo-ubuntu,36761,1530083181798,
zhangduo-ubuntu,36131,1530083181909, 13, pid
2018-06-27 15:06:29,401 INFO [PEWorker-8] assignment.RegionStateStore(159):
=============={ENCODED => fc56c15d5e34ba87d787b232b056c69f, NAME =>
'hbase:namespace,,1530083185741.fc56c15d5e34ba87d787b232b056c69f.', STARTKEY =>
'', ENDKEY => ''}, OPENING, zhangduo-ubuntu,36131,1530083181909,
zhangduo-ubuntu,36131,1530083181909, -1, pid
2018-06-27 15:06:29,573 INFO [PEWorker-10] assignment.RegionStateStore(159):
=============={ENCODED => fc56c15d5e34ba87d787b232b056c69f, NAME =>
'hbase:namespace,,1530083185741.fc56c15d5e34ba87d787b232b056c69f.', STARTKEY =>
'', ENDKEY => ''}, OPEN, zhangduo-ubuntu,36131,1530083181909,
zhangduo-ubuntu,36131,1530083181909, 13, pid
{noformat}
Just as I expected, the reason is that we do not set lastHost in SCP
(ServerCrashProcedure). So when we mark the region as OPENING in SCP, we find
that lastHost is the same as the region location and skip the update, which
finally causes the inconsistency between 'sn' and 'servername'; 'servername'
holds the correct value. But when restarting, we read the region location from
'sn' instead of 'servername', and since the region is in OPEN state, we get
stuck there.
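The sequence can be replayed with a small stand-alone model. This is only a sketch under stated assumptions, not the actual HBase code: the class `StaleLastHostSketch`, its methods, and the use of a plain map as a stand-in for one hbase:meta row are all hypothetical. It assumes that an OPEN report with a non-negative openSeqNum writes the server column, and that any other state writes 'sn' only when the region location differs from lastHost. For simplicity the model stores the full server name in `info:server`, whereas real meta stores host:port there.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

// Hypothetical model of the meta update described above; not HBase code.
class StaleLastHostSketch {
  // Server names taken from the log output in this comment.
  static final String RS1 = "zhangduo-ubuntu,36761,1530083181798";
  static final String RS2 = "zhangduo-ubuntu,36131,1530083181909";

  // Assumed update rule: OPEN (openSeqNum >= 0) writes the server column,
  // any other state writes 'sn' only if the location differs from lastHost.
  static void update(Map<String, String> row, String state, String location,
      String lastHost, long openSeqNum) {
    if (openSeqNum >= 0) {
      row.put("info:server", location);
    } else if (location != null && !location.equals(lastHost)) {
      row.put("info:sn", location);
    }
    row.put("info:state", state);
  }

  // Replays the last four log entries (15:06:28,007 through 15:06:29,573).
  static Map<String, String> replay() {
    Map<String, String> row = new HashMap<>();
    update(row, "OPENING", RS1, RS2, -1); // 'sn' is set to RS1
    update(row, "OPEN", RS1, RS2, 13);    // server column is set to RS1
    // lastHost was not updated, so it equals the new location: 'sn' write is skipped.
    update(row, "OPENING", RS2, RS2, -1);
    update(row, "OPEN", RS2, RS2, 13);    // server column is set to RS2
    return row;
  }

  // True when 'sn' and the server column point at the same server.
  static boolean locationsAgree(Map<String, String> row) {
    return Objects.equals(row.get("info:sn"), row.get("info:server"));
  }

  public static void main(String[] args) {
    Map<String, String> row = replay();
    System.out.println("state=" + row.get("info:state"));
    System.out.println("sn=" + row.get("info:sn"));
    System.out.println("server=" + row.get("info:server"));
    System.out.println("consistent=" + locationsAgree(row));
  }
}
```

In this model the row ends up in OPEN state with 'sn' still pointing at the old server, which is the state a restarting master would then read the location from.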
Let me prepare a fix.
> info:servername and info:sn inconsistent for OPEN region
> --------------------------------------------------------
>
> Key: HBASE-20792
> URL: https://issues.apache.org/jira/browse/HBASE-20792
> Project: HBase
> Issue Type: Bug
> Components: Region Assignment
> Reporter: Josh Elser
> Assignee: Josh Elser
> Priority: Blocker
> Fix For: 2.0.2
>
> Attachments: TestRegionMoveAndAbandon.java,
> hbase-hbase-master-ctr-e138-1518143905142-380753-01-000004.hwx.site.log
>
>
> Next problem we've run into after HBASE-20752 and HBASE-20708.
> After a rolling restart of a cluster, we'll see situations where a collection
> of regions will simply not be assigned out to the RS. I was able to reproduce
> this by mimicking the restart patterns our tests use internally (ignore
> whether this is the best way to restart nodes for now :)). The general
> pattern is this:
> {code:java}
> for rs in regionservers:
>   stop(server, rs, RS)
> for master in masters:
>   stop(server, master, MASTER)
> sleep(15)
> for master in masters:
>   start(server, master, MASTER)
> for rs in regionservers:
>   start(server, rs, RS){code}
> Looking at meta, we can see why the Master is ignoring some regions:
> {noformat}
> test
> column=table:state, timestamp=1529871718998, value=\x08\x00
> test,,1529871718122.0297f680df6dc0166a44f9536346268e.
> column=info:regioninfo, timestamp=1529967103390, value={ENCODED =>
> 0297f680df6dc0166a44f9536346268e, NAME =>
> 'test,,1529871718122.0297f680df6dc0166a44f9536346268e.', STARTKEY
> => '', ENDKEY =>
> ''}
> test,,1529871718122.0297f680df6dc0166a44f9536346268e.
> column=info:seqnumDuringOpen, timestamp=1529967103390,
> value=\x00\x00\x00\x00\x00\x00\x00*
> test,,1529871718122.0297f680df6dc0166a44f9536346268e.
> column=info:server, timestamp=1529967103390,
> value=ctr-e138-1518143905142-378097-02-000012.hwx.site:16020
> test,,1529871718122.0297f680df6dc0166a44f9536346268e.
> column=info:serverstartcode, timestamp=1529967103390, value=1529966776248
> test,,1529871718122.0297f680df6dc0166a44f9536346268e. column=info:sn,
> timestamp=1529967096482,
> value=ctr-e138-1518143905142-378097-02-000006.hwx.site,16020,1529966755170
> test,,1529871718122.0297f680df6dc0166a44f9536346268e.
> column=info:state, timestamp=1529967103390, value=OPEN{noformat}
> The region is marked as {{OPEN}}. The master doesn't know any better.
> However, the interesting bit is that {{info:server}} and {{info:sn}} are
> inconsistent (which, according to the javadoc, should not be possible for an
> {{OPEN}} region).
> This doesn't happen every time, but I caught it yesterday on the 2nd or 3rd
> attempt, so I'm hopeful it's not a bear to repro.