[ https://issues.apache.org/jira/browse/HBASE-20792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16524188#comment-16524188 ]
Josh Elser commented on HBASE-20792:
------------------------------------
Repro'ed it again. Was worried there was something more complicated that I was
missing.
I modified my restart logic to wait a bit before stopping the last RS. This
doesn't cause it every time, but I now have back-to-back "good", then "bad"
logs which is helpful.
{code:java}
for rs in regionservers[:-1]:
    stop(server, rs, RS)
# make sure the master has time to get everything on that last RS
time.sleep(15)
stop(server, regionservers[-1], RS)
for master in masters:
    stop(server, master, MASTER)
for master in masters:
    start(server, master, MASTER)
for rs in regionservers:
    start(server, rs, RS){code}
{noformat}
hbase(main):007:0> scan 'hbase:meta', {STARTROW=>'hbase:namespace', STOPROW=>'hbase:o'}
ROW COLUMN+CELL
 hbase:namespace column=table:state, timestamp=1530043805582, value=\x08\x00
 hbase:namespace,,1530043803815.11724acb879200aa8ff0aaeef8c624e5. column=info:regioninfo, timestamp=1530044902910, value={ENCODED => 11724acb879200aa8ff0aaeef8c624e5, NAME => 'hbase:namespace,,1530043803815.11724acb879200aa8ff0aaeef8c624e5.', STARTKEY => '', ENDKEY => ''}
 hbase:namespace,,1530043803815.11724acb879200aa8ff0aaeef8c624e5. column=info:seqnumDuringOpen, timestamp=1530044902910, value=\x00\x00\x00\x00\x00\x00\x00\x16
 hbase:namespace,,1530043803815.11724acb879200aa8ff0aaeef8c624e5. column=info:server, timestamp=1530044902910, value=ctr-e138-1518143905142-380753-01-000008.hwx.site:16020
 hbase:namespace,,1530043803815.11724acb879200aa8ff0aaeef8c624e5. column=info:serverstartcode, timestamp=1530044902910, value=1530044412656
 hbase:namespace,,1530043803815.11724acb879200aa8ff0aaeef8c624e5. column=info:sn, timestamp=1530044827438, value=ctr-e138-1518143905142-380753-01-000007.hwx.site,16020,1530044401376
 hbase:namespace,,1530043803815.11724acb879200aa8ff0aaeef8c624e5. column=info:state, timestamp=1530044902910, value=OPEN
2 row(s)
Took 0.1913 seconds{noformat}
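For anyone checking their own meta for this, here is a rough sketch of the comparison, using the values from the scan above. It assumes info:sn uses the usual "host,port,startcode" ServerName form and that info:server plus info:serverstartcode should rebuild the same string for an OPEN region. The helper names (expected_sn, is_consistent) and the dict layout are made up for illustration, not anything from HBase itself.

```python
def expected_sn(server, startcode):
    """Build a 'host,port,startcode' string from info:server ('host:port')
    and info:serverstartcode. Hypothetical helper for illustration."""
    host, port = server.rsplit(":", 1)
    return f"{host},{port},{startcode}"


def is_consistent(row):
    """Return False when an OPEN region's info:sn disagrees with
    info:server + info:serverstartcode. Non-OPEN regions are not checked."""
    if row.get("info:state") != "OPEN":
        return True
    return row.get("info:sn") == expected_sn(
        row["info:server"], row["info:serverstartcode"]
    )


# Column values copied from the hbase:namespace region in the scan output.
row = {
    "info:server": "ctr-e138-1518143905142-380753-01-000008.hwx.site:16020",
    "info:serverstartcode": "1530044412656",
    "info:sn": "ctr-e138-1518143905142-380753-01-000007.hwx.site,16020,1530044401376",
    "info:state": "OPEN",
}

print(is_consistent(row))  # prints False: info:sn points at ...000007, info:server at ...000008
```

This only compares strings that have already been pulled out of meta; it does not talk to a cluster.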
> info:servername and info:sn inconsistent for OPEN region
> --------------------------------------------------------
>
> Key: HBASE-20792
> URL: https://issues.apache.org/jira/browse/HBASE-20792
> Project: HBase
> Issue Type: Bug
> Components: Region Assignment
> Reporter: Josh Elser
> Assignee: Josh Elser
> Priority: Blocker
> Fix For: 2.0.2
>
>
> Next problem we've run into after HBASE-20752 and HBASE-20708.
> After a rolling restart of a cluster, we'll see situations where a collection
> of regions will simply not be assigned out to the RS. I was able to reproduce
> this by mimicking the restart patterns our tests do internally (ignore whether
> this is the best way to restart nodes for now :)). The general pattern is
> this:
> {code:java}
> for rs in regionservers:
>     stop(server, rs, RS)
> for master in masters:
>     stop(server, master, MASTER)
> sleep(15)
> for master in masters:
>     start(server, master, MASTER)
> for rs in regionservers:
>     start(server, rs, RS){code}
> Looking at meta, we can see why the Master is ignoring some regions:
> {noformat}
>  test column=table:state, timestamp=1529871718998, value=\x08\x00
>  test,,1529871718122.0297f680df6dc0166a44f9536346268e. column=info:regioninfo, timestamp=1529967103390, value={ENCODED => 0297f680df6dc0166a44f9536346268e, NAME => 'test,,1529871718122.0297f680df6dc0166a44f9536346268e.', STARTKEY => '', ENDKEY => ''}
>  test,,1529871718122.0297f680df6dc0166a44f9536346268e. column=info:seqnumDuringOpen, timestamp=1529967103390, value=\x00\x00\x00\x00\x00\x00\x00*
>  test,,1529871718122.0297f680df6dc0166a44f9536346268e. column=info:server, timestamp=1529967103390, value=ctr-e138-1518143905142-378097-02-000012.hwx.site:16020
>  test,,1529871718122.0297f680df6dc0166a44f9536346268e. column=info:serverstartcode, timestamp=1529967103390, value=1529966776248
>  test,,1529871718122.0297f680df6dc0166a44f9536346268e. column=info:sn, timestamp=1529967096482, value=ctr-e138-1518143905142-378097-02-000006.hwx.site,16020,1529966755170
>  test,,1529871718122.0297f680df6dc0166a44f9536346268e. column=info:state, timestamp=1529967103390, value=OPEN{noformat}
> The region is marked as {{OPEN}}. The master doesn't know any better.
> However, the interesting bit is that {{info:server}} and {{info:sn}} are
> inconsistent (which, according to the javadoc, should not be possible for an
> {{OPEN}} region).
> This doesn't happen every time, but I caught it yesterday on the 2nd or 3rd
> attempt, so I'm hopeful it's not a bear to repro.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)