[ 
https://issues.apache.org/jira/browse/HBASE-20792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16523969#comment-16523969
 ] 

Josh Elser commented on HBASE-20792:
------------------------------------

Ok: while the upgrade on my cluster runs, I found that we had another 
reproduction of this last night on a build which did include HBASE-20752.

Here's the play-by-play:
 * hbase:namespace is on rs6
 * move request comes in to put hbase:namespace on rs5
 * hbase:namespace closes on rs6, opens on rs5
 * The above restart process begins
 * rs5 is killed, hbase:namespace moves back to rs6 (which is the only 
regionserver available at this point)
 * hbase:namespace transitions to OPEN on rs6 with seqId=74
 * Master is restarted
 * RegionStateStore reports that hbase:namespace is OPEN on rs5 with 
openSeqNum=74

{noformat}
2018-06-26 10:58:10,850 INFO  [master/ctr-e138-1518143905142-380046-01-000003:20000] assignment.RegionStateStore: Load hbase:meta entry region=fd488f7ed3f19beab5368769d9e95a75, regionState=OPEN, lastHost=ctr-e138-1518143905142-380046-01-000006.hwx.site,16020,1530007944484, regionLocation=ctr-e138-1518143905142-380046-01-000005.hwx.site,16020,1530007936525, openSeqNum=74
2018-06-26 10:58:10,850 DEBUG [master/ctr-e138-1518143905142-380046-01-000003:20000] assignment.RegionStates: setting location=ctr-e138-1518143905142-380046-01-000005.hwx.site,16020,1530007936525 for rit=OPEN, location=ctr-e138-1518143905142-380046-01-000005.hwx.site,16020,1530007936525, table=hbase:namespace, region=fd488f7ed3f19beab5368769d9e95a75 last loc=null
{noformat}
That last bit seems to be the majorly screwed-up part: lastHost is rs6 
(000006), but the master sets the location to rs5 (000005). I'm digging into 
how the master came back up and got into this state.
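
For anyone grepping their own master logs for the same symptom, here is a minimal sketch that flags a "Load hbase:meta entry" line where an OPEN region's lastHost and regionLocation disagree. The regex is an assumption based only on the single line quoted above, and {{find_mismatch}} is a hypothetical helper, not anything in HBase.

```python
import re

# Field order assumed from the one "Load hbase:meta entry" line quoted above.
LOAD_ENTRY = re.compile(
    r"region=(?P<region>\w+),\s*regionState=(?P<state>\w+),\s*"
    r"lastHost=(?P<last_host>\S+),\s*regionLocation=(?P<location>\S+),\s*"
    r"openSeqNum=(?P<seq>\d+)"
)


def find_mismatch(line):
    """Return parsed fields when an OPEN region's lastHost != regionLocation, else None."""
    # Collapse whitespace so log lines wrapped by a mail client still match.
    match = LOAD_ENTRY.search(" ".join(line.split()))
    if match is None:
        return None
    fields = match.groupdict()
    if fields["state"] == "OPEN" and fields["last_host"] != fields["location"]:
        return fields
    return None


# The message part of the INFO line from this comment.
SAMPLE = (
    "assignment.RegionStateStore: Load hbase:meta entry "
    "region=fd488f7ed3f19beab5368769d9e95a75, regionState=OPEN, "
    "lastHost=ctr-e138-1518143905142-380046-01-000006.hwx.site,16020,1530007944484, "
    "regionLocation=ctr-e138-1518143905142-380046-01-000005.hwx.site,16020,1530007936525, "
    "openSeqNum=74"
)

hit = find_mismatch(SAMPLE)
if hit:
    print("mismatch:", hit["last_host"], "vs", hit["location"])
```

Feeding it a whole master log line by line should work as long as each entry is on a single line; other HBase versions may print these fields differently.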

> info:servername and info:sn inconsistent for OPEN region
> --------------------------------------------------------
>
>                 Key: HBASE-20792
>                 URL: https://issues.apache.org/jira/browse/HBASE-20792
>             Project: HBase
>          Issue Type: Bug
>          Components: Region Assignment
>            Reporter: Josh Elser
>            Assignee: Josh Elser
>            Priority: Blocker
>             Fix For: 2.0.2
>
>
> Next problem we've run into after HBASE-20752 and HBASE-20708.
> After a rolling restart of a cluster, we'll see situations where a collection 
> of regions will simply not be assigned out to the RS. I was able to reproduce 
> this by mimicking the restart patterns our tests use internally (ignore whether 
> this is the best way to restart nodes for now :)). The general pattern is 
> this:
> {code:java}
> for rs in regionservers:
>   stop(server, rs, RS)
> for master in masters:
>   stop(server, master, MASTER)
> sleep(15)
> for master in masters:
>   start(server, master, MASTER)
> for rs in regionservers:
>   start(server, rs, RS){code}
> Looking at meta, we can see why the Master is ignoring some regions:
> {noformat}
>  test column=table:state, timestamp=1529871718998, value=\x08\x00
>  test,,1529871718122.0297f680df6dc0166a44f9536346268e. column=info:regioninfo, timestamp=1529967103390, value={ENCODED => 0297f680df6dc0166a44f9536346268e, NAME => 'test,,1529871718122.0297f680df6dc0166a44f9536346268e.', STARTKEY => '', ENDKEY => ''}
>  test,,1529871718122.0297f680df6dc0166a44f9536346268e. column=info:seqnumDuringOpen, timestamp=1529967103390, value=\x00\x00\x00\x00\x00\x00\x00*
>  test,,1529871718122.0297f680df6dc0166a44f9536346268e. column=info:server, timestamp=1529967103390, value=ctr-e138-1518143905142-378097-02-000012.hwx.site:16020
>  test,,1529871718122.0297f680df6dc0166a44f9536346268e. column=info:serverstartcode, timestamp=1529967103390, value=1529966776248
>  test,,1529871718122.0297f680df6dc0166a44f9536346268e. column=info:sn, timestamp=1529967096482, value=ctr-e138-1518143905142-378097-02-000006.hwx.site,16020,1529966755170
>  test,,1529871718122.0297f680df6dc0166a44f9536346268e. column=info:state, timestamp=1529967103390, value=OPEN{noformat}
> The region is marked as {{OPEN}}. The master doesn't know any better. 
> However, the interesting bit is that {{info:server}} and {{info:sn}} are 
> inconsistent (which, according to the javadoc, should not be possible for an 
> {{OPEN}} region).
> This doesn't happen every time, but I caught it yesterday on the 2nd or 3rd 
> attempt, so I'm hopeful it's not a bear to repro.
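
The inconsistency in the meta dump above can be checked mechanically. This is a rough sketch, assuming the row has already been pulled into a plain dict keyed by column name (a hypothetical layout, not an HBase client API), and assuming info:sn has the "host,port,startcode" shape seen in the dump. It rebuilds that string from info:server and info:serverstartcode and compares it to info:sn.

```python
def to_server_name(server, startcode):
    """Build 'host,port,startcode' from a 'host:port' string and a start code."""
    host, port = server.rsplit(":", 1)
    return f"{host},{port},{startcode}"


def is_consistent(row):
    """True unless the row is OPEN and info:server/info:serverstartcode disagree with info:sn."""
    if row.get("info:state") != "OPEN":
        return True
    expected = to_server_name(row["info:server"], row["info:serverstartcode"])
    return expected == row["info:sn"]


# Values copied from the meta dump in the issue description.
ROW = {
    "info:server": "ctr-e138-1518143905142-378097-02-000012.hwx.site:16020",
    "info:serverstartcode": "1529966776248",
    "info:sn": "ctr-e138-1518143905142-378097-02-000006.hwx.site,16020,1529966755170",
    "info:state": "OPEN",
}

print(is_consistent(ROW))
```

For the row above the two columns point at different hosts (000012 vs 000006), so the check reports the row as inconsistent.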



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
