[
https://issues.apache.org/jira/browse/HBASE-20792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16524493#comment-16524493
]
Josh Elser commented on HBASE-20792:
------------------------------------
Hey [~Apache9], this is a branch-2.0 build from a few days ago with your
HBASE-20708 backported onto it. I didn't realize that change hadn't landed on
branch-2.0 already (was there a reason for that?). As best I could tell, it
seems to have helped :)
{quote}Could you please find the log in
RegionStateStore.updateUserRegionLocation for the broken region? It is
something like this:
{quote}
Hah, funny you should ask. This is where I'm currently poking. What I believe
is happening is that when the SCP->AssignProc for this region goes to write
the OPENING state, we don't actually update {{info:sn}} like the code implies
it should. Note that some of the logging below is extra logging I added while
hacking on things.
{noformat}
2018-06-27 02:14:34,803 TRACE [PEWorker-15] assignment.AssignProcedure: Update pid=18, ppid=17, state=RUNNABLE:REGION_TRANSITION_DISPATCH; AssignProcedure table=hbase:namespace, region=837a84143c4cd17952282464dfdcfd55; rit=OFFLINE, location=ctr-e138-1518143905142-380753-01-000008.hwx.site,16020,1530065530163
2018-06-27 02:14:34,804 INFO [PEWorker-15] assignment.RegionStateStore: pid=18 updating hbase:meta row=837a84143c4cd17952282464dfdcfd55, regionState=OPENING
2018-06-27 02:14:34,912 INFO [PEWorker-15] assignment.RegionTransitionProcedure: Dispatch pid=18, ppid=17, state=RUNNABLE:REGION_TRANSITION_DISPATCH; AssignProcedure table=hbase:namespace, region=837a84143c4cd17952282464dfdcfd55; rit=OPENING, location=ctr-e138-1518143905142-380753-01-000008.hwx.site,16020,1530065530163
2018-06-27 02:14:35,148 TRACE [RpcServer.priority.FPBQ.Fifo.handler=19,queue=1,port=16000] assignment.AssignmentManager: Update region transition serverName=ctr-e138-1518143905142-380753-01-000008.hwx.site,16020,1530065530163 region=rit=OPENING, location=ctr-e138-1518143905142-380753-01-000008.hwx.site,16020,1530065530163, table=hbase:namespace, region=837a84143c4cd17952282464dfdcfd55 regionState=OPENED
2018-06-27 02:14:35,149 DEBUG [RpcServer.priority.FPBQ.Fifo.handler=19,queue=1,port=16000] assignment.RegionTransitionProcedure: Received report OPENED seqId=16, pid=18, ppid=17, state=RUNNABLE:REGION_TRANSITION_DISPATCH; AssignProcedure table=hbase:namespace, region=837a84143c4cd17952282464dfdcfd55; rit=OPENING, location=ctr-e138-1518143905142-380753-01-000008.hwx.site,16020,1530065530163
2018-06-27 02:14:35,149 DEBUG [PEWorker-1] assignment.RegionTransitionProcedure: Finishing pid=18, ppid=17, state=RUNNABLE:REGION_TRANSITION_FINISH; AssignProcedure table=hbase:namespace, region=837a84143c4cd17952282464dfdcfd55; rit=OPENING, location=ctr-e138-1518143905142-380753-01-000008.hwx.site,16020,1530065530163
2018-06-27 02:14:35,149 DEBUG [PEWorker-1] assignment.RegionStateStore: openSeqNum=16, adding location of ctr-e138-1518143905142-380753-01-000008.hwx.site,16020,1530065530163 for 837a84143c4cd17952282464dfdcfd55
2018-06-27 02:14:35,150 INFO [PEWorker-1] assignment.RegionStateStore: pid=18 updating hbase:meta row=837a84143c4cd17952282464dfdcfd55, regionState=OPEN, openSeqNum=16, regionLocation=ctr-e138-1518143905142-380753-01-000008.hwx.site,16020,1530065530163{noformat}
Let me try to summarize what I think is happening, for you and everyone else.
Consider one region "A" and two RS, "rs1" and "rs2". The end result is that "A"
is left unassigned by HBase but marked as OPEN in meta:
* "A" is on "rs2"
* move "A" to "rs1"
* kill "rs1"
* SCP runs for "rs1"
** AP/RegionTransitionProcedure runs for "A", OFFLINE'ing it and then assigning
it to "rs2"
** {{info:sn}} is never updated with the OPENING state, but this is OK since
the region does actually OPEN on "rs2"
* kill "rs2"
* restart master
* Master doesn't assign "A" because it sees {{info:state=OPEN}},
{{info:sn=rs1}}, {{info:server=rs2}}.
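
The sequence above can be replayed with a toy model of the three meta columns involved. This is only a sketch of my hypothesis, not HBase code: {{MetaRow}}, {{mark_opening}}, {{mark_open}}, {{is_consistent}} and {{replay}} are made-up names, and the assumption baked in is that the OPENING update is supposed to write {{info:sn}} while the OPEN update writes {{info:server}} but leaves {{info:sn}} alone.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MetaRow:
    """Toy stand-in for one region's row in hbase:meta (hypothetical)."""
    state: str = "OFFLINE"        # models info:state
    sn: Optional[str] = None      # models info:sn
    server: Optional[str] = None  # models info:server


def mark_opening(row: MetaRow, target: str, write_sn: bool = True) -> None:
    """Model the OPENING update; write_sn=False models the suspected bug."""
    row.state = "OPENING"
    if write_sn:
        row.sn = target


def mark_open(row: MetaRow, target: str) -> None:
    """Model the OPEN update (assumed to write info:server but not info:sn)."""
    row.state = "OPEN"
    row.server = target


def is_consistent(row: MetaRow) -> bool:
    """An OPEN region should have info:sn and info:server agree."""
    return row.state != "OPEN" or row.sn == row.server


def replay(write_sn_on_scp_assign: bool) -> MetaRow:
    """Replay the steps from the list above against the toy row."""
    row = MetaRow(state="OPEN", sn="rs2", server="rs2")  # "A" is on "rs2"
    mark_opening(row, "rs1")                              # move "A" to "rs1"
    mark_open(row, "rs1")
    # kill "rs1": SCP -> AssignProcedure reassigns "A" to "rs2"
    mark_opening(row, "rs2", write_sn=write_sn_on_scp_assign)
    mark_open(row, "rs2")
    return row


if __name__ == "__main__":
    for flag in (False, True):
        row = replay(write_sn_on_scp_assign=flag)
        print(flag, row, "consistent" if is_consistent(row) else "INCONSISTENT")
```

With {{write_sn_on_scp_assign=False}} the row ends up as state=OPEN, sn=rs1, server=rs2, which is the combination described in the last bullet. With the flag set to True the two columns agree.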
> info:servername and info:sn inconsistent for OPEN region
> --------------------------------------------------------
>
> Key: HBASE-20792
> URL: https://issues.apache.org/jira/browse/HBASE-20792
> Project: HBase
> Issue Type: Bug
> Components: Region Assignment
> Reporter: Josh Elser
> Assignee: Josh Elser
> Priority: Blocker
> Fix For: 2.0.2
>
>
> Next problem we've run into after HBASE-20752 and HBASE-20708.
> After a rolling restart of a cluster, we'll see situations where a collection
> of regions will simply not be assigned out to the RS. I was able to reproduce
> this by mimicking the restart pattern our tests use internally (ignore whether
> this is the best way to restart nodes for now :)). The general pattern is
> this:
> {code:java}
> for rs in regionservers:
>   stop(server, rs, RS)
> for master in masters:
>   stop(server, master, MASTER)
> sleep(15)
> for master in masters:
>   start(server, master, MASTER)
> for rs in regionservers:
>   start(server, rs, RS){code}
> Looking at meta, we can see why the Master is ignoring some regions:
> {noformat}
> test column=table:state, timestamp=1529871718998, value=\x08\x00
> test,,1529871718122.0297f680df6dc0166a44f9536346268e. column=info:regioninfo, timestamp=1529967103390, value={ENCODED => 0297f680df6dc0166a44f9536346268e, NAME => 'test,,1529871718122.0297f680df6dc0166a44f9536346268e.', STARTKEY => '', ENDKEY => ''}
> test,,1529871718122.0297f680df6dc0166a44f9536346268e. column=info:seqnumDuringOpen, timestamp=1529967103390, value=\x00\x00\x00\x00\x00\x00\x00*
> test,,1529871718122.0297f680df6dc0166a44f9536346268e. column=info:server, timestamp=1529967103390, value=ctr-e138-1518143905142-378097-02-000012.hwx.site:16020
> test,,1529871718122.0297f680df6dc0166a44f9536346268e. column=info:serverstartcode, timestamp=1529967103390, value=1529966776248
> test,,1529871718122.0297f680df6dc0166a44f9536346268e. column=info:sn, timestamp=1529967096482, value=ctr-e138-1518143905142-378097-02-000006.hwx.site,16020,1529966755170
> test,,1529871718122.0297f680df6dc0166a44f9536346268e. column=info:state, timestamp=1529967103390, value=OPEN{noformat}
> The region is marked as {{OPEN}}. The master doesn't know any better.
> However, the interesting bit is that {{info:server}} and {{info:sn}} are
> inconsistent (which, according to the javadoc, should not be possible for an
> {{OPEN}} region).
> This doesn't happen every time, but I caught it yesterday on the 2nd or 3rd
> attempt, so I'm hopeful it's not a bear to repro.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)