[
https://issues.apache.org/jira/browse/HBASE-20792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16526702#comment-16526702
]
Josh Elser commented on HBASE-20792:
------------------------------------
>From what I can see in logs, we had a cycle where the
>ReopenTableRegionsProcedure kept re-running
>REOPEN_TABLE_REGIONS_CONFIRMED_REOPENED:
{noformat}
2018-06-27 17:16:25,270 INFO [PEWorker-16] procedure2.ProcedureExecutor:
Finished subprocedure(s) of pid=15024, ppid=15023,
state=RUNNABLE:REOPEN_TABLE_REGIONS_CONFIRM_REOPENED;
ReopenTableRegionsProcedure table=table_izljd; resume parent processing.
2018-06-27 17:16:25,270 INFO [PEWorker-16] procedure2.ProcedureExecutor:
Finished pid=320739, ppid=15024, state=SUCCESS; MoveRegionProcedure
hri=523007f77f96474d01d74ed3d048e173,
source=ctr-e138-1518143905142-381863-01-000002.hwx.site,16020,1530113920688,
destination=ctr-e138-1518143905142-381863-01-000002.hwx.site,16020,1530113920688
in 497msec
2018-06-27 17:16:25,279 INFO [PEWorker-2] procedure2.ProcedureExecutor:
Initialized subprocedures=[{pid=320742, ppid=15024,
state=RUNNABLE:MOVE_REGION_UNASSIGN; MoveRegionProcedure
hri=523007f77f96474d01d74ed3d048e173,
source=ctr-e138-1518143905142-381863-01-000002.hwx.site,16020,1530113920688,
destination=ctr-e138-1518143905142-381863-01-000002.hwx.site,16020,1530113920688}]
2018-06-27 17:16:25,291 INFO [PEWorker-2] procedure.MasterProcedureScheduler:
pid=320742, ppid=15024, state=RUNNABLE:MOVE_REGION_UNASSIGN;
MoveRegionProcedure hri=523007f77f96474d01d74ed3d048e173,
source=ctr-e138-1518143905142-381863-01-000002.hwx.site,16020,1530113920688,
destination=ctr-e138-1518143905142-381863-01-000002.hwx.site,16020,1530113920688
checking lock on 523007f77f96474d01d74ed3d048e173
2018-06-27 17:16:25,291 INFO [PEWorker-2] procedure2.ProcedureExecutor:
Initialized subprocedures=[{pid=320743, ppid=320742,
state=RUNNABLE:REGION_TRANSITION_DISPATCH; UnassignProcedure table=table_izljd,
region=523007f77f96474d01d74ed3d048e173,
server=ctr-e138-1518143905142-381863-01-000002.hwx.site,16020,1530113920688}]
2018-06-27 17:16:25,292 INFO [PEWorker-2] procedure.MasterProcedureScheduler:
pid=320743, ppid=320742, state=RUNNABLE:REGION_TRANSITION_DISPATCH;
UnassignProcedure table=table_izljd, region=523007f77f96474d01d74ed3d048e173,
server=ctr-e138-1518143905142-381863-01-000002.hwx.site,16020,1530113920688
checking lock on 523007f77f96474d01d74ed3d048e173
2018-06-27 17:16:25,292 INFO [PEWorker-2] assignment.RegionStateStore:
pid=320743 updating hbase:meta row=523007f77f96474d01d74ed3d048e173,
regionState=CLOSING
2018-06-27 17:16:25,295 INFO [PEWorker-2]
assignment.RegionTransitionProcedure: Dispatch pid=320743, ppid=320742,
state=RUNNABLE:REGION_TRANSITION_DISPATCH; UnassignProcedure table=table_izljd,
region=523007f77f96474d01d74ed3d048e173,
server=ctr-e138-1518143905142-381863-01-000002.hwx.site,16020,1530113920688;
rit=CLOSING,
location=ctr-e138-1518143905142-381863-01-000002.hwx.site,16020,1530113920688
2018-06-27 17:16:25,447 INFO [PEWorker-15] assignment.RegionStateStore:
pid=320743 updating hbase:meta row=523007f77f96474d01d74ed3d048e173,
regionState=CLOSED
2018-06-27 17:16:25,471 INFO [PEWorker-15] procedure2.ProcedureExecutor:
Finished subprocedure(s) of pid=320742, ppid=15024,
state=RUNNABLE:MOVE_REGION_ASSIGN; MoveRegionProcedure
hri=523007f77f96474d01d74ed3d048e173,
source=ctr-e138-1518143905142-381863-01-000002.hwx.site,16020,1530113920688,
destination=ctr-e138-1518143905142-381863-01-000002.hwx.site,16020,1530113920688;
resume parent processing.
2018-06-27 17:16:25,471 INFO [PEWorker-7] procedure2.ProcedureExecutor:
Initialized subprocedures=[{pid=320744, ppid=320742,
state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure table=table_izljd,
region=523007f77f96474d01d74ed3d048e173,
target=ctr-e138-1518143905142-381863-01-000002.hwx.site,16020,1530113920688}]
2018-06-27 17:16:25,471 INFO [PEWorker-15] procedure2.ProcedureExecutor:
Finished pid=320743, ppid=320742, state=SUCCESS; UnassignProcedure
table=table_izljd, region=523007f77f96474d01d74ed3d048e173,
server=ctr-e138-1518143905142-381863-01-000002.hwx.site,16020,1530113920688 in
159msec
2018-06-27 17:16:25,574 INFO [PEWorker-7] procedure.MasterProcedureScheduler:
pid=320744, ppid=320742, state=RUNNABLE:REGION_TRANSITION_QUEUE;
AssignProcedure table=table_izljd, region=523007f77f96474d01d74ed3d048e173,
target=ctr-e138-1518143905142-381863-01-000002.hwx.site,16020,1530113920688
checking lock on 523007f77f96474d01d74ed3d048e173
2018-06-27 17:16:25,576 INFO [PEWorker-7] assignment.AssignProcedure: Starting
pid=320744, ppid=320742, state=RUNNABLE:REGION_TRANSITION_QUEUE;
AssignProcedure table=table_izljd, region=523007f77f96474d01d74ed3d048e173,
target=ctr-e138-1518143905142-381863-01-000002.hwx.site,16020,1530113920688;
rit=OFFLINE,
location=ctr-e138-1518143905142-381863-01-000002.hwx.site,16020,1530113920688;
forceNewPlan=false, retain=false target
svr=ctr-e138-1518143905142-381863-01-000002.hwx.site,16020,1530113920688
2018-06-27 17:16:25,726 INFO
[master/ctr-e138-1518143905142-381863-01-000008:20000]
balancer.BaseLoadBalancer: Reassigned 1 regions. 1 retained the pre-restart
assignment.
2018-06-27 17:16:25,726 INFO [PEWorker-6] assignment.RegionStateStore:
pid=320744 updating hbase:meta row=523007f77f96474d01d74ed3d048e173,
regionState=OPENING
2018-06-27 17:16:25,729 INFO [PEWorker-6]
assignment.RegionTransitionProcedure: Dispatch pid=320744, ppid=320742,
state=RUNNABLE:REGION_TRANSITION_DISPATCH; AssignProcedure table=table_izljd,
region=523007f77f96474d01d74ed3d048e173,
target=ctr-e138-1518143905142-381863-01-000002.hwx.site,16020,1530113920688;
rit=OPENING,
location=ctr-e138-1518143905142-381863-01-000002.hwx.site,16020,1530113920688
2018-06-27 17:16:25,894 INFO [PEWorker-12] assignment.RegionStateStore:
pid=320744 updating hbase:meta row=523007f77f96474d01d74ed3d048e173,
regionState=OPEN, openSeqNum=2,
regionLocation=ctr-e138-1518143905142-381863-01-000002.hwx.site,16020,1530113920688
2018-06-27 17:16:25,900 INFO [PEWorker-12] procedure2.ProcedureExecutor:
Finished subprocedure(s) of pid=320742, ppid=15024, state=RUNNABLE;
MoveRegionProcedure hri=523007f77f96474d01d74ed3d048e173,
source=ctr-e138-1518143905142-381863-01-000002.hwx.site,16020,1530113920688,
destination=ctr-e138-1518143905142-381863-01-000002.hwx.site,16020,1530113920688;
resume parent processing.
2018-06-27 17:16:25,900 INFO [PEWorker-12] procedure2.ProcedureExecutor:
Finished pid=320744, ppid=320742, state=SUCCESS; AssignProcedure
table=table_izljd, region=523007f77f96474d01d74ed3d048e173,
target=ctr-e138-1518143905142-381863-01-000002.hwx.site,16020,1530113920688 in
425msec
{noformat}
I can see in the master log that pid=15023 keeps getting to
REOPEN_TABLE_REGIONS_CONFIRM_REOPENED, executes the Unassign and Assign of this
region on rs02, resumes pid=15023 just to do the same thing. We had another
manifestation of this come across last night.
> info:servername and info:sn inconsistent for OPEN region
> --------------------------------------------------------
>
> Key: HBASE-20792
> URL: https://issues.apache.org/jira/browse/HBASE-20792
> Project: HBase
> Issue Type: Bug
> Components: Region Assignment
> Reporter: Josh Elser
> Assignee: Josh Elser
> Priority: Blocker
> Fix For: 3.0.0, 2.1.0, 2.0.2, 2.2.0
>
> Attachments: HBASE-20792.patch, TestRegionMoveAndAbandon.java,
> hbase-hbase-master-ctr-e138-1518143905142-380753-01-000004.hwx.site.log
>
>
> Next problem we've run into after HBASE-20752 and HBASE-20708
> After a rolling restart of a cluster, we'll see situations where a collection
> of regions will simply not be assigned out to the RS. I was able to reproduce
> this my mimic the restart patterns our tests do internally (ignore whether
> this is the best way to restart nodes for now :)). The general pattern is
> this:
> {code:java}
> for rs in regionservers:
> stop(server, rs, RS)
> for master in masters:
> stop(server, master, MASTER)
> sleep(15)
> for master in masters:
> start(server, master, MASTER)
> for rs in regionservers:
> start(server, rs, RS){code}
> Looking at meta, we can see why the Master is ignoring some regions:
> {noformat}
> test
> column=table:state, timestamp=1529871718998, value=\x08\x00
> test,,1529871718122.0297f680df6dc0166a44f9536346268e.
> column=info:regioninfo, timestamp=1529967103390, value={ENCODED =>
> 0297f680df6dc0166a44f9536346268e, NAME =>
> 'test,,1529871718122.0297f680df6dc0166a44f9536346268e.', STARTKEY
> => '', ENDKEY =>
> ''}
> test,,1529871718122.0297f680df6dc0166a44f9536346268e.
> column=info:seqnumDuringOpen, timestamp=1529967103390,
> value=\x00\x00\x00\x00\x00\x00\x00*
> test,,1529871718122.0297f680df6dc0166a44f9536346268e.
> column=info:server, timestamp=1529967103390,
> value=ctr-e138-1518143905142-378097-02-000012.hwx.site:16020
> test,,1529871718122.0297f680df6dc0166a44f9536346268e.
> column=info:serverstartcode, timestamp=1529967103390, value=1529966776248
> test,,1529871718122.0297f680df6dc0166a44f9536346268e. column=info:sn,
> timestamp=1529967096482,
> value=ctr-e138-1518143905142-378097-02-000006.hwx.site,16020,1529966755170
> test,,1529871718122.0297f680df6dc0166a44f9536346268e.
> column=info:state, timestamp=1529967103390, value=OPEN{noformat}
> The region is marked as {{OPEN}}. The master doesn't know any better.
> However, the interesting bit is that {{info:server}} and {{info:sn}} are
> inconsistent (which, according to the javadoc should not be possible for an
> {{OPEN}} region).{{}}
> This doesn't happen every time, but I caught it yesterday on the 2nd or 3rd
> attempt, so I'm hopeful it's not a bear to repro.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)