[
https://issues.apache.org/jira/browse/HBASE-20671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16503728#comment-16503728
]
stack commented on HBASE-20671:
-------------------------------
This is an odd one.
We deleted the info column family of the merge parent's row well before the crash:
2018-05-30 04:26:00,634 DEBUG [PEWorker-9] hbase.META: Delete
{"totalColumns":1,"row":"tabletwo_merge,,1527652130538.a7dd6606dcacc9daf085fc9fa2aecc0c.","families":{"info":[{"qualifier":"","vlen":0,"tag":[],"timestamp":1527654360633}]},"ts":9223372036854775807}
The row is supposed to be GONE from hbase:meta at this point.
... Yet post-crash, a little over three minutes later, we find that at least the
info:regioninfo column is still present (otherwise we'd skip the row on
hbase:meta reload):
2018-05-30 04:29:20,263 INFO
[master/ctr-e138-1518143905142-336066-01-000003:20000]
assignment.RegionStateStore: Load hbase:meta entry
region=a7dd6606dcacc9daf085fc9fa2aecc0c, regionState=null, lastHost=null,
regionLocation=null, seqnum=-1
2018-05-30 04:29:20,263 INFO
[master/ctr-e138-1518143905142-336066-01-000003:20000]
assignment.AssignmentManager: a7dd6606dcacc9daf085fc9fa2aecc0c
regionState=null; presuming OFFLINE
2018-05-30 04:29:20,263 INFO
[master/ctr-e138-1518143905142-336066-01-000003:20000] assignment.RegionStates:
Added to offline, CURRENTLY NEVER CLEARED!!! rit=OFFLINE, location=null,
table=tabletwo_merge, region=a7dd6606dcacc9daf085fc9fa2aecc0c
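The timestamps in the excerpts above can be cross-checked with a few lines of Python. This is only a sketch: it assumes the log times are UTC (the logs do not say so), and the variable names are illustrative.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

# Cell timestamp carried by the Delete (milliseconds since the epoch).
delete_ts_ms = 1527654360633
delete_time = EPOCH + timedelta(milliseconds=delete_ts_ms)

# Log-line times copied from the excerpts above; UTC is an assumption.
delete_logged = datetime(2018, 5, 30, 4, 26, 0, 634000, tzinfo=timezone.utc)
meta_reloaded = datetime(2018, 5, 30, 4, 29, 20, 263000, tzinfo=timezone.utc)

print(delete_time.isoformat(timespec="milliseconds"))
print(delete_logged - delete_time)    # gap between cell timestamp and log line
print(meta_reloaded - delete_logged)  # gap between the Delete and the meta reload
```

Under that assumption the Delete's cell timestamp falls 1 ms before its log line, and the hbase:meta reload happens 3 min 19.629 s after the Delete was logged.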
Did this cluster have HBASE-20065? Was it running with read replicas?
The attached RS log is from the server that hosts hbase:meta after the master
comes back, but that server was not carrying hbase:meta during the interesting
window. Do we have the log from the server that was? I see that on master start
we open hbase:meta but do not replay any logs. Perhaps the previous host shut
down cleanly? If it did not, perhaps this is where we went awry: are we
skipping recovery of hbase:meta?
I ask about HBASE-20065 because it has the master set the timestamp on edits
rather than leaving it to the RS. It sets the ts in all places, so it is suspicious.
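Why explicit timestamps matter here: an HBase family delete marker masks only cells whose timestamp is less than or equal to the marker's own timestamp, so a cell stamped later than the Delete would survive it. The toy model below illustrates that rule only. It is not HBase code; `visible_cells` and the cell timestamps are hypothetical, and only the Delete timestamp comes from the log above. Whether anything like this happened on this cluster is not established.

```python
def visible_cells(cells, family_delete_ts):
    """Return the cells still visible after a family delete marker.

    cells maps a qualifier to its cell timestamp in milliseconds.
    A family delete masks every cell with timestamp <= family_delete_ts.
    """
    return {q: ts for q, ts in cells.items() if ts > family_delete_ts}


DELETE_TS = 1527654360633  # timestamp of the Delete logged above

# Hypothetical cells in the info family of the merge parent's row.
cells = {
    "regioninfo": DELETE_TS + 5,  # hypothetical: stamped by a clock running ahead
    "server": DELETE_TS - 1000,   # hypothetical: stamped before the Delete
}

survivors = visible_cells(cells, DELETE_TS)
print(survivors)  # only the cell stamped after the Delete remains
```

In this model the regioninfo cell outlives the Delete purely because of its timestamp, which is the kind of outcome that would leave the row loadable on hbase:meta reload.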
I ask about read replicas because perhaps they are updating state on a
non-existent region? I don't know enough about how these work.
> Merged region brought back to life causing RS to be killed by Master
> --------------------------------------------------------------------
>
> Key: HBASE-20671
> URL: https://issues.apache.org/jira/browse/HBASE-20671
> Project: HBase
> Issue Type: Bug
> Components: amv2
> Affects Versions: 2.0.0
> Reporter: Josh Elser
> Assignee: stack
> Priority: Critical
> Fix For: 2.0.1
>
> Attachments:
> hbase-hbase-master-ctr-e138-1518143905142-336066-01-000003.hwx.site.log.zip,
> hbase-hbase-regionserver-ctr-e138-1518143905142-336066-01-000002.hwx.site.log.zip
>
>
> Another bug coming out of a master restart and replay of the pv2 logs.
> The master merged two regions into one successfully, was restarted, but then
> ended up assigning the children region back out to the cluster. There is a
> log message which appears to indicate that RegionStates acknowledges that it
> doesn't know what this region is as it's replaying the pv2 WAL; however, it
> incorrectly assumes that the region is just OFFLINE and needs to be assigned.
> {noformat}
> 2018-05-30 04:26:00,055 INFO
> [RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=20000] master.HMaster:
> Client=hrt_qa//172.27.85.11 Merge regions a7dd6606dcacc9daf085fc9fa2aecc0c
> and 4017a3c778551d4d258c785d455f9c0b
> 2018-05-30 04:28:27,525 DEBUG
> [master/ctr-e138-1518143905142-336066-01-000003:20000]
> procedure2.ProcedureExecutor: Completed pid=4368, state=SUCCESS;
> MergeTableRegionsProcedure table=tabletwo_merge,
> regions=[a7dd6606dcacc9daf085fc9fa2aecc0c, 4017a3c778551d4d258c785d455f9c0b],
> forcibly=false
> {noformat}
> {noformat}
> 2018-05-30 04:29:20,263 INFO
> [master/ctr-e138-1518143905142-336066-01-000003:20000]
> assignment.AssignmentManager: a7dd6606dcacc9daf085fc9fa2aecc0c
> regionState=null; presuming OFFLINE
> 2018-05-30 04:29:20,263 INFO
> [master/ctr-e138-1518143905142-336066-01-000003:20000]
> assignment.RegionStates: Added to offline, CURRENTLY NEVER CLEARED!!!
> rit=OFFLINE, location=null, table=tabletwo_merge,
> region=a7dd6606dcacc9daf085fc9fa2aecc0c
> 2018-05-30 04:29:20,266 INFO
> [master/ctr-e138-1518143905142-336066-01-000003:20000]
> assignment.AssignmentManager: 4017a3c778551d4d258c785d455f9c0b
> regionState=null; presuming OFFLINE
> 2018-05-30 04:29:20,266 INFO
> [master/ctr-e138-1518143905142-336066-01-000003:20000]
> assignment.RegionStates: Added to offline, CURRENTLY NEVER CLEARED!!!
> rit=OFFLINE, location=null, table=tabletwo_merge,
> region=4017a3c778551d4d258c785d455f9c0b
> {noformat}
> Eventually, the RS reports in its online regions, and the master tells it to
> kill itself:
> {noformat}
> 2018-05-30 04:29:24,272 WARN
> [RpcServer.default.FPBQ.Fifo.handler=26,queue=2,port=20000]
> assignment.AssignmentManager: Killing
> ctr-e138-1518143905142-336066-01-000002.hwx.site,16020,1527654546619: Not
> online: tabletwo_merge,,1527652130538.a7dd6606dcacc9daf085fc9fa2aecc0c.
> {noformat}