[
https://issues.apache.org/jira/browse/HBASE-20792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16526975#comment-16526975
]
Duo Zhang commented on HBASE-20792:
-----------------------------------
{quote}
So, what I'm gathering is that we pull out a seqId=1 on open which makes our
openSeqNum=2 when we actually complete the OPEN. However, when we close the
region down, we don't update that new seqNum anywhere; thus, the next time we
open the region, we get 2 again. It's not an issue from a data correctness (I
think..), but it seems to cause this inf+loop in RTRP as we see.
{quote}
Which branch [~elserj]? 2.1 or 2.0?
There is a issue for the openSeqNum bumping
https://issues.apache.org/jira/browse/HBASE-20242, but it is only committed to
branch-2.1+ I believe. And a successful reopen should bump the openSeqNum as we
write a open region event to the WAL and when replaying the sequence number
will be increased?
Anyway, there is a way to rider over this is that, even if the region is in
OPEN state before, beside comparing the openSeqNum, we can also check the
location, if the location is changed then we can also make sure that the region
is reopened.
But I still there maybe other problems as the openSeqNum should be bumped
during reopening...
> info:servername and info:sn inconsistent for OPEN region
> --------------------------------------------------------
>
> Key: HBASE-20792
> URL: https://issues.apache.org/jira/browse/HBASE-20792
> Project: HBase
> Issue Type: Bug
> Components: Region Assignment
> Reporter: Josh Elser
> Assignee: Josh Elser
> Priority: Blocker
> Fix For: 3.0.0, 2.1.0, 2.0.2, 2.2.0
>
> Attachments: HBASE-20792.patch, TestRegionMoveAndAbandon.java,
> hbase-hbase-master-ctr-e138-1518143905142-380753-01-000004.hwx.site.log
>
>
> Next problem we've run into after HBASE-20752 and HBASE-20708
> After a rolling restart of a cluster, we'll see situations where a collection
> of regions will simply not be assigned out to the RS. I was able to reproduce
> this my mimic the restart patterns our tests do internally (ignore whether
> this is the best way to restart nodes for now :)). The general pattern is
> this:
> {code:java}
> for rs in regionservers:
> stop(server, rs, RS)
> for master in masters:
> stop(server, master, MASTER)
> sleep(15)
> for master in masters:
> start(server, master, MASTER)
> for rs in regionservers:
> start(server, rs, RS){code}
> Looking at meta, we can see why the Master is ignoring some regions:
> {noformat}
> test
> column=table:state, timestamp=1529871718998, value=\x08\x00
> test,,1529871718122.0297f680df6dc0166a44f9536346268e.
> column=info:regioninfo, timestamp=1529967103390, value={ENCODED =>
> 0297f680df6dc0166a44f9536346268e, NAME =>
> 'test,,1529871718122.0297f680df6dc0166a44f9536346268e.', STARTKEY
> => '', ENDKEY =>
> ''}
> test,,1529871718122.0297f680df6dc0166a44f9536346268e.
> column=info:seqnumDuringOpen, timestamp=1529967103390,
> value=\x00\x00\x00\x00\x00\x00\x00*
> test,,1529871718122.0297f680df6dc0166a44f9536346268e.
> column=info:server, timestamp=1529967103390,
> value=ctr-e138-1518143905142-378097-02-000012.hwx.site:16020
> test,,1529871718122.0297f680df6dc0166a44f9536346268e.
> column=info:serverstartcode, timestamp=1529967103390, value=1529966776248
> test,,1529871718122.0297f680df6dc0166a44f9536346268e. column=info:sn,
> timestamp=1529967096482,
> value=ctr-e138-1518143905142-378097-02-000006.hwx.site,16020,1529966755170
> test,,1529871718122.0297f680df6dc0166a44f9536346268e.
> column=info:state, timestamp=1529967103390, value=OPEN{noformat}
> The region is marked as {{OPEN}}. The master doesn't know any better.
> However, the interesting bit is that {{info:server}} and {{info:sn}} are
> inconsistent (which, according to the javadoc should not be possible for an
> {{OPEN}} region).{{}}
> This doesn't happen every time, but I caught it yesterday on the 2nd or 3rd
> attempt, so I'm hopeful it's not a bear to repro.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)