[
https://issues.apache.org/jira/browse/HBASE-20864?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Allan Yang updated HBASE-20864:
-------------------------------
Description:
While running ITBLL against our internal 2.0.0 build (with 2.0.1 backported,
plus two other issues: HBASE-20706 and HBASE-20752), I found that two of my RSs
had been killed by the master because the master held a different region state
than those RSs. Strangely, the master thought these regions should be on an
already dead server. There might be a serious bug, but I haven't found it yet.
Here is the sequence of events:
1. e010125048153.bja,60020,1531137365840 crashed, and
4423e4182457c5b573729be4682cc3a3 was clearly assigned to
e010125049164.bja,60020,1531136465378 during ServerCrashProcedure
{code:java}
2018-07-09 20:03:32,443 INFO [PEWorker-10] procedure.ServerCrashProcedure:
Start pid=2303, state=RUNNABLE:SERVER_CRASH_START; ServerCrashProcedure
server=e010125048153.bja,60020,1531137365840, splitWal=true, meta=false
2018-07-09 20:03:39,220 DEBUG
[RpcServer.default.FPBQ.Fifo.handler=294,queue=24,port=60000]
assignment.RegionTransitionProcedure: Received report OPENED seqId=16021,
pid=2305, ppid=2303, state=RUNNABLE:REGION_TRANSITION_DISPATCH; AssignProcedure
table=IntegrationTestBigLinkedList, region=4423e4182457c5b573729be4682cc3a3;
rit=OPENING, location=e010125049164.bja,60020,1531136465378
2018-07-09 20:03:39,220 INFO [PEWorker-13] assignment.RegionStateStore:
pid=2305 updating hbase:meta row=4423e4182457c5b573729be4682cc3a3,
regionState=OPEN, openSeqNum=16021,
regionLocation=e010125049164.bja,60020,1531136465378
2018-07-09 20:03:43,190 INFO [PEWorker-12] procedure2.ProcedureExecutor:
Finished pid=2303, state=SUCCESS; ServerCrashProcedure
server=e010125048153.bja,60020,1531137365840, splitWal=true, meta=false in
10.7490sec
{code}
2. A modify table happened later, and 4423e4182457c5b573729be4682cc3a3 was
reopened on e010125049164.bja,60020,1531136465378
{code:java}
2018-07-09 20:04:39,929 DEBUG
[RpcServer.default.FPBQ.Fifo.handler=295,queue=25,port=60000]
assignment.RegionTransitionProcedure: Received report OPENED seqId=16024,
pid=2351, ppid=2314, state=RUNNABLE:REGION_TRANSITION_DISPATCH; AssignProcedure
table=IntegrationTestBigLinkedList, region=4423e4182457c5b573729be4682cc3a3,
target=e010125049164.bja,60020,1531136465378; rit=OPENING,
location=e010125049164.bja,60020,1531136465378
2018-07-09 20:04:40,554 INFO [PEWorker-6] assignment.RegionStateStore:
pid=2351 updating hbase:meta row=4423e4182457c5b573729be4682cc3a3,
regionState=OPEN, openSeqNum=16024,
regionLocation=e010125049164.bja,60020,1531136465378
{code}
3. The active master was killed and the backup master took over, but when it
loaded the meta entry, it clearly showed 4423e4182457c5b573729be4682cc3a3 on the
previous, already dead server e010125048153.bja,60020,1531137365840. That is
very strange!
{code:java}
2018-07-09 20:06:17,985 INFO [master/e010125048016:60000]
assignment.RegionStateStore: Load hbase:meta entry
region=4423e4182457c5b573729be4682cc3a3, regionState=OPEN,
lastHost=e010125049164.bja,60020,1531136465378,
regionLocation=e010125048153.bja,60020,1531137365840, openSeqNum=16024
{code}
4. The RS was killed
{code:java}
2018-07-09 20:06:20,265 WARN
[RpcServer.default.FPBQ.Fifo.handler=297,queue=27,port=60000]
assignment.AssignmentManager: Killing e010125049164.bja,60020,1531136465378:
rit=OPEN, location=e010125048153.bja,60020,1531137365840,
table=IntegrationTestBigLinkedList,
region=4423e4182457c5b573729be4682cc3a3reported OPEN on
server=e010125049164.bja,60020,1531136465378 but state has otherwise.
{code}
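To look for the same symptom in other master logs, the check in steps 1 to 3 can be sketched as a small script. This is only a rough sketch under stated assumptions: the regular expressions are derived solely from the log excerpts above (not from the HBase source), the logs of the old and new master are assumed to be concatenated in time order, and `UPDATE_RE`, `LOAD_RE` and `find_stale_locations` are hypothetical names, not HBase APIs. It records the last `regionLocation` written to hbase:meta for each region and flags any later "Load hbase:meta entry" line whose `regionLocation` differs.

```python
import re

# Patterns for the two RegionStateStore messages quoted in this report.
# The field layout is an assumption based on these excerpts only.
UPDATE_RE = re.compile(
    r"updating hbase:meta row=(?P<region>[0-9a-f]+),\s*"
    r"regionState=(?P<state>\w+),\s*openSeqNum=(?P<seq>\d+),\s*"
    r"regionLocation=(?P<location>\S+)"
)
LOAD_RE = re.compile(
    r"Load hbase:meta entry\s+region=(?P<region>[0-9a-f]+),\s*"
    r"regionState=(?P<state>\w+),\s*lastHost=(?P<last_host>\S+?),\s*"
    r"regionLocation=(?P<location>\S+?),\s*openSeqNum=(?P<seq>\d+)"
)

SAMPLE_LOG = """
2018-07-09 20:03:39,220 INFO [PEWorker-13] assignment.RegionStateStore:
pid=2305 updating hbase:meta row=4423e4182457c5b573729be4682cc3a3,
regionState=OPEN, openSeqNum=16021,
regionLocation=e010125049164.bja,60020,1531136465378
2018-07-09 20:04:40,554 INFO [PEWorker-6] assignment.RegionStateStore:
pid=2351 updating hbase:meta row=4423e4182457c5b573729be4682cc3a3,
regionState=OPEN, openSeqNum=16024,
regionLocation=e010125049164.bja,60020,1531136465378
2018-07-09 20:06:17,985 INFO [master/e010125048016:60000]
assignment.RegionStateStore: Load hbase:meta entry
region=4423e4182457c5b573729be4682cc3a3, regionState=OPEN,
lastHost=e010125049164.bja,60020,1531136465378,
regionLocation=e010125048153.bja,60020,1531137365840, openSeqNum=16024
"""


def find_stale_locations(log_text):
    """Return (region, last_written_location, loaded_location) tuples for
    every loaded meta entry that disagrees with the last logged update."""
    # Log records are wrapped across lines, so collapse all whitespace first.
    text = " ".join(log_text.split())

    # Collect both kinds of events and replay them in log order.
    events = [(m.start(), "update", m) for m in UPDATE_RE.finditer(text)]
    events += [(m.start(), "load", m) for m in LOAD_RE.finditer(text)]
    events.sort(key=lambda event: event[0])

    last_written = {}  # region -> location most recently written to meta
    mismatches = []
    for _, kind, match in events:
        region = match.group("region")
        location = match.group("location")
        if kind == "update":
            last_written[region] = location
        else:
            expected = last_written.get(region)
            if expected is not None and expected != location:
                mismatches.append((region, expected, location))
    return mismatches


if __name__ == "__main__":
    for region, written, loaded in find_stale_locations(SAMPLE_LOG):
        print(f"region={region} last written to {written} but loaded as {loaded}")
```

Run against the excerpts from this report, the sketch flags region 4423e4182457c5b573729be4682cc3a3: meta was last updated with the location e010125049164.bja,60020,1531136465378, yet the new master loaded e010125048153.bja,60020,1531137365840.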
> RS was killed due to master thought the region should be on a already dead
> server
> ---------------------------------------------------------------------------------
>
> Key: HBASE-20864
> URL: https://issues.apache.org/jira/browse/HBASE-20864
> Project: HBase
> Issue Type: Bug
> Affects Versions: 2.0.0
> Reporter: Allan Yang
> Priority: Major