[ 
https://issues.apache.org/jira/browse/HBASE-20792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16523978#comment-16523978
 ] 

Josh Elser commented on HBASE-20792:
------------------------------------

Seems like after we handled the meta assignment for rs6, we just "forgot" about 
reassigning the non-system table regions from that same host.
{noformat}
2018-06-26 10:57:23,614 DEBUG 
[RpcServer.priority.FPBQ.Fifo.handler=19,queue=1,port=20000] 
assignment.RegionTransitionProcedure: Received report OPENED seqId=900, 
pid=437, ppid=436, state=RUNNABLE:REGION_TRANSITION_DISPATCH; AssignProcedure 
table=hbase:meta, region=1588230740; rit=OPENING, 
location=ctr-e138-1518143905142-380046-01-000002.hwx.site,16020,1530010633255
2018-06-26 10:57:23,615 DEBUG [PEWorker-3] 
assignment.RegionTransitionProcedure: Finishing pid=437, ppid=436, 
state=RUNNABLE:REGION_TRANSITION_FINISH; AssignProcedure table=hbase:meta, 
region=1588230740; rit=OPENING, 
location=ctr-e138-1518143905142-380046-01-000002.hwx.site,16020,1530010633255
2018-06-26 10:57:23,615 INFO  [PEWorker-3] zookeeper.MetaTableLocator: Setting 
hbase:meta (replicaId=0) location in ZooKeeper as 
ctr-e138-1518143905142-380046-01-000002.hwx.site,16020,1530010633255
2018-06-26 10:57:23,624 INFO  [PEWorker-3] procedure2.ProcedureExecutor: 
Finished subprocedure(s) of pid=436, state=RUNNABLE:SERVER_CRASH_GET_REGIONS; 
ServerCrashProcedure 
server=ctr-e138-1518143905142-380046-01-000006.hwx.site,16020,1530007944484, 
splitWal=true, meta=true; resume parent processing.
2018-06-26 10:57:23,624 INFO  [PEWorker-3] procedure2.ProcedureExecutor: 
Finished pid=437, ppid=436, state=SUCCESS; AssignProcedure table=hbase:meta, 
region=1588230740 in 1mins, 32.386sec
{noformat}
After this, we wait for the RS report timeout (30s), see meta in the goofy 
state from up-above, then finish up SCP:
{noformat}
2018-06-26 10:58:10,856 DEBUG [PEWorker-2] procedure.ServerCrashProcedure: 
Splitting WALs pid=436, state=RUNNABLE:SERVER_CRASH_SPLIT_LOGS; 
ServerCrashProcedure 
server=ctr-e138-1518143905142-380046-01-000006.hwx.site,16020,1530007944484, 
splitWal=true, meta=true
2018-06-26 10:58:10,858 INFO  [PEWorker-2] master.MasterWalManager: Log dir for 
server ctr-e138-1518143905142-380046-01-000006.hwx.site,16020,1530007944484 
does not exist
2018-06-26 10:58:10,858 INFO  [PEWorker-2] master.SplitLogManager: dead 
splitlog workers 
[ctr-e138-1518143905142-380046-01-000006.hwx.site,16020,1530007944484]
2018-06-26 10:58:10,859 INFO  [PEWorker-2] master.SplitLogManager: Started 
splitting 0 logs in [] for 
[ctr-e138-1518143905142-380046-01-000006.hwx.site,16020,1530007944484]
2018-06-26 10:58:10,860 INFO  [PEWorker-2] master.SplitLogManager: finished 
splitting (more than or equal to) 0 bytes in 0 log files in [] in 1ms
2018-06-26 10:58:10,860 DEBUG [PEWorker-2] procedure.ServerCrashProcedure: Done 
splitting WALs pid=436, state=RUNNABLE:SERVER_CRASH_SPLIT_LOGS; 
ServerCrashProcedure 
server=ctr-e138-1518143905142-380046-01-000006.hwx.site,16020,1530007944484, 
splitWal=true, meta=true
2018-06-26 10:58:10,862 DEBUG [WALProcedureStoreSyncThread] 
wal.WALProcedureStore: hsync completed for 
hdfs://ctr-e138-1518143905142-380046-01-000003.hwx.site:8020/apps/hbase/data/MasterProcWALs/pv2-00000000000000000007.log
2018-06-26 10:58:10,864 INFO  [PEWorker-2] procedure2.ProcedureExecutor: 
Initialized subprocedures=[{pid=438, ppid=436, 
state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure 
table=backup:system_bulk, region=b3d46c86aac9e28cda327c750a919050}, {pid=439, 
ppid=436, state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure 
table=namespace_vbgi9x:namespacetable, 
region=fba131dea57789bfdcec0675bfdb2955}, {pid=440, ppid=436, 
state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure table=backup:system, 
region=5af765b953f204142ca7f0f7e506d19c}]
{noformat}
Seems like some combination of meta's state being wrong and SCP not handling 
all of the regions that were on RS6 is the problem. I'm not sure what is the 
exact culprit though. Need to read code now.

> info:servername and info:sn inconsistent for OPEN region
> --------------------------------------------------------
>
>                 Key: HBASE-20792
>                 URL: https://issues.apache.org/jira/browse/HBASE-20792
>             Project: HBase
>          Issue Type: Bug
>          Components: Region Assignment
>            Reporter: Josh Elser
>            Assignee: Josh Elser
>            Priority: Blocker
>             Fix For: 2.0.2
>
>
> Next problem we've run into after HBASE-20752 and HBASE-20708
> After a rolling restart of a cluster, we'll see situations where a collection 
> of regions will simply not be assigned out to the RS. I was able to reproduce 
> this my mimic the restart patterns our tests do internally (ignore whether 
> this is the best way to restart nodes for now :)). The general pattern is 
> this:
> {code:java}
> for rs in regionservers:
>   stop(server, rs, RS)
> for master in masters:
>   stop(server, master, MASTER)
> sleep(15)
> for master in masters:
>   start(server, master, MASTER)
> for rs in regionservers:
>   start(server, rs, RS){code}
> Looking at meta, we can see why the Master is ignoring some regions:
> {noformat}
>  test                                                        
> column=table:state, timestamp=1529871718998, value=\x08\x00
>  test,,1529871718122.0297f680df6dc0166a44f9536346268e.       
> column=info:regioninfo, timestamp=1529967103390, value={ENCODED => 
> 0297f680df6dc0166a44f9536346268e, NAME => 
> 'test,,1529871718122.0297f680df6dc0166a44f9536346268e.', STARTKEY
>                                                              => '', ENDKEY => 
> ''}
>  test,,1529871718122.0297f680df6dc0166a44f9536346268e.       
> column=info:seqnumDuringOpen, timestamp=1529967103390, 
> value=\x00\x00\x00\x00\x00\x00\x00*
>  test,,1529871718122.0297f680df6dc0166a44f9536346268e.       
> column=info:server, timestamp=1529967103390, 
> value=ctr-e138-1518143905142-378097-02-000012.hwx.site:16020
>  test,,1529871718122.0297f680df6dc0166a44f9536346268e.       
> column=info:serverstartcode, timestamp=1529967103390, value=1529966776248
>  test,,1529871718122.0297f680df6dc0166a44f9536346268e.       column=info:sn, 
> timestamp=1529967096482, 
> value=ctr-e138-1518143905142-378097-02-000006.hwx.site,16020,1529966755170
>  test,,1529871718122.0297f680df6dc0166a44f9536346268e.       
> column=info:state, timestamp=1529967103390, value=OPEN{noformat}
> The region is marked as {{OPEN}}. The master doesn't know any better. 
> However, the interesting bit is that {{info:server}} and {{info:sn}} are 
> inconsistent (which, according to the javadoc should not be possible for an 
> {{OPEN}} region).{{}}
> This doesn't happen every time, but I caught it yesterday on the 2nd or 3rd 
> attempt, so I'm hopeful it's not a bear to repro.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Reply via email to