stack created HBASE-19998:
-----------------------------

             Summary: Flakey TestVisibilityLabelsWithDefaultVisLabelService
                 Key: HBASE-19998
                 URL: https://issues.apache.org/jira/browse/HBASE-19998
             Project: HBase
          Issue Type: Bug
          Components: flakey, test
            Reporter: stack
            Assignee: stack
             Fix For: 2.0.0-beta-2


This is a good one. Its a timeout and though it has lots of test methods, the 
problem is one of them gets stuck. The test method kills a RegionServers then 
starts a new one. Usually all works out fine but the odd time there is this 
unexplained MOVE that gets interjected just as ServerCrashProcedure starts up. 
hbase:meta gets stuck (perhaps this is what is being referred to here: 
https://issues.apache.org/jira/browse/HBASE-19929?focusedCommentId=16356906&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16356906).
 It is trying to run the MOVE by first unassigning from the server that has 
just crashed. It never succeeds. Need to fix this. Need to figure where these 
Move operations are coming from too. Let me add some debug. 

See here how we are well into ServerCrashProcedure... and then two MOVEs 
cut-in... for hbase:meta and for namespace:

{code}
....
2018-02-14 02:35:19,806 DEBUG [PEWorker-6] procedure.ServerCrashProcedure(192): 
pid=10, state=RUNNABLE:SERVER_CRASH_PROCESS_META; ServerCrashProcedure 
server=asf903.gq1.ygridcore.net,59608,1518575711969, splitWal=true, meta=true; 
Processing hbase:meta that was on asf903.gq1.ygridcore.net,59608,1518575711969
2018-02-14 02:35:19,807 INFO  [PEWorker-6] procedure2.ProcedureExecutor(1498): 
Initialized subprocedures=[{pid=12, ppid=10, 
state=RUNNABLE:RECOVER_META_SPLIT_LOGS; RecoverMetaProcedure 
failedMetaServer=asf903.gq1.ygridcore.net,59608,1518575711969, splitWal=true}]
2018-02-14 02:35:19,811 DEBUG [Thread-214] procedure2.ProcedureExecutor(868): 
Stored pid=11, state=RUNNABLE:MOVE_REGION_UNASSIGN; MoveRegionProcedure 
hri=hbase:meta,,1.1588230740, 
source=asf903.gq1.ygridcore.net,59608,1518575711969, destination=
2018-02-14 02:35:19,813 INFO  [PEWorker-8] 
procedure.MasterProcedureScheduler(813): pid=11, 
state=RUNNABLE:MOVE_REGION_UNASSIGN; MoveRegionProcedure 
hri=hbase:meta,,1.1588230740, 
source=asf903.gq1.ygridcore.net,59608,1518575711969, destination= hbase:meta 
hbase:meta,,1.1588230740
2018-02-14 02:35:19,814 INFO  [PEWorker-8] procedure2.ProcedureExecutor(1498): 
Initialized subprocedures=[{pid=14, ppid=11, 
state=RUNNABLE:REGION_TRANSITION_DISPATCH; UnassignProcedure table=hbase:meta, 
region=1588230740, server=asf903.gq1.ygridcore.net,59608,1518575711969}]
2018-02-14 02:35:19,831 DEBUG [Thread-214] procedure2.ProcedureExecutor(868): 
Stored pid=13, state=RUNNABLE:MOVE_REGION_UNASSIGN; MoveRegionProcedure 
hri=hbase:namespace,,1518575716296.e52a160b3f3a57ab50d710eba62d9b15., 
source=asf903.gq1.ygridcore.net,59608,1518575711969, destination=
2018-02-14 02:35:19,833 INFO  [PEWorker-10] 
procedure.MasterProcedureScheduler(813): pid=13, 
state=RUNNABLE:MOVE_REGION_UNASSIGN; MoveRegionProcedure 
hri=hbase:namespace,,1518575716296.e52a160b3f3a57ab50d710eba62d9b15., 
source=asf903.gq1.ygridcore.net,59608,1518575711969, destination= 
hbase:namespace hbase:namespace,,1518575716296.e52a160b3f3a57ab50d710eba62d9b15.
2018-02-14 02:35:19,837 INFO  [PEWorker-10] procedure2.ProcedureExecutor(1498): 
Initialized subprocedures=[{pid=15, ppid=13, 
state=RUNNABLE:REGION_TRANSITION_DISPATCH; UnassignProcedure 
table=hbase:namespace, region=e52a160b3f3a57ab50d710eba62d9b15, 
server=asf903.gq1.ygridcore.net,59608,1518575711969}]
....
{code}

Here is the failure of the unassign:

{code}
2018-02-14 02:35:19,944 WARN  [PEWorker-8] 
assignment.RegionTransitionProcedure(187): Remote call failed pid=14, ppid=11, 
state=RUNNABLE:REGION_TRANSITION_DISPATCH; UnassignProcedure table=hbase:meta, 
region=1588230740, server=asf903.gq1.ygridcore.net,59608,1518575711969; 
rit=CLOSING, location=asf903.gq1.ygridcore.net,59608,1518575711969; 
exception=pid=14, ppid=11, state=RUNNABLE:REGION_TRANSITION_DISPATCH; 
UnassignProcedure table=hbase:meta, region=1588230740, 
server=asf903.gq1.ygridcore.net,59608,1518575711969 to 
asf903.gq1.ygridcore.net,59608,1518575711969
2018-02-14 02:35:19,945 WARN  [PEWorker-8] assignment.UnassignProcedure(245): 
Expiring server pid=14, ppid=11, state=RUNNABLE:REGION_TRANSITION_DISPATCH; 
UnassignProcedure table=hbase:meta, region=1588230740, 
server=asf903.gq1.ygridcore.net,59608,1518575711969; rit=CLOSING, 
location=asf903.gq1.ygridcore.net,59608,1518575711969, 
exception=org.apache.hadoop.hbase.master.assignment.FailedRemoteDispatchException:
 pid=14, ppid=11, state=RUNNABLE:REGION_TRANSITION_DISPATCH; UnassignProcedure 
table=hbase:meta, region=1588230740, 
server=asf903.gq1.ygridcore.net,59608,1518575711969 to 
asf903.gq1.ygridcore.net,59608,1518575711969
2018-02-14 02:35:19,945 WARN  [PEWorker-8] master.ServerManager(579): 
Expiration of asf903.gq1.ygridcore.net,59608,1518575711969 but server shutdown 
already in progress
2018-02-14 02:35:19,949 DEBUG [RS-EventLoopGroup-4-22] ipc.FailedServers(56): 
Added failed server with address asf903.gq1.ygridcore.net/67.195.81.147:59608 
to list caused by 
org.apache.hbase.thirdparty.io.netty.channel.AbstractChannel$AnnotatedConnectException:
 syscall:getsockopt(..) failed: Connection refused: 
asf903.gq1.ygridcore.net/67.195.81.147:59608
2018-02-14 02:35:20,064 DEBUG [MultiHConnection-shared--pool8-t5] 
ipc.AbstractRpcClient(349): Not trying to connect to 
asf903.gq1.ygridcore.net/67.195.81.147:59608 this server is in the failed 
servers list
...
{code}

... and we keep trying:
{code}
2018-02-14 02:35:22,093 INFO  [MultiHConnection-shared--pool8-t5] 
client.AsyncRequestFutureImpl(767): id=6, table=hbase:meta, attempt=6/46 
failed=1ops, last exception=java.net.ConnectException: Call to 
asf903.gq1.ygridcore.net/67.195.81.147:59608 failed on connection exception: 
org.apache.hbase.thirdparty.io.netty.channel.AbstractChannel$AnnotatedConnectException:
 syscall:getsockopt(..) failed: Connection refused: 
asf903.gq1.ygridcore.net/67.195.81.147:59608 on 
asf903.gq1.ygridcore.net,59608,1518575711969, tracking started null, retrying 
after=2000ms, replay=1ops
2018-02-14 02:35:24,095 DEBUG [MultiHConnection-shared--pool8-t5] 
ipc.NettyRpcConnection(254): Connecting to 
asf903.gq1.ygridcore.net/67.195.81.147:59608
2018-02-14 02:35:24,110 DEBUG [RS-EventLoopGroup-4-24] ipc.FailedServers(56): 
Added failed server with address asf903.gq1.ygridcore.net/67.195.81.147:59608 
to list caused by 
org.apache.hbase.thirdparty.io.netty.channel.AbstractChannel$AnnotatedConnectException:
 syscall:getsockopt(..) failed: Connection refused: 
asf903.gq1.ygridcore.net/67.195.81.147:59608
2018-02-14 02:35:24,112 INFO  [MultiHConnection-shared--pool8-t5] 
client.AsyncRequestFutureImpl(767): id=6, table=hbase:meta, attempt=7/46 
failed=1ops, last exception=java.net.ConnectException: Call to 
asf903.gq1.ygridcore.net/67.195.81.147:59608 failed on connection exception: 
org.apache.hbase.thirdparty.io.netty.channel.AbstractChannel$AnnotatedConnectException:
 syscall:getsockopt(..) failed: Connection refused: 
asf903.gq1.ygridcore.net/67.195.81.147:59608 on 
asf903.gq1.ygridcore.net,59608,1518575711969, tracking started null, retrying 
after=4005ms, replay=1ops
2018-02-14 02:35:28,120 DEBUG [MultiHConnection-shared--pool8-t5] 
ipc.NettyRpcConnection(254): Connecting to 
asf903.gq1.ygridcore.net/67.195.81.147:59608
{code}


Test set: 
org.apache.hadoop.hbase.security.visibility.TestVisibilityLabelsWithDefaultVisLabelService
-------------------------------------------------------------------------------
Tests run: 2, Failures: 0, Errors: 2, Skipped: 0, Time elapsed: 600.83 s <<< 
FAILURE! - in 
org.apache.hadoop.hbase.security.visibility.TestVisibilityLabelsWithDefaultVisLabelService
org.apache.hadoop.hbase.security.visibility.TestVisibilityLabelsWithDefaultVisLabelService
  Time elapsed: 583.188 s  <<< ERROR!
org.junit.runners.model.TestTimedOutException: test timed out after 600 seconds
        at 
org.apache.hadoop.hbase.security.visibility.TestVisibilityLabelsWithDefaultVisLabelService.testAddVisibilityLabelsOnRSRestart(TestVisibilityLabelsWithDefaultVisLabelService.java:126)

org.apache.hadoop.hbase.security.visibility.TestVisibilityLabelsWithDefaultVisLabelService
  Time elapsed: 583.227 s  <<< ERROR!
java.lang.Exception: Appears to be stuck in thread RS-EventLoopGroup-7-2



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Reply via email to