[ 
https://issues.apache.org/jira/browse/HBASE-19998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16364511#comment-16364511
 ] 

stack commented on HBASE-19998:
-------------------------------

Debug turned up the culprit 'mover':

{code}
2018-02-14 12:35:07,975 DEBUG [PEWorker-6] procedure.ServerCrashProcedure(217): 
Splitting WALs pid=9, state=RUNNABLE:SERVER_CRASH_SPLIT_LOGS; 
ServerCrashProcedure server=asf903.gq1.ygridcore.net,56387,1518611700720, 
splitWal=true, meta=false
2018-02-14 12:35:07,978 INFO  [Thread-213] assignment.MoveRegionProcedure(57): 
REMOVE
java.lang.Throwable: REMOVE: Just to see who is calling Move!!!
        at 
org.apache.hadoop.hbase.master.assignment.MoveRegionProcedure.<init>(MoveRegionProcedure.java:57)
        at 
org.apache.hadoop.hbase.master.assignment.AssignmentManager.createMoveRegionProcedure(AssignmentManager.java:740)
        at 
org.apache.hadoop.hbase.master.assignment.AssignmentManager.moveAsync(AssignmentManager.java:545)
        at 
org.apache.hadoop.hbase.master.assignment.AssignmentManager.lambda$checkIfShouldMoveSystemRegionAsync$0(AssignmentManager.java:481)
        at java.lang.Thread.run(Thread.java:748)
2018-02-14 12:35:07,986 DEBUG [PEWorker-6] master.MasterWalManager(257): 
Renamed region directory: 
hdfs://localhost:39302/user/jenkins/test-data/4b64143d-6b3f-4516-9efb-0c6c1a438493/WALs/asf903.gq1.ygridcore.net,56387,1518611700720-splitting
2018-02-14 12:35:07,986 INFO  [PEWorker-6] master.SplitLogManager(461): dead 
splitlog workers [asf903.gq1.ygridcore.net,56387,1518611700720]
2018-02-14 12:35:07,986 DEBUG [Time-limited test-EventThread] 
procedure2.ProcedureExecutor(868): Stored pid=10, 
state=RUNNABLE:SERVER_CRASH_START; ServerCrashProcedure 
server=asf903.gq1.ygridcore.net,51305,1518611700792, splitWal=true, meta=true
2018-02-14 12:35:07,987 DEBUG [Time-limited test-EventThread] 
assignment.AssignmentManager(1382): 
Added=asf903.gq1.ygridcore.net,51305,1518611700792 to dead servers, submitted 
shutdown handler to be executed meta=true
2018-02-14 12:35:07,988 DEBUG [Time-limited test-EventThread] 
zookeeper.ZKUtil(355): regionserver:35367-0x161944f90b6000c, 
quorum=localhost:59400, baseZNode=/hbase Set watcher on existing 
znode=/hbase/rs/asf903.gq1.ygridcore.net,35367,1518611707634
2018-02-14 12:35:07,988 INFO  [PEWorker-8] procedure.ServerCrashProcedure(118): 
Start pid=10, state=RUNNABLE:SERVER_CRASH_START; ServerCrashProcedure 
server=asf903.gq1.ygridcore.net,51305,1518611700792, splitWal=true, meta=true
2018-02-14 12:35:07,992 INFO  [PEWorker-6] master.SplitLogManager(241): Started 
splitting 1 logs in 
[hdfs://localhost:39302/user/jenkins/test-data/4b64143d-6b3f-4516-9efb-0c6c1a438493/WALs/asf903.gq1.ygridcore.net,56387,1518611700720-splitting]
 for [asf903.gq1.ygridcore.net,56387,1518611700720]
2018-02-14 12:35:07,999 DEBUG [Thread-213] procedure2.ProcedureExecutor(868): 
Stored pid=11, state=RUNNABLE:MOVE_REGION_UNASSIGN; MoveRegionProcedure 
hri=hbase:meta,,1.1588230740, 
source=asf903.gq1.ygridcore.net,51305,1518611700792, destination=
2018-02-14 12:35:08,003 DEBUG [PEWorker-9] procedure.ServerCrashProcedure(192): 
pid=10, state=RUNNABLE:SERVER_CRASH_PROCESS_META; ServerCrashProcedure 
server=asf903.gq1.ygridcore.net,51305,1518611700792, splitWal=true, meta=true; 
processing hbase:meta
2018-02-14 12:35:08,003 INFO  [PEWorker-10] 
procedure.MasterProcedureScheduler(813): pid=11, 
state=RUNNABLE:MOVE_REGION_UNASSIGN; MoveRegionProcedure 
hri=hbase:meta,,1.1588230740, 
source=asf903.gq1.ygridcore.net,51305,1518611700792, destination=, 
table=hbase:meta, hbase:meta,,1.1588230740
2018-02-14 12:35:08,003 INFO  [PEWorker-9] procedure2.ProcedureExecutor(1498): 
Initialized subprocedures=[{pid=12, ppid=10, 
state=RUNNABLE:RECOVER_META_SPLIT_LOGS; RecoverMetaProcedure 
failedMetaServer=asf903.gq1.ygridcore.net,51305,1518611700792, splitWal=true}]
2018-02-14 12:35:08,005 INFO  [PEWorker-10] procedure2.ProcedureExecutor(1498): 
Initialized subprocedures=[{pid=13, ppid=11, 
state=RUNNABLE:REGION_TRANSITION_DISPATCH; UnassignProcedure table=hbase:meta, 
region=1588230740, server=asf903.gq1.ygridcore.net,51305,1518611700792}]
2018-02-14 12:35:08,006 INFO  [Thread-213] assignment.MoveRegionProcedure(57): 
REMOVE
java.lang.Throwable: REMOVE: Just to see who is calling Move!!!
        at 
org.apache.hadoop.hbase.master.assignment.MoveRegionProcedure.<init>(MoveRegionProcedure.java:57)
        at 
org.apache.hadoop.hbase.master.assignment.AssignmentManager.createMoveRegionProcedure(AssignmentManager.java:740)
        at 
org.apache.hadoop.hbase.master.assignment.AssignmentManager.moveAsync(AssignmentManager.java:545)
        at 
org.apache.hadoop.hbase.master.assignment.AssignmentManager.lambda$checkIfShouldMoveSystemRegionAsync$0(AssignmentManager.java:488)
        at java.lang.Thread.run(Thread.java:748)
{code}

Its the facility added over in HBASE-18494 that is cutting in. In this case two 
RegionServers are killed and a new one started. This HBASE-18494 facility cuts 
in w/ its moves though ServerCrashProcedure is taking care of it. There is that 
and then the failure to give up if on unassign we timeout against the remote 
server. [~uagashe] FYI. 

> Flakey TestVisibilityLabelsWithDefaultVisLabelService
> -----------------------------------------------------
>
>                 Key: HBASE-19998
>                 URL: https://issues.apache.org/jira/browse/HBASE-19998
>             Project: HBase
>          Issue Type: Bug
>          Components: flakey, test
>            Reporter: stack
>            Assignee: stack
>            Priority: Major
>             Fix For: 2.0.0-beta-2
>
>         Attachments: hbase-19988.master.001.patch
>
>
> This is a good one. Its a timeout and though it has lots of test methods, the 
> problem is one of them gets stuck. The test method kills a RegionServers then 
> starts a new one. Usually all works out fine but the odd time there is this 
> unexplained MOVE that gets interjected just as ServerCrashProcedure starts 
> up. hbase:meta gets stuck (perhaps this is what is being referred to here: 
> https://issues.apache.org/jira/browse/HBASE-19929?focusedCommentId=16356906&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16356906).
>  It is trying to run the MOVE by first unassigning from the server that has 
> just crashed. It never succeeds. Need to fix this. Need to figure where these 
> Move operations are coming from too. Let me add some debug. 
> See here how we are well into ServerCrashProcedure... and then two MOVEs 
> cut-in... for hbase:meta and for namespace:
> {code}
> ....
> 2018-02-14 02:35:19,806 DEBUG [PEWorker-6] 
> procedure.ServerCrashProcedure(192): pid=10, 
> state=RUNNABLE:SERVER_CRASH_PROCESS_META; ServerCrashProcedure 
> server=asf903.gq1.ygridcore.net,59608,1518575711969, splitWal=true, 
> meta=true; Processing hbase:meta that was on 
> asf903.gq1.ygridcore.net,59608,1518575711969
> 2018-02-14 02:35:19,807 INFO  [PEWorker-6] 
> procedure2.ProcedureExecutor(1498): Initialized subprocedures=[{pid=12, 
> ppid=10, state=RUNNABLE:RECOVER_META_SPLIT_LOGS; RecoverMetaProcedure 
> failedMetaServer=asf903.gq1.ygridcore.net,59608,1518575711969, splitWal=true}]
> 2018-02-14 02:35:19,811 DEBUG [Thread-214] procedure2.ProcedureExecutor(868): 
> Stored pid=11, state=RUNNABLE:MOVE_REGION_UNASSIGN; MoveRegionProcedure 
> hri=hbase:meta,,1.1588230740, 
> source=asf903.gq1.ygridcore.net,59608,1518575711969, destination=
> 2018-02-14 02:35:19,813 INFO  [PEWorker-8] 
> procedure.MasterProcedureScheduler(813): pid=11, 
> state=RUNNABLE:MOVE_REGION_UNASSIGN; MoveRegionProcedure 
> hri=hbase:meta,,1.1588230740, 
> source=asf903.gq1.ygridcore.net,59608,1518575711969, destination= hbase:meta 
> hbase:meta,,1.1588230740
> 2018-02-14 02:35:19,814 INFO  [PEWorker-8] 
> procedure2.ProcedureExecutor(1498): Initialized subprocedures=[{pid=14, 
> ppid=11, state=RUNNABLE:REGION_TRANSITION_DISPATCH; UnassignProcedure 
> table=hbase:meta, region=1588230740, 
> server=asf903.gq1.ygridcore.net,59608,1518575711969}]
> 2018-02-14 02:35:19,831 DEBUG [Thread-214] procedure2.ProcedureExecutor(868): 
> Stored pid=13, state=RUNNABLE:MOVE_REGION_UNASSIGN; MoveRegionProcedure 
> hri=hbase:namespace,,1518575716296.e52a160b3f3a57ab50d710eba62d9b15., 
> source=asf903.gq1.ygridcore.net,59608,1518575711969, destination=
> 2018-02-14 02:35:19,833 INFO  [PEWorker-10] 
> procedure.MasterProcedureScheduler(813): pid=13, 
> state=RUNNABLE:MOVE_REGION_UNASSIGN; MoveRegionProcedure 
> hri=hbase:namespace,,1518575716296.e52a160b3f3a57ab50d710eba62d9b15., 
> source=asf903.gq1.ygridcore.net,59608,1518575711969, destination= 
> hbase:namespace 
> hbase:namespace,,1518575716296.e52a160b3f3a57ab50d710eba62d9b15.
> 2018-02-14 02:35:19,837 INFO  [PEWorker-10] 
> procedure2.ProcedureExecutor(1498): Initialized subprocedures=[{pid=15, 
> ppid=13, state=RUNNABLE:REGION_TRANSITION_DISPATCH; UnassignProcedure 
> table=hbase:namespace, region=e52a160b3f3a57ab50d710eba62d9b15, 
> server=asf903.gq1.ygridcore.net,59608,1518575711969}]
> ....
> {code}
> Here is the failure of the unassign:
> {code}
> 2018-02-14 02:35:19,944 WARN  [PEWorker-8] 
> assignment.RegionTransitionProcedure(187): Remote call failed pid=14, 
> ppid=11, state=RUNNABLE:REGION_TRANSITION_DISPATCH; UnassignProcedure 
> table=hbase:meta, region=1588230740, 
> server=asf903.gq1.ygridcore.net,59608,1518575711969; rit=CLOSING, 
> location=asf903.gq1.ygridcore.net,59608,1518575711969; exception=pid=14, 
> ppid=11, state=RUNNABLE:REGION_TRANSITION_DISPATCH; UnassignProcedure 
> table=hbase:meta, region=1588230740, 
> server=asf903.gq1.ygridcore.net,59608,1518575711969 to 
> asf903.gq1.ygridcore.net,59608,1518575711969
> 2018-02-14 02:35:19,945 WARN  [PEWorker-8] assignment.UnassignProcedure(245): 
> Expiring server pid=14, ppid=11, state=RUNNABLE:REGION_TRANSITION_DISPATCH; 
> UnassignProcedure table=hbase:meta, region=1588230740, 
> server=asf903.gq1.ygridcore.net,59608,1518575711969; rit=CLOSING, 
> location=asf903.gq1.ygridcore.net,59608,1518575711969, 
> exception=org.apache.hadoop.hbase.master.assignment.FailedRemoteDispatchException:
>  pid=14, ppid=11, state=RUNNABLE:REGION_TRANSITION_DISPATCH; 
> UnassignProcedure table=hbase:meta, region=1588230740, 
> server=asf903.gq1.ygridcore.net,59608,1518575711969 to 
> asf903.gq1.ygridcore.net,59608,1518575711969
> 2018-02-14 02:35:19,945 WARN  [PEWorker-8] master.ServerManager(579): 
> Expiration of asf903.gq1.ygridcore.net,59608,1518575711969 but server 
> shutdown already in progress
> 2018-02-14 02:35:19,949 DEBUG [RS-EventLoopGroup-4-22] ipc.FailedServers(56): 
> Added failed server with address asf903.gq1.ygridcore.net/67.195.81.147:59608 
> to list caused by 
> org.apache.hbase.thirdparty.io.netty.channel.AbstractChannel$AnnotatedConnectException:
>  syscall:getsockopt(..) failed: Connection refused: 
> asf903.gq1.ygridcore.net/67.195.81.147:59608
> 2018-02-14 02:35:20,064 DEBUG [MultiHConnection-shared--pool8-t5] 
> ipc.AbstractRpcClient(349): Not trying to connect to 
> asf903.gq1.ygridcore.net/67.195.81.147:59608 this server is in the failed 
> servers list
> ...
> {code}
> ... and we keep trying:
> {code}
> 2018-02-14 02:35:22,093 INFO  [MultiHConnection-shared--pool8-t5] 
> client.AsyncRequestFutureImpl(767): id=6, table=hbase:meta, attempt=6/46 
> failed=1ops, last exception=java.net.ConnectException: Call to 
> asf903.gq1.ygridcore.net/67.195.81.147:59608 failed on connection exception: 
> org.apache.hbase.thirdparty.io.netty.channel.AbstractChannel$AnnotatedConnectException:
>  syscall:getsockopt(..) failed: Connection refused: 
> asf903.gq1.ygridcore.net/67.195.81.147:59608 on 
> asf903.gq1.ygridcore.net,59608,1518575711969, tracking started null, retrying 
> after=2000ms, replay=1ops
> 2018-02-14 02:35:24,095 DEBUG [MultiHConnection-shared--pool8-t5] 
> ipc.NettyRpcConnection(254): Connecting to 
> asf903.gq1.ygridcore.net/67.195.81.147:59608
> 2018-02-14 02:35:24,110 DEBUG [RS-EventLoopGroup-4-24] ipc.FailedServers(56): 
> Added failed server with address asf903.gq1.ygridcore.net/67.195.81.147:59608 
> to list caused by 
> org.apache.hbase.thirdparty.io.netty.channel.AbstractChannel$AnnotatedConnectException:
>  syscall:getsockopt(..) failed: Connection refused: 
> asf903.gq1.ygridcore.net/67.195.81.147:59608
> 2018-02-14 02:35:24,112 INFO  [MultiHConnection-shared--pool8-t5] 
> client.AsyncRequestFutureImpl(767): id=6, table=hbase:meta, attempt=7/46 
> failed=1ops, last exception=java.net.ConnectException: Call to 
> asf903.gq1.ygridcore.net/67.195.81.147:59608 failed on connection exception: 
> org.apache.hbase.thirdparty.io.netty.channel.AbstractChannel$AnnotatedConnectException:
>  syscall:getsockopt(..) failed: Connection refused: 
> asf903.gq1.ygridcore.net/67.195.81.147:59608 on 
> asf903.gq1.ygridcore.net,59608,1518575711969, tracking started null, retrying 
> after=4005ms, replay=1ops
> 2018-02-14 02:35:28,120 DEBUG [MultiHConnection-shared--pool8-t5] 
> ipc.NettyRpcConnection(254): Connecting to 
> asf903.gq1.ygridcore.net/67.195.81.147:59608
> {code}
> Test set: 
> org.apache.hadoop.hbase.security.visibility.TestVisibilityLabelsWithDefaultVisLabelService
> -------------------------------------------------------------------------------
> Tests run: 2, Failures: 0, Errors: 2, Skipped: 0, Time elapsed: 600.83 s <<< 
> FAILURE! - in 
> org.apache.hadoop.hbase.security.visibility.TestVisibilityLabelsWithDefaultVisLabelService
> org.apache.hadoop.hbase.security.visibility.TestVisibilityLabelsWithDefaultVisLabelService
>   Time elapsed: 583.188 s  <<< ERROR!
> org.junit.runners.model.TestTimedOutException: test timed out after 600 
> seconds
>       at 
> org.apache.hadoop.hbase.security.visibility.TestVisibilityLabelsWithDefaultVisLabelService.testAddVisibilityLabelsOnRSRestart(TestVisibilityLabelsWithDefaultVisLabelService.java:126)
> org.apache.hadoop.hbase.security.visibility.TestVisibilityLabelsWithDefaultVisLabelService
>   Time elapsed: 583.227 s  <<< ERROR!
> java.lang.Exception: Appears to be stuck in thread RS-EventLoopGroup-7-2



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Reply via email to