[ https://issues.apache.org/jira/browse/HBASE-19998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16364511#comment-16364511 ]
stack commented on HBASE-19998:
-------------------------------

Debug turned up the culprit 'mover':
{code}
2018-02-14 12:35:07,975 DEBUG [PEWorker-6] procedure.ServerCrashProcedure(217): Splitting WALs pid=9, state=RUNNABLE:SERVER_CRASH_SPLIT_LOGS; ServerCrashProcedure server=asf903.gq1.ygridcore.net,56387,1518611700720, splitWal=true, meta=false
2018-02-14 12:35:07,978 INFO [Thread-213] assignment.MoveRegionProcedure(57): REMOVE
java.lang.Throwable: REMOVE: Just to see who is calling Move!!!
	at org.apache.hadoop.hbase.master.assignment.MoveRegionProcedure.<init>(MoveRegionProcedure.java:57)
	at org.apache.hadoop.hbase.master.assignment.AssignmentManager.createMoveRegionProcedure(AssignmentManager.java:740)
	at org.apache.hadoop.hbase.master.assignment.AssignmentManager.moveAsync(AssignmentManager.java:545)
	at org.apache.hadoop.hbase.master.assignment.AssignmentManager.lambda$checkIfShouldMoveSystemRegionAsync$0(AssignmentManager.java:481)
	at java.lang.Thread.run(Thread.java:748)
2018-02-14 12:35:07,986 DEBUG [PEWorker-6] master.MasterWalManager(257): Renamed region directory: hdfs://localhost:39302/user/jenkins/test-data/4b64143d-6b3f-4516-9efb-0c6c1a438493/WALs/asf903.gq1.ygridcore.net,56387,1518611700720-splitting
2018-02-14 12:35:07,986 INFO [PEWorker-6] master.SplitLogManager(461): dead splitlog workers [asf903.gq1.ygridcore.net,56387,1518611700720]
2018-02-14 12:35:07,986 DEBUG [Time-limited test-EventThread] procedure2.ProcedureExecutor(868): Stored pid=10, state=RUNNABLE:SERVER_CRASH_START; ServerCrashProcedure server=asf903.gq1.ygridcore.net,51305,1518611700792, splitWal=true, meta=true
2018-02-14 12:35:07,987 DEBUG [Time-limited test-EventThread] assignment.AssignmentManager(1382): Added=asf903.gq1.ygridcore.net,51305,1518611700792 to dead servers, submitted shutdown handler to be executed meta=true
2018-02-14 12:35:07,988 DEBUG [Time-limited test-EventThread] zookeeper.ZKUtil(355): regionserver:35367-0x161944f90b6000c, quorum=localhost:59400, baseZNode=/hbase Set watcher on existing znode=/hbase/rs/asf903.gq1.ygridcore.net,35367,1518611707634
2018-02-14 12:35:07,988 INFO [PEWorker-8] procedure.ServerCrashProcedure(118): Start pid=10, state=RUNNABLE:SERVER_CRASH_START; ServerCrashProcedure server=asf903.gq1.ygridcore.net,51305,1518611700792, splitWal=true, meta=true
2018-02-14 12:35:07,992 INFO [PEWorker-6] master.SplitLogManager(241): Started splitting 1 logs in [hdfs://localhost:39302/user/jenkins/test-data/4b64143d-6b3f-4516-9efb-0c6c1a438493/WALs/asf903.gq1.ygridcore.net,56387,1518611700720-splitting] for [asf903.gq1.ygridcore.net,56387,1518611700720]
2018-02-14 12:35:07,999 DEBUG [Thread-213] procedure2.ProcedureExecutor(868): Stored pid=11, state=RUNNABLE:MOVE_REGION_UNASSIGN; MoveRegionProcedure hri=hbase:meta,,1.1588230740, source=asf903.gq1.ygridcore.net,51305,1518611700792, destination=
2018-02-14 12:35:08,003 DEBUG [PEWorker-9] procedure.ServerCrashProcedure(192): pid=10, state=RUNNABLE:SERVER_CRASH_PROCESS_META; ServerCrashProcedure server=asf903.gq1.ygridcore.net,51305,1518611700792, splitWal=true, meta=true; processing hbase:meta
2018-02-14 12:35:08,003 INFO [PEWorker-10] procedure.MasterProcedureScheduler(813): pid=11, state=RUNNABLE:MOVE_REGION_UNASSIGN; MoveRegionProcedure hri=hbase:meta,,1.1588230740, source=asf903.gq1.ygridcore.net,51305,1518611700792, destination=, table=hbase:meta, hbase:meta,,1.1588230740
2018-02-14 12:35:08,003 INFO [PEWorker-9] procedure2.ProcedureExecutor(1498): Initialized subprocedures=[{pid=12, ppid=10, state=RUNNABLE:RECOVER_META_SPLIT_LOGS; RecoverMetaProcedure failedMetaServer=asf903.gq1.ygridcore.net,51305,1518611700792, splitWal=true}]
2018-02-14 12:35:08,005 INFO [PEWorker-10] procedure2.ProcedureExecutor(1498): Initialized subprocedures=[{pid=13, ppid=11, state=RUNNABLE:REGION_TRANSITION_DISPATCH; UnassignProcedure table=hbase:meta, region=1588230740, server=asf903.gq1.ygridcore.net,51305,1518611700792}]
2018-02-14 12:35:08,006 INFO [Thread-213] assignment.MoveRegionProcedure(57): REMOVE
java.lang.Throwable: REMOVE: Just to see who is calling Move!!!
	at org.apache.hadoop.hbase.master.assignment.MoveRegionProcedure.<init>(MoveRegionProcedure.java:57)
	at org.apache.hadoop.hbase.master.assignment.AssignmentManager.createMoveRegionProcedure(AssignmentManager.java:740)
	at org.apache.hadoop.hbase.master.assignment.AssignmentManager.moveAsync(AssignmentManager.java:545)
	at org.apache.hadoop.hbase.master.assignment.AssignmentManager.lambda$checkIfShouldMoveSystemRegionAsync$0(AssignmentManager.java:488)
	at java.lang.Thread.run(Thread.java:748)
{code}

It's the facility added over in HBASE-18494 that is cutting in. In this case two RegionServers are killed and a new one started. The HBASE-18494 facility cuts in with its moves even though ServerCrashProcedure is already taking care of the crashed server. There is that, and then there is the failure to give up when the unassign times out against the remote server. [~uagashe] FYI.

> Flakey TestVisibilityLabelsWithDefaultVisLabelService
> -----------------------------------------------------
>
>                 Key: HBASE-19998
>                 URL: https://issues.apache.org/jira/browse/HBASE-19998
>             Project: HBase
>          Issue Type: Bug
>          Components: flakey, test
>            Reporter: stack
>            Assignee: stack
>          Priority: Major
>             Fix For: 2.0.0-beta-2
>
>         Attachments: hbase-19988.master.001.patch
>
>
> This is a good one. It's a timeout, and though the test has lots of test methods, the problem is that one of them gets stuck. The test method kills a RegionServer then starts a new one. Usually all works out fine, but the odd time there is this unexplained MOVE that gets interjected just as ServerCrashProcedure starts up. hbase:meta gets stuck (perhaps this is what is being referred to here: https://issues.apache.org/jira/browse/HBASE-19929?focusedCommentId=16356906&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16356906). It is trying to run the MOVE by first unassigning from the server that has just crashed. It never succeeds.
> Need to fix this. Need to figure where these Move operations are coming from too. Let me add some debug.
> See here how we are well into ServerCrashProcedure... and then two MOVEs cut in... for hbase:meta and for namespace:
> {code}
> ....
> 2018-02-14 02:35:19,806 DEBUG [PEWorker-6] procedure.ServerCrashProcedure(192): pid=10, state=RUNNABLE:SERVER_CRASH_PROCESS_META; ServerCrashProcedure server=asf903.gq1.ygridcore.net,59608,1518575711969, splitWal=true, meta=true; Processing hbase:meta that was on asf903.gq1.ygridcore.net,59608,1518575711969
> 2018-02-14 02:35:19,807 INFO [PEWorker-6] procedure2.ProcedureExecutor(1498): Initialized subprocedures=[{pid=12, ppid=10, state=RUNNABLE:RECOVER_META_SPLIT_LOGS; RecoverMetaProcedure failedMetaServer=asf903.gq1.ygridcore.net,59608,1518575711969, splitWal=true}]
> 2018-02-14 02:35:19,811 DEBUG [Thread-214] procedure2.ProcedureExecutor(868): Stored pid=11, state=RUNNABLE:MOVE_REGION_UNASSIGN; MoveRegionProcedure hri=hbase:meta,,1.1588230740, source=asf903.gq1.ygridcore.net,59608,1518575711969, destination=
> 2018-02-14 02:35:19,813 INFO [PEWorker-8] procedure.MasterProcedureScheduler(813): pid=11, state=RUNNABLE:MOVE_REGION_UNASSIGN; MoveRegionProcedure hri=hbase:meta,,1.1588230740, source=asf903.gq1.ygridcore.net,59608,1518575711969, destination= hbase:meta hbase:meta,,1.1588230740
> 2018-02-14 02:35:19,814 INFO [PEWorker-8] procedure2.ProcedureExecutor(1498): Initialized subprocedures=[{pid=14, ppid=11, state=RUNNABLE:REGION_TRANSITION_DISPATCH; UnassignProcedure table=hbase:meta, region=1588230740, server=asf903.gq1.ygridcore.net,59608,1518575711969}]
> 2018-02-14 02:35:19,831 DEBUG [Thread-214] procedure2.ProcedureExecutor(868): Stored pid=13, state=RUNNABLE:MOVE_REGION_UNASSIGN; MoveRegionProcedure hri=hbase:namespace,,1518575716296.e52a160b3f3a57ab50d710eba62d9b15., source=asf903.gq1.ygridcore.net,59608,1518575711969, destination=
> 2018-02-14 02:35:19,833 INFO [PEWorker-10] procedure.MasterProcedureScheduler(813): pid=13, state=RUNNABLE:MOVE_REGION_UNASSIGN; MoveRegionProcedure hri=hbase:namespace,,1518575716296.e52a160b3f3a57ab50d710eba62d9b15., source=asf903.gq1.ygridcore.net,59608,1518575711969, destination= hbase:namespace hbase:namespace,,1518575716296.e52a160b3f3a57ab50d710eba62d9b15.
> 2018-02-14 02:35:19,837 INFO [PEWorker-10] procedure2.ProcedureExecutor(1498): Initialized subprocedures=[{pid=15, ppid=13, state=RUNNABLE:REGION_TRANSITION_DISPATCH; UnassignProcedure table=hbase:namespace, region=e52a160b3f3a57ab50d710eba62d9b15, server=asf903.gq1.ygridcore.net,59608,1518575711969}]
> ....
> {code}
> Here is the failure of the unassign:
> {code}
> 2018-02-14 02:35:19,944 WARN [PEWorker-8] assignment.RegionTransitionProcedure(187): Remote call failed pid=14, ppid=11, state=RUNNABLE:REGION_TRANSITION_DISPATCH; UnassignProcedure table=hbase:meta, region=1588230740, server=asf903.gq1.ygridcore.net,59608,1518575711969; rit=CLOSING, location=asf903.gq1.ygridcore.net,59608,1518575711969; exception=pid=14, ppid=11, state=RUNNABLE:REGION_TRANSITION_DISPATCH; UnassignProcedure table=hbase:meta, region=1588230740, server=asf903.gq1.ygridcore.net,59608,1518575711969 to asf903.gq1.ygridcore.net,59608,1518575711969
> 2018-02-14 02:35:19,945 WARN [PEWorker-8] assignment.UnassignProcedure(245): Expiring server pid=14, ppid=11, state=RUNNABLE:REGION_TRANSITION_DISPATCH; UnassignProcedure table=hbase:meta, region=1588230740, server=asf903.gq1.ygridcore.net,59608,1518575711969; rit=CLOSING, location=asf903.gq1.ygridcore.net,59608,1518575711969, exception=org.apache.hadoop.hbase.master.assignment.FailedRemoteDispatchException: pid=14, ppid=11, state=RUNNABLE:REGION_TRANSITION_DISPATCH; UnassignProcedure table=hbase:meta, region=1588230740, server=asf903.gq1.ygridcore.net,59608,1518575711969 to asf903.gq1.ygridcore.net,59608,1518575711969
> 2018-02-14 02:35:19,945 WARN [PEWorker-8] master.ServerManager(579): Expiration of asf903.gq1.ygridcore.net,59608,1518575711969 but server shutdown already in progress
> 2018-02-14 02:35:19,949 DEBUG [RS-EventLoopGroup-4-22] ipc.FailedServers(56): Added failed server with address asf903.gq1.ygridcore.net/67.195.81.147:59608 to list caused by org.apache.hbase.thirdparty.io.netty.channel.AbstractChannel$AnnotatedConnectException: syscall:getsockopt(..) failed: Connection refused: asf903.gq1.ygridcore.net/67.195.81.147:59608
> 2018-02-14 02:35:20,064 DEBUG [MultiHConnection-shared--pool8-t5] ipc.AbstractRpcClient(349): Not trying to connect to asf903.gq1.ygridcore.net/67.195.81.147:59608 this server is in the failed servers list
> ...
> {code}
> ... and we keep trying:
> {code}
> 2018-02-14 02:35:22,093 INFO [MultiHConnection-shared--pool8-t5] client.AsyncRequestFutureImpl(767): id=6, table=hbase:meta, attempt=6/46 failed=1ops, last exception=java.net.ConnectException: Call to asf903.gq1.ygridcore.net/67.195.81.147:59608 failed on connection exception: org.apache.hbase.thirdparty.io.netty.channel.AbstractChannel$AnnotatedConnectException: syscall:getsockopt(..) failed: Connection refused: asf903.gq1.ygridcore.net/67.195.81.147:59608 on asf903.gq1.ygridcore.net,59608,1518575711969, tracking started null, retrying after=2000ms, replay=1ops
> 2018-02-14 02:35:24,095 DEBUG [MultiHConnection-shared--pool8-t5] ipc.NettyRpcConnection(254): Connecting to asf903.gq1.ygridcore.net/67.195.81.147:59608
> 2018-02-14 02:35:24,110 DEBUG [RS-EventLoopGroup-4-24] ipc.FailedServers(56): Added failed server with address asf903.gq1.ygridcore.net/67.195.81.147:59608 to list caused by org.apache.hbase.thirdparty.io.netty.channel.AbstractChannel$AnnotatedConnectException: syscall:getsockopt(..) failed: Connection refused: asf903.gq1.ygridcore.net/67.195.81.147:59608
> 2018-02-14 02:35:24,112 INFO [MultiHConnection-shared--pool8-t5] client.AsyncRequestFutureImpl(767): id=6, table=hbase:meta, attempt=7/46 failed=1ops, last exception=java.net.ConnectException: Call to asf903.gq1.ygridcore.net/67.195.81.147:59608 failed on connection exception: org.apache.hbase.thirdparty.io.netty.channel.AbstractChannel$AnnotatedConnectException: syscall:getsockopt(..) failed: Connection refused: asf903.gq1.ygridcore.net/67.195.81.147:59608 on asf903.gq1.ygridcore.net,59608,1518575711969, tracking started null, retrying after=4005ms, replay=1ops
> 2018-02-14 02:35:28,120 DEBUG [MultiHConnection-shared--pool8-t5] ipc.NettyRpcConnection(254): Connecting to asf903.gq1.ygridcore.net/67.195.81.147:59608
> {code}
> Test set: org.apache.hadoop.hbase.security.visibility.TestVisibilityLabelsWithDefaultVisLabelService
> -------------------------------------------------------------------------------
> Tests run: 2, Failures: 0, Errors: 2, Skipped: 0, Time elapsed: 600.83 s <<< FAILURE! - in org.apache.hadoop.hbase.security.visibility.TestVisibilityLabelsWithDefaultVisLabelService
> org.apache.hadoop.hbase.security.visibility.TestVisibilityLabelsWithDefaultVisLabelService  Time elapsed: 583.188 s <<< ERROR!
> org.junit.runners.model.TestTimedOutException: test timed out after 600 seconds
> 	at org.apache.hadoop.hbase.security.visibility.TestVisibilityLabelsWithDefaultVisLabelService.testAddVisibilityLabelsOnRSRestart(TestVisibilityLabelsWithDefaultVisLabelService.java:126)
> org.apache.hadoop.hbase.security.visibility.TestVisibilityLabelsWithDefaultVisLabelService  Time elapsed: 583.227 s <<< ERROR!
> java.lang.Exception: Appears to be stuck in thread RS-EventLoopGroup-7-2

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
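A minimal sketch of the first fix the comment argues for: the HBASE-18494 check should not schedule a MOVE for a system region whose source server is already owned by a ServerCrashProcedure, since SCP will reassign the region itself. This is an illustrative stand-alone stub, not the real AssignmentManager/checkIfShouldMoveSystemRegionAsync code; all class and method names here are hypothetical.

```java
import java.util.HashSet;
import java.util.Set;

// Sketch only: stub standing in for the master-side bookkeeping. The real
// HBase classes (AssignmentManager, ServerCrashProcedure) are not used here.
public class MoveGuardSketch {
    // Servers currently being handled by a ServerCrashProcedure.
    private final Set<String> serversUnderCrashProcessing = new HashSet<>();

    void onServerCrash(String serverName) {
        serversUnderCrashProcessing.add(serverName);
    }

    // Only allow a system-region MOVE if the source server is NOT already
    // being processed by an SCP; otherwise the MOVE only conflicts with it.
    boolean shouldMoveSystemRegion(String sourceServer) {
        return !serversUnderCrashProcessing.contains(sourceServer);
    }

    public static void main(String[] args) {
        MoveGuardSketch am = new MoveGuardSketch();
        am.onServerCrash("asf903.gq1.ygridcore.net,51305,1518611700792");
        // SCP owns the crashed server's regions: skip the MOVE.
        System.out.println(am.shouldMoveSystemRegion("asf903.gq1.ygridcore.net,51305,1518611700792"));
        // A live server not under SCP can still be considered for a move.
        System.out.println(am.shouldMoveSystemRegion("asf903.gq1.ygridcore.net,35367,1518611707634"));
    }
}
```

With a guard like this, the unexplained MOVEs of hbase:meta and hbase:namespace in the logs above would simply not be created while the crashed server's SCP is in flight.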
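And a sketch of the second problem: the unassign keeps retrying the remote close against a server whose shutdown is already in progress. One way out is to treat a failed dispatch to a known-dead server as completion, since the region cannot still be open there. Again, this is a hypothetical stand-alone illustration; onDispatchFailure and the state strings are not the real UnassignProcedure API.

```java
import java.util.Set;

// Sketch only: "give up the unassign when the target server is already dead"
// instead of retrying the connection forever, as the logs above show.
public class UnassignGiveUpSketch {
    // Servers whose shutdown/expiration is already in progress.
    private final Set<String> deadServers;

    UnassignGiveUpSketch(Set<String> deadServers) {
        this.deadServers = deadServers;
    }

    // On a failed remote dispatch: if the target is already known dead,
    // complete the unassign; otherwise it may still be alive, so retry.
    String onDispatchFailure(String server) {
        return deadServers.contains(server) ? "COMPLETE" : "RETRY";
    }

    public static void main(String[] args) {
        UnassignGiveUpSketch proc = new UnassignGiveUpSketch(
                Set.of("asf903.gq1.ygridcore.net,59608,1518575711969"));
        // Dead server: finish the unassign rather than loop on ConnectException.
        System.out.println(proc.onDispatchFailure("asf903.gq1.ygridcore.net,59608,1518575711969"));
        // Unknown/live server: a retry is still reasonable.
        System.out.println(proc.onDispatchFailure("asf903.gq1.ygridcore.net,35367,1518611707634"));
    }
}
```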