[
https://issues.apache.org/jira/browse/HBASE-19998?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
stack updated HBASE-19998:
--------------------------
Attachment: hbase-19988.master.001.patch
> Flakey TestVisibilityLabelsWithDefaultVisLabelService
> -----------------------------------------------------
>
> Key: HBASE-19998
> URL: https://issues.apache.org/jira/browse/HBASE-19998
> Project: HBase
> Issue Type: Bug
> Components: flakey, test
> Reporter: stack
> Assignee: stack
> Priority: Major
> Fix For: 2.0.0-beta-2
>
> Attachments: hbase-19988.master.001.patch
>
>
> This is a good one. Its a timeout and though it has lots of test methods, the
> problem is one of them gets stuck. The test method kills a RegionServers then
> starts a new one. Usually all works out fine but the odd time there is this
> unexplained MOVE that gets interjected just as ServerCrashProcedure starts
> up. hbase:meta gets stuck (perhaps this is what is being referred to here:
> https://issues.apache.org/jira/browse/HBASE-19929?focusedCommentId=16356906&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16356906).
> It is trying to run the MOVE by first unassigning from the server that has
> just crashed. It never succeeds. Need to fix this. Need to figure where these
> Move operations are coming from too. Let me add some debug.
> See here how we are well into ServerCrashProcedure... and then two MOVEs
> cut-in... for hbase:meta and for namespace:
> {code}
> ....
> 2018-02-14 02:35:19,806 DEBUG [PEWorker-6]
> procedure.ServerCrashProcedure(192): pid=10,
> state=RUNNABLE:SERVER_CRASH_PROCESS_META; ServerCrashProcedure
> server=asf903.gq1.ygridcore.net,59608,1518575711969, splitWal=true,
> meta=true; Processing hbase:meta that was on
> asf903.gq1.ygridcore.net,59608,1518575711969
> 2018-02-14 02:35:19,807 INFO [PEWorker-6]
> procedure2.ProcedureExecutor(1498): Initialized subprocedures=[{pid=12,
> ppid=10, state=RUNNABLE:RECOVER_META_SPLIT_LOGS; RecoverMetaProcedure
> failedMetaServer=asf903.gq1.ygridcore.net,59608,1518575711969, splitWal=true}]
> 2018-02-14 02:35:19,811 DEBUG [Thread-214] procedure2.ProcedureExecutor(868):
> Stored pid=11, state=RUNNABLE:MOVE_REGION_UNASSIGN; MoveRegionProcedure
> hri=hbase:meta,,1.1588230740,
> source=asf903.gq1.ygridcore.net,59608,1518575711969, destination=
> 2018-02-14 02:35:19,813 INFO [PEWorker-8]
> procedure.MasterProcedureScheduler(813): pid=11,
> state=RUNNABLE:MOVE_REGION_UNASSIGN; MoveRegionProcedure
> hri=hbase:meta,,1.1588230740,
> source=asf903.gq1.ygridcore.net,59608,1518575711969, destination= hbase:meta
> hbase:meta,,1.1588230740
> 2018-02-14 02:35:19,814 INFO [PEWorker-8]
> procedure2.ProcedureExecutor(1498): Initialized subprocedures=[{pid=14,
> ppid=11, state=RUNNABLE:REGION_TRANSITION_DISPATCH; UnassignProcedure
> table=hbase:meta, region=1588230740,
> server=asf903.gq1.ygridcore.net,59608,1518575711969}]
> 2018-02-14 02:35:19,831 DEBUG [Thread-214] procedure2.ProcedureExecutor(868):
> Stored pid=13, state=RUNNABLE:MOVE_REGION_UNASSIGN; MoveRegionProcedure
> hri=hbase:namespace,,1518575716296.e52a160b3f3a57ab50d710eba62d9b15.,
> source=asf903.gq1.ygridcore.net,59608,1518575711969, destination=
> 2018-02-14 02:35:19,833 INFO [PEWorker-10]
> procedure.MasterProcedureScheduler(813): pid=13,
> state=RUNNABLE:MOVE_REGION_UNASSIGN; MoveRegionProcedure
> hri=hbase:namespace,,1518575716296.e52a160b3f3a57ab50d710eba62d9b15.,
> source=asf903.gq1.ygridcore.net,59608,1518575711969, destination=
> hbase:namespace
> hbase:namespace,,1518575716296.e52a160b3f3a57ab50d710eba62d9b15.
> 2018-02-14 02:35:19,837 INFO [PEWorker-10]
> procedure2.ProcedureExecutor(1498): Initialized subprocedures=[{pid=15,
> ppid=13, state=RUNNABLE:REGION_TRANSITION_DISPATCH; UnassignProcedure
> table=hbase:namespace, region=e52a160b3f3a57ab50d710eba62d9b15,
> server=asf903.gq1.ygridcore.net,59608,1518575711969}]
> ....
> {code}
> Here is the failure of the unassign:
> {code}
> 2018-02-14 02:35:19,944 WARN [PEWorker-8]
> assignment.RegionTransitionProcedure(187): Remote call failed pid=14,
> ppid=11, state=RUNNABLE:REGION_TRANSITION_DISPATCH; UnassignProcedure
> table=hbase:meta, region=1588230740,
> server=asf903.gq1.ygridcore.net,59608,1518575711969; rit=CLOSING,
> location=asf903.gq1.ygridcore.net,59608,1518575711969; exception=pid=14,
> ppid=11, state=RUNNABLE:REGION_TRANSITION_DISPATCH; UnassignProcedure
> table=hbase:meta, region=1588230740,
> server=asf903.gq1.ygridcore.net,59608,1518575711969 to
> asf903.gq1.ygridcore.net,59608,1518575711969
> 2018-02-14 02:35:19,945 WARN [PEWorker-8] assignment.UnassignProcedure(245):
> Expiring server pid=14, ppid=11, state=RUNNABLE:REGION_TRANSITION_DISPATCH;
> UnassignProcedure table=hbase:meta, region=1588230740,
> server=asf903.gq1.ygridcore.net,59608,1518575711969; rit=CLOSING,
> location=asf903.gq1.ygridcore.net,59608,1518575711969,
> exception=org.apache.hadoop.hbase.master.assignment.FailedRemoteDispatchException:
> pid=14, ppid=11, state=RUNNABLE:REGION_TRANSITION_DISPATCH;
> UnassignProcedure table=hbase:meta, region=1588230740,
> server=asf903.gq1.ygridcore.net,59608,1518575711969 to
> asf903.gq1.ygridcore.net,59608,1518575711969
> 2018-02-14 02:35:19,945 WARN [PEWorker-8] master.ServerManager(579):
> Expiration of asf903.gq1.ygridcore.net,59608,1518575711969 but server
> shutdown already in progress
> 2018-02-14 02:35:19,949 DEBUG [RS-EventLoopGroup-4-22] ipc.FailedServers(56):
> Added failed server with address asf903.gq1.ygridcore.net/67.195.81.147:59608
> to list caused by
> org.apache.hbase.thirdparty.io.netty.channel.AbstractChannel$AnnotatedConnectException:
> syscall:getsockopt(..) failed: Connection refused:
> asf903.gq1.ygridcore.net/67.195.81.147:59608
> 2018-02-14 02:35:20,064 DEBUG [MultiHConnection-shared--pool8-t5]
> ipc.AbstractRpcClient(349): Not trying to connect to
> asf903.gq1.ygridcore.net/67.195.81.147:59608 this server is in the failed
> servers list
> ...
> {code}
> ... and we keep trying:
> {code}
> 2018-02-14 02:35:22,093 INFO [MultiHConnection-shared--pool8-t5]
> client.AsyncRequestFutureImpl(767): id=6, table=hbase:meta, attempt=6/46
> failed=1ops, last exception=java.net.ConnectException: Call to
> asf903.gq1.ygridcore.net/67.195.81.147:59608 failed on connection exception:
> org.apache.hbase.thirdparty.io.netty.channel.AbstractChannel$AnnotatedConnectException:
> syscall:getsockopt(..) failed: Connection refused:
> asf903.gq1.ygridcore.net/67.195.81.147:59608 on
> asf903.gq1.ygridcore.net,59608,1518575711969, tracking started null, retrying
> after=2000ms, replay=1ops
> 2018-02-14 02:35:24,095 DEBUG [MultiHConnection-shared--pool8-t5]
> ipc.NettyRpcConnection(254): Connecting to
> asf903.gq1.ygridcore.net/67.195.81.147:59608
> 2018-02-14 02:35:24,110 DEBUG [RS-EventLoopGroup-4-24] ipc.FailedServers(56):
> Added failed server with address asf903.gq1.ygridcore.net/67.195.81.147:59608
> to list caused by
> org.apache.hbase.thirdparty.io.netty.channel.AbstractChannel$AnnotatedConnectException:
> syscall:getsockopt(..) failed: Connection refused:
> asf903.gq1.ygridcore.net/67.195.81.147:59608
> 2018-02-14 02:35:24,112 INFO [MultiHConnection-shared--pool8-t5]
> client.AsyncRequestFutureImpl(767): id=6, table=hbase:meta, attempt=7/46
> failed=1ops, last exception=java.net.ConnectException: Call to
> asf903.gq1.ygridcore.net/67.195.81.147:59608 failed on connection exception:
> org.apache.hbase.thirdparty.io.netty.channel.AbstractChannel$AnnotatedConnectException:
> syscall:getsockopt(..) failed: Connection refused:
> asf903.gq1.ygridcore.net/67.195.81.147:59608 on
> asf903.gq1.ygridcore.net,59608,1518575711969, tracking started null, retrying
> after=4005ms, replay=1ops
> 2018-02-14 02:35:28,120 DEBUG [MultiHConnection-shared--pool8-t5]
> ipc.NettyRpcConnection(254): Connecting to
> asf903.gq1.ygridcore.net/67.195.81.147:59608
> {code}
> Test set:
> org.apache.hadoop.hbase.security.visibility.TestVisibilityLabelsWithDefaultVisLabelService
> -------------------------------------------------------------------------------
> Tests run: 2, Failures: 0, Errors: 2, Skipped: 0, Time elapsed: 600.83 s <<<
> FAILURE! - in
> org.apache.hadoop.hbase.security.visibility.TestVisibilityLabelsWithDefaultVisLabelService
> org.apache.hadoop.hbase.security.visibility.TestVisibilityLabelsWithDefaultVisLabelService
> Time elapsed: 583.188 s <<< ERROR!
> org.junit.runners.model.TestTimedOutException: test timed out after 600
> seconds
> at
> org.apache.hadoop.hbase.security.visibility.TestVisibilityLabelsWithDefaultVisLabelService.testAddVisibilityLabelsOnRSRestart(TestVisibilityLabelsWithDefaultVisLabelService.java:126)
> org.apache.hadoop.hbase.security.visibility.TestVisibilityLabelsWithDefaultVisLabelService
> Time elapsed: 583.227 s <<< ERROR!
> java.lang.Exception: Appears to be stuck in thread RS-EventLoopGroup-7-2
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)