[
https://issues.apache.org/jira/browse/HBASE-25774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17337057#comment-17337057
]
Duo Zhang commented on HBASE-25774:
-----------------------------------
{noformat}
2021-04-28T06:22:27,806 DEBUG [Time-limited test]
replication.TestSyncReplicationStandbyKillRS(110): Going to verify the result,
1000 records expected
2021-04-28T06:22:27,808 DEBUG [RPCClient-NioEventLoopGroup-7-6]
client.AsyncRegionLocatorHelper(63): Try updating
region=hbase:meta,,1.1588230740, hostname=6839e4a4a176,39473,1619590928211,
seqNum=-1 , the old value is region=hbase:meta,,1.1588230740,
hostname=6839e4a4a176,39473,1619590928211, seqNum=-1,
error=java.net.ConnectException: Call to address=6839e4a4a176:39473null failed
on connection exception:
org.apache.hbase.thirdparty.io.netty.channel.AbstractChannel$AnnotatedConnectException:
Connection refused: 6839e4a4a176/172.17.0.3:39473
2021-04-28T06:22:27,809 DEBUG [RPCClient-NioEventLoopGroup-7-6]
client.AsyncRegionLocatorHelper(71): The actual exception when updating
region=hbase:meta,,1.1588230740, hostname=6839e4a4a176,39473,1619590928211,
seqNum=-1 is java.net.ConnectException: Connection refused
2021-04-28T06:22:27,809 DEBUG [RPCClient-NioEventLoopGroup-7-6]
client.AsyncRegionLocatorHelper(87): Try removing
region=hbase:meta,,1.1588230740, hostname=6839e4a4a176,39473,1619590928211,
seqNum=-1 from cache
2021-04-28T06:22:27,809 DEBUG [RPCClient-NioEventLoopGroup-7-6]
ipc.FailedServers(53): Added failed server with address 6839e4a4a176:39473 to
list caused by
org.apache.hbase.thirdparty.io.netty.channel.AbstractChannel$AnnotatedConnectException:
Connection refused: 6839e4a4a176/172.17.0.3:39473
2021-04-28T06:22:27,823 DEBUG
[RS:3;6839e4a4a176:42643.replicationSource.wal-reader.6839e4a4a176%2C42643%2C1619590943524,1]
wal.ProtobufLogReader(422): EOF at position 381
2021-04-28T06:22:27,848 DEBUG
[RS:4;6839e4a4a176:44305.replicationSource.wal-reader.6839e4a4a176%2C44305%2C1619590944923,1]
wal.ProtobufLogReader(422): EOF at position 83
2021-04-28T06:22:27,894 DEBUG
[RS:5;6839e4a4a176:45961.replicationSource.wal-reader.6839e4a4a176%2C45961%2C1619590947375,1]
wal.ProtobufLogReader(422): EOF at position 83
2021-04-28T06:22:27,916 DEBUG [Async-Client-Retry-Timer-pool-0]
client.ConnectionUtils(548): Start fetching meta region location from registry
2021-04-28T06:22:27,920 DEBUG [RPCClient-NioEventLoopGroup-7-4]
client.ConnectionUtils(556): The fetched meta region location is
[region=hbase:meta,,1.1588230740, hostname=6839e4a4a176,44305,1619590944923,
seqNum=-1]
2021-04-28T06:22:27,921 DEBUG [RPCClient-NioEventLoopGroup-7-4]
ipc.RpcConnection(122): Using SIMPLE authentication for service=ClientService,
sasl=false
2021-04-28T06:22:27,924 INFO [RS-EventLoopGroup-14-2]
ipc.ServerRpcConnection(557): Connection from 172.17.0.3:37390,
version=3.0.0-SNAPSHOT, sasl=false, ugi=jenkins (auth:SIMPLE),
service=ClientService
2021-04-28T06:22:27,932 DEBUG
[RS:3;6839e4a4a176:42643.replicationSource.wal-reader.6839e4a4a176%2C42643%2C1619590943524,1]
wal.ProtobufLogReader(422): EOF at position 381
2021-04-28T06:22:27,955 DEBUG
[RS:4;6839e4a4a176:44305.replicationSource.wal-reader.6839e4a4a176%2C44305%2C1619590944923,1]
wal.ProtobufLogReader(422): EOF at position 83
2021-04-28T06:22:27,965 DEBUG [RPCClient-NioEventLoopGroup-7-10]
client.AsyncNonMetaRegionLocator(355): The fetched location of 'SyncRep',
row='\x00\x00\x00\x00', locateType=CURRENT is
[region=SyncRep,,1619590931098.3c8d57984856848ef8526f4bc5b3fbb8.,
hostname=6839e4a4a176,42643,1619590943524, seqNum=9]
2021-04-28T06:22:27,967 DEBUG [RPCClient-NioEventLoopGroup-7-10]
ipc.RpcConnection(122): Using SIMPLE authentication for service=ClientService,
sasl=false
2021-04-28T06:22:27,970 INFO [RS-EventLoopGroup-13-1]
ipc.ServerRpcConnection(557): Connection from 172.17.0.3:58964,
version=3.0.0-SNAPSHOT, sasl=false, ugi=jenkins (auth:SIMPLE),
service=ClientService
2021-04-28T06:22:27,971 DEBUG
[RpcServer.default.FPBQ.Fifo.handler=2,queue=0,port=42643] ipc.CallRunner(149):
callId: 28 service: ClientService methodName: Get size: 107 connection:
172.17.0.3:58964 deadline: 1619591007970,
exception=org.apache.hadoop.hbase.DoNotRetryIOException:
SyncRep,,1619590931098.3c8d57984856848ef8526f4bc5b3fbb8. is in STANDBY state.
2021-04-28T06:22:28,018 DEBUG
[RS:5;6839e4a4a176:45961.replicationSource.wal-reader.6839e4a4a176%2C45961%2C1619590947375,1]
wal.ProtobufLogReader(422): EOF at position 83
2021-04-28T06:22:28,072 DEBUG
[RS:3;6839e4a4a176:42643.replicationSource.wal-reader.6839e4a4a176%2C42643%2C1619590943524,1]
wal.ProtobufLogReader(422): EOF at position 381
2021-04-28T06:22:28,072 DEBUG
[RS:4;6839e4a4a176:44305.replicationSource.wal-reader.6839e4a4a176%2C44305%2C1619590944923,1]
wal.ProtobufLogReader(422): EOF at position 83
{noformat}
Seems the problem is that, we failed in the verify stage, and then we scheduled
another execution and messed up the output.
Checked the junit output, the errors are
{noformat}
[ERROR]
org.apache.hadoop.hbase.replication.TestSyncReplicationStandbyKillRS.testStandbyKillRegionServer
Time elapsed: 14.416 s <<< ERROR!
org.apache.hadoop.hbase.DoNotRetryIOException:
org.apache.hadoop.hbase.DoNotRetryIOException:
SyncRep,,1619590931098.3c8d57984856848ef8526f4bc5b3fbb8. is in STANDBY state.
at
org.apache.hadoop.hbase.regionserver.RSRpcServices.rejectIfInStandByState(RSRpcServices.java:2546)
at
org.apache.hadoop.hbase.regionserver.RSRpcServices.get(RSRpcServices.java:2568)
at
org.apache.hadoop.hbase.shaded.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:45249)
at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:395)
at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:135)
at
org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:338)
at
org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:318)
at
org.apache.hadoop.hbase.replication.TestSyncReplicationStandbyKillRS.testStandbyKillRegionServer(TestSyncReplicationStandbyKillRS.java:111)
Caused by: org.apache.hadoop.hbase.ipc.RemoteWithExtrasException:
org.apache.hadoop.hbase.DoNotRetryIOException:
SyncRep,,1619590931098.3c8d57984856848ef8526f4bc5b3fbb8. is in STANDBY state.
at
org.apache.hadoop.hbase.regionserver.RSRpcServices.rejectIfInStandByState(RSRpcServices.java:2546)
at
org.apache.hadoop.hbase.regionserver.RSRpcServices.get(RSRpcServices.java:2568)
at
org.apache.hadoop.hbase.shaded.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:45249)
at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:395)
at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:135)
at
org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:338)
at
org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:318)
{noformat}
This means we have incosistent state between master and regionserver, as we
have verified that the peer has already been transit to DA.
Let me dig more.
> TestSyncReplicationStandbyKillRS#testStandbyKillRegionServer is flaky
> ---------------------------------------------------------------------
>
> Key: HBASE-25774
> URL: https://issues.apache.org/jira/browse/HBASE-25774
> Project: HBase
> Issue Type: Improvement
> Reporter: Xiaolin Ha
> Assignee: Duo Zhang
> Priority: Major
>
> [https://ci-hadoop.apache.org/job/HBase/job/HBase-PreCommit-GitHub-PR/job/PR-3025/9/testReport/org.apache.hadoop.hbase.replication/TestSyncReplicationStandbyKillRS/precommit_checks___yetus_jdk8_Hadoop3_checks______/]
> {code:java}
> ...[truncated 391170 chars]...
> 76d634:45149.replicationSource,1] regionserver.HRegionServer(2351): STOPPED:
> Unexpected exception in RS:2;ece3af76d634:45149.replicationSource,1
> 2021-04-11T11:14:40,268 INFO [RS:2;ece3af76d634:45149]
> regionserver.HeapMemoryManager(218): Stopping
> 2021-04-11T11:14:40,268 INFO [MemStoreFlusher.0]
> regionserver.MemStoreFlusher$FlushHandler(384): MemStoreFlusher.0 exiting
> 2021-04-11T11:14:40,268 INFO [RS:2;ece3af76d634:45149]
> flush.RegionServerFlushTableProcedureManager(118): Stopping region server
> flush procedure manager abruptly.
> 2021-04-11T11:14:40,270 INFO [RS:2;ece3af76d634:45149]
> snapshot.RegionServerSnapshotManager(136): Stopping
> RegionServerSnapshotManager abruptly.
> 2021-04-11T11:14:40,270 INFO [RS:2;ece3af76d634:45149]
> regionserver.HRegionServer(1146): aborting server
> ece3af76d634,45149,1618139661734
> 2021-04-11T11:14:40,272 ERROR
> [ReplicationExecutor-0.replicationSource,1-ece3af76d634,44745,1618139625245]
> regionserver.ReplicationSource(428): Unexpected exception in
> ReplicationExecutor-0.replicationSource,1-ece3af76d634,44745,1618139625245
> currentPath=null
> java.lang.IllegalStateException: Source should be active.
> at
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.initialize(ReplicationSource.java:547)
> ~[classes/:?]
> at java.lang.Thread.run(Thread.java:748) [?:1.8.0_282]
> 2021-04-11T11:14:40,272 DEBUG
> [ReplicationExecutor-0.replicationSource,1-ece3af76d634,44745,1618139625245]
> regionserver.HRegionServer(2576): Abort already in progress. Ignoring the
> current request with reason: Unexpected exception in
> ReplicationExecutor-0.replicationSource,1-ece3af76d634,44745,1618139625245
> {code}
> Maybe it should use HBASE-24877 to avoid failure of the initialize of
> ReplicationSource.
>
--
This message was sent by Atlassian Jira
(v8.3.4#803005)