[ https://issues.apache.org/jira/browse/HBASE-20829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16531442#comment-16531442 ]
Duo Zhang commented on HBASE-20829:
-----------------------------------
[~zghaobac] FYI. This seems to be a problem with replaying remote WALs...
{noformat}
2018-07-03 12:25:11,375 WARN [RSProcedureDispatcher-pool13-t19] replication.SyncReplicationReplayWALRemoteProcedure(107): Replay wals [remoteWALs/1-replay/asf916.gq1.ygridcore.net%2C36931%2C1530620616106-1530620683061-1.1530620683075.syncrep] on asf916.gq1.ygridcore.net,33811,1530620636539 failed for peer id=1
org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: Server asf916.gq1.ygridcore.net,33811,1530620636539 is not online
	at org.apache.hadoop.hbase.master.procedure.RSProcedureDispatcher$DeadRSRemoteCall.call(RSProcedureDispatcher.java:285)
	at org.apache.hadoop.hbase.master.procedure.RSProcedureDispatcher$DeadRSRemoteCall.call(RSProcedureDispatcher.java:276)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
2018-07-03 12:25:11,440 DEBUG [Thread-2883] replication.TestSyncReplicationStandbyKillRS(111): Server [asf916.gq1.ygridcore.net,33811,1530620636539] marked as dead, waiting for it to finish dead processing
2018-07-03 12:25:11,441 DEBUG [Thread-2883] replication.TestSyncReplicationStandbyKillRS(114): Server [asf916.gq1.ygridcore.net,33811,1530620636539] still being processed, waiting
2018-07-03 12:25:11,456 WARN [RS:3;asf916:45751] wal.AbstractFSWAL(419): 'hbase.regionserver.maxlogs' was deprecated.
2018-07-03 12:25:11,457 INFO [RS:3;asf916:45751] wal.AbstractFSWAL(424): WAL configuration: blocksize=256 MB, rollsize=128 MB, prefix=asf916.gq1.ygridcore.net%2C45751%2C1530620709275, suffix=, logDir=hdfs://localhost:42624/user/jenkins/test-data/a86a805e-162f-5f22-7b9e-573dbf0f40fb/WALs/asf916.gq1.ygridcore.net,45751,1530620709275, archiveDir=hdfs://localhost:42624/user/jenkins/test-data/a86a805e-162f-5f22-7b9e-573dbf0f40fb/oldWALs
2018-07-03 12:25:11,467 DEBUG [RS-EventLoopGroup-14-4] asyncfs.FanOutOneBlockAsyncDFSOutputSaslHelper(737): SASL client skipping handshake in unsecured configuration for addr = 127.0.0.1/127.0.0.1, datanodeId = DatanodeInfoWithStorage[127.0.0.1:38997,DS-6002160d-388b-4840-8538-e4c2255108be,DISK]
2018-07-03 12:25:11,467 DEBUG [RS-EventLoopGroup-14-5] asyncfs.FanOutOneBlockAsyncDFSOutputSaslHelper(737): SASL client skipping handshake in unsecured configuration for addr = 127.0.0.1/127.0.0.1, datanodeId = DatanodeInfoWithStorage[127.0.0.1:45904,DS-e189e3c8-a1bd-475c-86c0-3891e541fc6e,DISK]
2018-07-03 12:25:11,467 DEBUG [RS-EventLoopGroup-14-3] asyncfs.FanOutOneBlockAsyncDFSOutputSaslHelper(737): SASL client skipping handshake in unsecured configuration for addr = 127.0.0.1/127.0.0.1, datanodeId = DatanodeInfoWithStorage[127.0.0.1:39589,DS-62ced3f8-35c4-4904-80cc-4d514b8f4050,DISK]
2018-07-03 12:25:11,495 DEBUG [RegionServerTracker-0] procedure2.ProcedureExecutor(887): Stored pid=30, state=RUNNABLE:SERVER_CRASH_START; ServerCrashProcedure server=asf916.gq1.ygridcore.net,33811,1530620636539, splitWal=true, meta=true
2018-07-03 12:25:11,495 DEBUG [RegionServerTracker-0] assignment.AssignmentManager(1321): Added=asf916.gq1.ygridcore.net,33811,1530620636539 to dead servers, submitted shutdown handler to be executed meta=true
2018-07-03 12:25:11,498 INFO [PEWorker-6] procedure.ServerCrashProcedure(118): Start pid=30, state=RUNNABLE:SERVER_CRASH_START; ServerCrashProcedure server=asf916.gq1.ygridcore.net,33811,1530620636539, splitWal=true, meta=true
2018-07-03 12:25:11,500 WARN [RegionServerTracker-0] replication.SyncReplicationReplayWALRemoteProcedure(107): Replay wals [remoteWALs/1-replay/asf916.gq1.ygridcore.net%2C36931%2C1530620616106-1530620683061-1.1530620683075.syncrep] on asf916.gq1.ygridcore.net,33811,1530620636539 failed for peer id=1
org.apache.hadoop.hbase.DoNotRetryIOException: server not online asf916.gq1.ygridcore.net,33811,1530620636539
	at org.apache.hadoop.hbase.master.procedure.RSProcedureDispatcher.abortPendingOperations(RSProcedureDispatcher.java:130)
	at org.apache.hadoop.hbase.master.procedure.RSProcedureDispatcher.abortPendingOperations(RSProcedureDispatcher.java:60)
	at org.apache.hadoop.hbase.procedure2.RemoteProcedureDispatcher$BufferNode.abortOperationsInQueue(RemoteProcedureDispatcher.java:380)
	at org.apache.hadoop.hbase.procedure2.RemoteProcedureDispatcher.removeNode(RemoteProcedureDispatcher.java:193)
	at org.apache.hadoop.hbase.master.procedure.RSProcedureDispatcher.serverRemoved(RSProcedureDispatcher.java:143)
	at org.apache.hadoop.hbase.master.ServerManager.expireServer(ServerManager.java:610)
	at org.apache.hadoop.hbase.master.RegionServerTracker.refresh(RegionServerTracker.java:160)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
2018-07-03 12:25:11,503 WARN [PEWorker-4] replication.SyncReplicationReplayWALRemoteProcedure(162): Can not add remote operation for replay wals [remoteWALs/1-replay/asf916.gq1.ygridcore.net%2C36931%2C1530620616106-1530620683061-1.1530620683075.syncrep] on asf916.gq1.ygridcore.net,33811,1530620636539 for peer id=1, this usually because the server is already dead, retry
2018-07-03 12:25:11,503 WARN [PEWorker-4] replication.SyncReplicationReplayWALRemoteProcedure(162): Can not add remote operation for replay wals [remoteWALs/1-replay/asf916.gq1.ygridcore.net%2C36931%2C1530620616106-1530620683061-1.1530620683075.syncrep] on asf916.gq1.ygridcore.net,33811,1530620636539 for peer id=1, this usually because the server is already dead, retry
2018-07-03 12:25:11,503 WARN [PEWorker-4] replication.SyncReplicationReplayWALRemoteProcedure(162): Can not add remote operation for replay wals [remoteWALs/1-replay/asf916.gq1.ygridcore.net%2C36931%2C1530620616106-1530620683061-1.1530620683075.syncrep] on asf916.gq1.ygridcore.net,33811,1530620636539 for peer id=1, this usually because the server is already dead, retry
2018-07-03 12:25:11,503 WARN [PEWorker-7] replication.SyncReplicationReplayWALRemoteProcedure(162): Can not add remote operation for replay wals [remoteWALs/1-replay/asf916.gq1.ygridcore.net%2C36931%2C1530620616106-1530620683061-1.1530620683075.syncrep] on asf916.gq1.ygridcore.net,33811,1530620636539 for peer id=1, this usually because the server is already dead, retry
2018-07-03 12:25:11,504 WARN [PEWorker-7] replication.SyncReplicationReplayWALRemoteProcedure(162): Can not add remote operation for replay wals [remoteWALs/1-replay/asf916.gq1.ygridcore.net%2C36931%2C1530620616106-1530620683061-1.1530620683075.syncrep] on asf916.gq1.ygridcore.net,33811,1530620636539 for peer id=1, this usually because the server is already dead, retry
2018-07-03 12:25:11,504 WARN [PEWorker-7] replication.SyncReplicationReplayWALRemoteProcedure(162): Can not add remote operation for replay wals [remoteWALs/1-replay/asf916.gq1.ygridcore.net%2C36931%2C1530620616106-1530620683061-1.1530620683075.syncrep] on asf916.gq1.ygridcore.net,33811,1530620636539 for peer id=1, this usually because the server is already dead, retry
2018-07-03 12:25:11,504 WARN [PEWorker-7] replication.SyncReplicationReplayWALRemoteProcedure(162): Can not add remote operation for replay wals [remoteWALs/1-replay/asf916.gq1.ygridcore.net%2C36931%2C1530620616106-1530620683061-1.1530620683075.syncrep] on asf916.gq1.ygridcore.net,33811,1530620636539 for peer id=1, this usually because the server is already dead, retry
2018-07-03 12:25:11,504 WARN [PEWorker-7] replication.SyncReplicationReplayWALRemoteProcedure(162): Can not add remote operation for replay wals [remoteWALs/1-replay/asf916.gq1.ygridcore.net%2C36931%2C1530620616106-1530620683061-1.1530620683075.syncrep] on asf916.gq1.ygridcore.net,33811,1530620636539 for peer id=1, this usually because the server is already dead, retry
2018-07-03 12:25:11,504 WARN [PEWorker-7] replication.SyncReplicationReplayWALRemoteProcedure(162): Can not add remote operation for replay wals [remoteWALs/1-replay/asf916.gq1.ygridcore.net%2C36931%2C1530620616106-1530620683061-1.1530620683075.syncrep] on asf916.gq1.ygridcore.net,33811,1530620636539 for peer id=1, this usually because the server is already dead, retry
2018-07-03 12:25:11,504 WARN [PEWorker-7] replication.SyncReplicationReplayWALRemoteProcedure(162): Can not add remote operation for replay wals [remoteWALs/1-replay/asf916.gq1.ygridcore.net%2C36931%2C1530620616106-1530620683061-1.1530620683075.syncrep] on asf916.gq1.ygridcore.net,33811,1530620636539 for peer id=1, this usually because the server is already dead, retry
2018-07-03 12:25:11,504 WARN [PEWorker-7] replication.SyncReplicationReplayWALRemoteProcedure(162): Can not add remote operation for replay wals [remoteWALs/1-replay/asf916.gq1.ygridcore.net%2C36931%2C1530620616106-1530620683061-1.1530620683075.syncrep] on asf916.gq1.ygridcore.net,33811,1530620636539 for peer id=1, this usually because the server is already dead, retry
2018-07-03 12:25:11,505 WARN [PEWorker-11] replication.SyncReplicationReplayWALRemoteProcedure(162): Can not add remote operation for replay wals [remoteWALs/1-replay/asf916.gq1.ygridcore.net%2C36931%2C1530620616106-1530620683061-1.1530620683075.syncrep] on asf916.gq1.ygridcore.net,33811,1530620636539 for peer id=1, this usually because the server is already dead, retry
2018-07-03 12:25:11,505 WARN [PEWorker-8] replication.SyncReplicationReplayWALRemoteProcedure(162): Can not add remote operation for replay wals [remoteWALs/1-replay/asf916.gq1.ygridcore.net%2C36931%2C1530620616106-1530620683061-1.1530620683075.syncrep] on asf916.gq1.ygridcore.net,33811,1530620636539 for peer id=1, this usually because the server is already dead, retry
2018-07-03 12:25:11,505 WARN [PEWorker-8] replication.SyncReplicationReplayWALRemoteProcedure(162): Can not add remote operation for replay wals [remoteWALs/1-replay/asf916.gq1.ygridcore.net%2C36931%2C1530620616106-1530620683061-1.1530620683075.syncrep] on asf916.gq1.ygridcore.net,33811,1530620636539 for peer id=1, this usually because the server is already dead, retry
2018-07-03 12:25:11,505 WARN [PEWorker-8] replication.SyncReplicationReplayWALRemoteProcedure(162): Can not add remote operation for replay wals [remoteWALs/1-replay/asf916.gq1.ygridcore.net%2C36931%2C1530620616106-1530620683061-1.1530620683075.syncrep] on asf916.gq1.ygridcore.net,33811,1530620636539 for peer id=1, this usually because the server is already dead, retry
{noformat}
> Remove the addFront assertion in MasterProcedureScheduler.doAdd
> ---------------------------------------------------------------
>
> Key: HBASE-20829
> URL: https://issues.apache.org/jira/browse/HBASE-20829
> Project: HBase
> Issue Type: Bug
> Components: Replication
> Reporter: Duo Zhang
> Assignee: Duo Zhang
> Priority: Major
> Fix For: 3.0.0, 2.1.0, 2.2.0
>
> Attachments: HBASE-20829-debug.patch, HBASE-20829-v1.patch,
> HBASE-20829.patch,
> org.apache.hadoop.hbase.replication.TestSyncReplicationStandbyKillRS-output.txt
>
>
> Timed out.
> {noformat}
> 2018-06-30 01:32:33,823 ERROR [Time-limited test] replication.TestSyncReplicationStandbyKillRS(93): Failed to transit standby cluster to DOWNGRADE_ACTIVE
> {noformat}
> We failed to transit the state to DA (DOWNGRADE_ACTIVE) and then waited for
> it to become DA, so the test hung there.