[ https://issues.apache.org/jira/browse/HBASE-20842?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Guanghao Zhang updated HBASE-20842:
-----------------------------------
Attachment: HBASE-20842.master.002.patch
> Infinite loop when replaying remote wals
> ----------------------------------------
>
> Key: HBASE-20842
> URL: https://issues.apache.org/jira/browse/HBASE-20842
> Project: HBase
> Issue Type: Bug
> Components: Replication
> Reporter: Duo Zhang
> Assignee: Guanghao Zhang
> Priority: Major
> Fix For: 3.0.0
>
> Attachments: HBASE-20842.master.001.patch, HBASE-20842.master.002.patch
>
>
> {noformat}
> 2018-07-03 12:25:11,375 WARN [RSProcedureDispatcher-pool13-t19]
> replication.SyncReplicationReplayWALRemoteProcedure(107): Replay wals
> [remoteWALs/1-replay/asf916.gq1.ygridcore.net%2C36931%2C1530620616106-1530620683061-1.1530620683075.syncrep]
> on asf916.gq1.ygridcore.net,33811,1530620636539 failed for peer id=1
> org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: Server
> asf916.gq1.ygridcore.net,33811,1530620636539 is not online
> at
> org.apache.hadoop.hbase.master.procedure.RSProcedureDispatcher$DeadRSRemoteCall.call(RSProcedureDispatcher.java:285)
> at
> org.apache.hadoop.hbase.master.procedure.RSProcedureDispatcher$DeadRSRemoteCall.call(RSProcedureDispatcher.java:276)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> 2018-07-03 12:25:11,440 DEBUG [Thread-2883]
> replication.TestSyncReplicationStandbyKillRS(111): Server
> [asf916.gq1.ygridcore.net,33811,1530620636539] marked as dead, waiting for it
> to finish dead processing
> 2018-07-03 12:25:11,441 DEBUG [Thread-2883]
> replication.TestSyncReplicationStandbyKillRS(114): Server
> [asf916.gq1.ygridcore.net,33811,1530620636539] still being processed, waiting
> 2018-07-03 12:25:11,456 WARN [RS:3;asf916:45751] wal.AbstractFSWAL(419):
> 'hbase.regionserver.maxlogs' was deprecated.
> 2018-07-03 12:25:11,457 INFO [RS:3;asf916:45751] wal.AbstractFSWAL(424): WAL
> configuration: blocksize=256 MB, rollsize=128 MB,
> prefix=asf916.gq1.ygridcore.net%2C45751%2C1530620709275, suffix=,
> logDir=hdfs://localhost:42624/user/jenkins/test-data/a86a805e-162f-5f22-7b9e-573dbf0f40fb/WALs/asf916.gq1.ygridcore.net,45751,1530620709275,
> archiveDir=hdfs://localhost:42624/user/jenkins/test-data/a86a805e-162f-5f22-7b9e-573dbf0f40fb/oldWALs
> 2018-07-03 12:25:11,467 DEBUG [RS-EventLoopGroup-14-4]
> asyncfs.FanOutOneBlockAsyncDFSOutputSaslHelper(737): SASL client skipping
> handshake in unsecured configuration for addr = 127.0.0.1/127.0.0.1,
> datanodeId =
> DatanodeInfoWithStorage[127.0.0.1:38997,DS-6002160d-388b-4840-8538-e4c2255108be,DISK]
> 2018-07-03 12:25:11,467 DEBUG [RS-EventLoopGroup-14-5]
> asyncfs.FanOutOneBlockAsyncDFSOutputSaslHelper(737): SASL client skipping
> handshake in unsecured configuration for addr = 127.0.0.1/127.0.0.1,
> datanodeId =
> DatanodeInfoWithStorage[127.0.0.1:45904,DS-e189e3c8-a1bd-475c-86c0-3891e541fc6e,DISK]
> 2018-07-03 12:25:11,467 DEBUG [RS-EventLoopGroup-14-3]
> asyncfs.FanOutOneBlockAsyncDFSOutputSaslHelper(737): SASL client skipping
> handshake in unsecured configuration for addr = 127.0.0.1/127.0.0.1,
> datanodeId =
> DatanodeInfoWithStorage[127.0.0.1:39589,DS-62ced3f8-35c4-4904-80cc-4d514b8f4050,DISK]
> 2018-07-03 12:25:11,495 DEBUG [RegionServerTracker-0]
> procedure2.ProcedureExecutor(887): Stored pid=30,
> state=RUNNABLE:SERVER_CRASH_START; ServerCrashProcedure
> server=asf916.gq1.ygridcore.net,33811,1530620636539, splitWal=true, meta=true
> 2018-07-03 12:25:11,495 DEBUG [RegionServerTracker-0]
> assignment.AssignmentManager(1321):
> Added=asf916.gq1.ygridcore.net,33811,1530620636539 to dead servers, submitted
> shutdown handler to be executed meta=true
> 2018-07-03 12:25:11,498 INFO [PEWorker-6]
> procedure.ServerCrashProcedure(118): Start pid=30,
> state=RUNNABLE:SERVER_CRASH_START; ServerCrashProcedure
> server=asf916.gq1.ygridcore.net,33811,1530620636539, splitWal=true, meta=true
> 2018-07-03 12:25:11,500 WARN [RegionServerTracker-0]
> replication.SyncReplicationReplayWALRemoteProcedure(107): Replay wals
> [remoteWALs/1-replay/asf916.gq1.ygridcore.net%2C36931%2C1530620616106-1530620683061-1.1530620683075.syncrep]
> on asf916.gq1.ygridcore.net,33811,1530620636539 failed for peer id=1
> org.apache.hadoop.hbase.DoNotRetryIOException: server not online
> asf916.gq1.ygridcore.net,33811,1530620636539
> at
> org.apache.hadoop.hbase.master.procedure.RSProcedureDispatcher.abortPendingOperations(RSProcedureDispatcher.java:130)
> at
> org.apache.hadoop.hbase.master.procedure.RSProcedureDispatcher.abortPendingOperations(RSProcedureDispatcher.java:60)
> at
> org.apache.hadoop.hbase.procedure2.RemoteProcedureDispatcher$BufferNode.abortOperationsInQueue(RemoteProcedureDispatcher.java:380)
> at
> org.apache.hadoop.hbase.procedure2.RemoteProcedureDispatcher.removeNode(RemoteProcedureDispatcher.java:193)
> at
> org.apache.hadoop.hbase.master.procedure.RSProcedureDispatcher.serverRemoved(RSProcedureDispatcher.java:143)
> at
> org.apache.hadoop.hbase.master.ServerManager.expireServer(ServerManager.java:610)
> at
> org.apache.hadoop.hbase.master.RegionServerTracker.refresh(RegionServerTracker.java:160)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> 2018-07-03 12:25:11,503 WARN [PEWorker-4]
> replication.SyncReplicationReplayWALRemoteProcedure(162): Can not add remote
> operation for replay wals
> [remoteWALs/1-replay/asf916.gq1.ygridcore.net%2C36931%2C1530620616106-1530620683061-1.1530620683075.syncrep]
> on asf916.gq1.ygridcore.net,33811,1530620636539 for peer id=1, this usually
> because the server is already dead, retry
> 2018-07-03 12:25:11,503 WARN [PEWorker-4]
> replication.SyncReplicationReplayWALRemoteProcedure(162): Can not add remote
> operation for replay wals
> [remoteWALs/1-replay/asf916.gq1.ygridcore.net%2C36931%2C1530620616106-1530620683061-1.1530620683075.syncrep]
> on asf916.gq1.ygridcore.net,33811,1530620636539 for peer id=1, this usually
> because the server is already dead, retry
> 2018-07-03 12:25:11,503 WARN [PEWorker-4]
> replication.SyncReplicationReplayWALRemoteProcedure(162): Can not add remote
> operation for replay wals
> [remoteWALs/1-replay/asf916.gq1.ygridcore.net%2C36931%2C1530620616106-1530620683061-1.1530620683075.syncrep]
> on asf916.gq1.ygridcore.net,33811,1530620636539 for peer id=1, this usually
> because the server is already dead, retry
> 2018-07-03 12:25:11,503 WARN [PEWorker-7]
> replication.SyncReplicationReplayWALRemoteProcedure(162): Can not add remote
> operation for replay wals
> [remoteWALs/1-replay/asf916.gq1.ygridcore.net%2C36931%2C1530620616106-1530620683061-1.1530620683075.syncrep]
> on asf916.gq1.ygridcore.net,33811,1530620636539 for peer id=1, this usually
> because the server is already dead, retry
> 2018-07-03 12:25:11,504 WARN [PEWorker-7]
> replication.SyncReplicationReplayWALRemoteProcedure(162): Can not add remote
> operation for replay wals
> [remoteWALs/1-replay/asf916.gq1.ygridcore.net%2C36931%2C1530620616106-1530620683061-1.1530620683075.syncrep]
> on asf916.gq1.ygridcore.net,33811,1530620636539 for peer id=1, this usually
> because the server is already dead, retry
> 2018-07-03 12:25:11,504 WARN [PEWorker-7]
> replication.SyncReplicationReplayWALRemoteProcedure(162): Can not add remote
> operation for replay wals
> [remoteWALs/1-replay/asf916.gq1.ygridcore.net%2C36931%2C1530620616106-1530620683061-1.1530620683075.syncrep]
> on asf916.gq1.ygridcore.net,33811,1530620636539 for peer id=1, this usually
> because the server is already dead, retry
> 2018-07-03 12:25:11,504 WARN [PEWorker-7]
> replication.SyncReplicationReplayWALRemoteProcedure(162): Can not add remote
> operation for replay wals
> [remoteWALs/1-replay/asf916.gq1.ygridcore.net%2C36931%2C1530620616106-1530620683061-1.1530620683075.syncrep]
> on asf916.gq1.ygridcore.net,33811,1530620636539 for peer id=1, this usually
> because the server is already dead, retry
> 2018-07-03 12:25:11,504 WARN [PEWorker-7]
> replication.SyncReplicationReplayWALRemoteProcedure(162): Can not add remote
> operation for replay wals
> [remoteWALs/1-replay/asf916.gq1.ygridcore.net%2C36931%2C1530620616106-1530620683061-1.1530620683075.syncrep]
> on asf916.gq1.ygridcore.net,33811,1530620636539 for peer id=1, this usually
> because the server is already dead, retry
> 2018-07-03 12:25:11,504 WARN [PEWorker-7]
> replication.SyncReplicationReplayWALRemoteProcedure(162): Can not add remote
> operation for replay wals
> [remoteWALs/1-replay/asf916.gq1.ygridcore.net%2C36931%2C1530620616106-1530620683061-1.1530620683075.syncrep]
> on asf916.gq1.ygridcore.net,33811,1530620636539 for peer id=1, this usually
> because the server is already dead, retry
> 2018-07-03 12:25:11,504 WARN [PEWorker-7]
> replication.SyncReplicationReplayWALRemoteProcedure(162): Can not add remote
> operation for replay wals
> [remoteWALs/1-replay/asf916.gq1.ygridcore.net%2C36931%2C1530620616106-1530620683061-1.1530620683075.syncrep]
> on asf916.gq1.ygridcore.net,33811,1530620636539 for peer id=1, this usually
> because the server is already dead, retry
> 2018-07-03 12:25:11,504 WARN [PEWorker-7]
> replication.SyncReplicationReplayWALRemoteProcedure(162): Can not add remote
> operation for replay wals
> [remoteWALs/1-replay/asf916.gq1.ygridcore.net%2C36931%2C1530620616106-1530620683061-1.1530620683075.syncrep]
> on asf916.gq1.ygridcore.net,33811,1530620636539 for peer id=1, this usually
> because the server is already dead, retry
> 2018-07-03 12:25:11,505 WARN [PEWorker-11]
> replication.SyncReplicationReplayWALRemoteProcedure(162): Can not add remote
> operation for replay wals
> [remoteWALs/1-replay/asf916.gq1.ygridcore.net%2C36931%2C1530620616106-1530620683061-1.1530620683075.syncrep]
> on asf916.gq1.ygridcore.net,33811,1530620636539 for peer id=1, this usually
> because the server is already dead, retry
> 2018-07-03 12:25:11,505 WARN [PEWorker-8]
> replication.SyncReplicationReplayWALRemoteProcedure(162): Can not add remote
> operation for replay wals
> [remoteWALs/1-replay/asf916.gq1.ygridcore.net%2C36931%2C1530620616106-1530620683061-1.1530620683075.syncrep]
> on asf916.gq1.ygridcore.net,33811,1530620636539 for peer id=1, this usually
> because the server is already dead, retry
> 2018-07-03 12:25:11,505 WARN [PEWorker-8]
> replication.SyncReplicationReplayWALRemoteProcedure(162): Can not add remote
> operation for replay wals
> [remoteWALs/1-replay/asf916.gq1.ygridcore.net%2C36931%2C1530620616106-1530620683061-1.1530620683075.syncrep]
> on asf916.gq1.ygridcore.net,33811,1530620636539 for peer id=1, this usually
> because the server is already dead, retry
> 2018-07-03 12:25:11,505 WARN [PEWorker-8]
> replication.SyncReplicationReplayWALRemoteProcedure(162): Can not add remote
> operation for replay wals
> [remoteWALs/1-replay/asf916.gq1.ygridcore.net%2C36931%2C1530620616106-1530620683061-1.1530620683075.syncrep]
> on asf916.gq1.ygridcore.net,33811,1530620636539 for peer id=1, this usually
> because the server is already dead, retry
> {noformat}
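
The repeated "Can not add remote operation" warnings above carry nearly identical timestamps and all target the same server that was already marked dead, which suggests the retry fires immediately and never re-selects a target. As a rough illustration only (this is not the attached patch, and every class and method name below is hypothetical rather than HBase API), one common way to avoid such a tight loop is to back off between attempts and move on to another candidate server:

```java
import java.util.List;
import java.util.Optional;
import java.util.Set;

/** Hypothetical sketch: bounded retry with backoff and target re-selection. Not HBase code. */
final class ReplayRetrySketch {

    /** Stand-in for dispatching a replay request; fails when the target is not online. */
    static boolean dispatch(String server, Set<String> onlineServers) {
        return onlineServers.contains(server);
    }

    /** Exponential backoff: baseMillis * 2^attempt, capped at maxMillis. */
    static long backoffMillis(int attempt, long baseMillis, long maxMillis) {
        int shift = Math.min(attempt, 20); // keep the shift small to avoid overflow
        return Math.min(maxMillis, baseMillis << shift);
    }

    /**
     * Tries candidates in round-robin order, sleeping between failed attempts.
     * Returns the server that accepted the request, or empty after maxAttempts.
     */
    static Optional<String> replayWithRetry(List<String> candidates,
                                            Set<String> onlineServers,
                                            int maxAttempts) {
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            // Pick a (possibly different) target on every attempt instead of
            // hammering the same dead server.
            String target = candidates.get(attempt % candidates.size());
            if (dispatch(target, onlineServers)) {
                return Optional.of(target);
            }
            try {
                Thread.sleep(backoffMillis(attempt, 1, 50));
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve interrupt status
                return Optional.empty();
            }
        }
        return Optional.empty();
    }
}
```

With candidates `["rs-dead", "rs-live"]` and only `rs-live` online, the first attempt fails, the sketch sleeps briefly, and the second attempt succeeds on `rs-live`. Whether the real fix should re-select a server, back off, or suspend the procedure until the ServerCrashProcedure finishes is a design question for the patch itself.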