Re: Undeleted replication queue for removed peer found
I guess the problem is that you exceeded the maximum size limit for a ZooKeeper multi operation. I searched the code base of branch-1; you could try setting 'hbase.zookeeper.useMulti' to false in your hbase-site.xml to disable multi, so that the operation can succeed. But this may introduce inconsistency, so you'd better find out why there are so many files that need to be claimed or deleted, fix that problem, and then switch hbase.zookeeper.useMulti back to true.

And the 1.4.x release line is already EOL, so I suggest you upgrade to the current stable release line, 2.5.x.

Thanks.

On Sat, Nov 18, 2023 at 20:21, Manimekalai wrote:
> [quoted original message and stack traces trimmed; see the full post below]
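For reference, the temporary workaround suggested above would look roughly like this in hbase-site.xml (property name as used on branch-1; a sketch of the config change, not a recommendation to leave it in place):

```xml
<!-- Temporary workaround only: disables ZooKeeper multi so queue cleanup
     falls back to sequential operations. Revert to true once the backlog
     of replication queue znodes has been cleaned up. -->
<property>
  <name>hbase.zookeeper.useMulti</name>
  <value>false</value>
</property>
```

A rolling restart of the affected processes is needed for the setting to take effect.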
Undeleted replication queue for removed peer found
Dear Team,

In one of the HBase clusters, some of the replication queues have not been properly removed, even though the concerned peerId has been removed from list_peers. Due to this, frequent region server restarts are occurring in the cluster where replication has to be written.

I have tried hbase hbck -fixReplication, but it didn't work. The HBase version is 1.4.14.

Below are the exceptions from the Master and RegionServer respectively.

*Master Exception*

2023-11-18 13:01:30,815 ERROR [172.XX.XX.XX,16020,1700289063450_ChoreService_2] zookeeper.RecoverableZooKeeper: ZooKeeper multi failed after 4 attempts
2023-11-18 13:01:30,815 WARN [172.XX.XX.XX,16020,1700289063450_ChoreService_2] cleaner.ReplicationZKNodeCleanerChore: Failed to clean replication zk node
java.io.IOException: Failed to delete queue, replicator: 172.XX.XX.XX,16020,1655822657566, queueId: 3
        at org.apache.hadoop.hbase.master.cleaner.ReplicationZKNodeCleaner$ReplicationQueueDeletor.removeQueue(ReplicationZKNodeCleaner.java:160)
        at org.apache.hadoop.hbase.master.cleaner.ReplicationZKNodeCleaner.removeQueues(ReplicationZKNodeCleaner.java:197)
        at org.apache.hadoop.hbase.master.cleaner.ReplicationZKNodeCleanerChore.chore(ReplicationZKNodeCleanerChore.java:49)
        at org.apache.hadoop.hbase.ScheduledChore.run(ScheduledChore.java:189)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
        at org.apache.hadoop.hbase.JitterScheduledThreadPoolExecutorImpl$JitteredRunnableScheduledFuture.run(JitterScheduledThreadPoolExecutorImpl.java:111)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)

*RegionServer Exception*

2023-11-18 13:17:52,200 WARN [main-SendThread(10.XX.XX.XX:2171)] zookeeper.ClientCnxn: Session 0xXXX for server 10.XX.XX.XX/10.XX.XX.XX:2171, unexpected error, closing socket connection and attempting reconnect
java.io.IOException: Broken pipe
        at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
        at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47)
        at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93)
        at sun.nio.ch.IOUtil.write(IOUtil.java:65)
        at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:471)
        at org.apache.zookeeper.ClientCnxnSocketNIO.doIO(ClientCnxnSocketNIO.java:117)
        at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:366)
        at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1141)
2023-11-18 13:17:52,300 ERROR [ReplicationExecutor-0] zookeeper.RecoverableZooKeeper: ZooKeeper multi failed after 4 attempts
2023-11-18 13:17:52,300 WARN [ReplicationExecutor-0] replication.ReplicationQueuesZKImpl: Got exception in copyQueuesFromRSUsingMulti:
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
        at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:992)
        at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:910)
        at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.multi(RecoverableZooKeeper.java:672)
        at org.apache.hadoop.hbase.zookeeper.ZKUtil.multiOrSequential(ZKUtil.java:1685)
        at org.apache.hadoop.hbase.replication.ReplicationQueuesZKImpl.moveQueueUsingMulti(ReplicationQueuesZKImpl.java:410)
        at org.apache.hadoop.hbase.replication.ReplicationQueuesZKImpl.claimQueue(ReplicationQueuesZKImpl.java:257)
        at org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager$NodeFailoverWorker.run(ReplicationSourceManager.java:700)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)

Please help to solve this issue.

Regards,
Manimekalai K
Replication queue?
Hi,

If I have master - slave replication and the master goes down, replication will start back where it was when the master comes back online. Fine.

If I have master - slave replication and the slave goes down, is the data queued until the slave comes back online and then sent? If so, how big can this queue be, and how long can the slave be down?

Same questions for master - master... I guess for this one it's like the first case above and it's fine, right?

Thanks,

JM
Re: Replication queue?
You can find a lot here: http://hbase.apache.org/replication.html

And how many logs you can queue is limited only by how much disk space you have :)

On Tue, Aug 20, 2013 at 7:23 AM, Jean-Marc Spaggiari jean-m...@spaggiari.org wrote:
> [quoted message trimmed]
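To make the point above concrete, a back-of-envelope sketch of how long the slave can be down before the WALs queued on the source cluster exhaust free disk space (all numbers are hypothetical, and real clusters should also account for other consumers of that disk):

```python
# Back-of-envelope: while the slave is down, the source cluster retains the
# WALs queued for replication, so the outage budget is bounded by free disk.
# All numbers below are hypothetical illustrations.

def max_outage_hours(free_disk_gb: float, wal_gb_per_hour: float) -> float:
    """Hours the slave can stay down before queued WALs fill the free disk."""
    return free_disk_gb / wal_gb_per_hour

# e.g. 500 GB free, writing roughly 10 GB of WALs per hour
print(max_outage_hours(500, 10))  # 50.0
```

In practice the write rate is bursty, so you'd size against peak ingest rather than the average.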
Re: Replication queue?
RTFM? ;) Thanks for pointing me to this link! I have all the responses I need there.

JM

2013/8/20 Jean-Daniel Cryans jdcry...@apache.org:
> [quoted message trimmed]