Re: Undeleted replication queue for removed peer found

2023-11-18 Thread Duo Zhang
I guess the problem is that you exceeded the maximum request size limit for
a ZooKeeper multi operation.

I searched the branch-1 code base; you could try setting
'hbase.zookeeper.useMulti' to false in your hbase-site.xml to disable
multi so the operation can succeed. But this may introduce
inconsistency, so you'd better find out why there are so many files
that need to be claimed or deleted, fix that problem, and then switch
hbase.zookeeper.useMulti back to true.
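For reference, the workaround above would look like this in hbase-site.xml
(a sketch for the branch-1 setting named above; note that disabling multi
trades atomic batch operations for smaller individual ZooKeeper requests):

```xml
<property>
  <name>hbase.zookeeper.useMulti</name>
  <value>false</value>
</property>
```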

Also, the 1.4.x release line is already EOL; I suggest you upgrade to the
current stable release line, 2.5.x.

Thanks.

Manimekalai wrote on Sat, 18 Nov 2023 at 20:21:
>
> Dear Team,
>
> In one of our HBase clusters, some replication queues have not been
> properly removed, even though the corresponding peerId has been removed
> from list_peers.
>
> Because of this, frequent region server restarts are occurring in the
> cluster where replication has to be written.
>
> I have tried using hbase hbck -fixReplication, but it didn't work.
>
> The HBase version is 1.4.14.
>
> Below are the exceptions from the Master and the RegionServer, respectively:
> *Master Exception*
>
> 2023-11-18 13:01:30,815 ERROR
> > [172.XX.XX.XX,16020,1700289063450_ChoreService_2]
> > zookeeper.RecoverableZooKeeper: ZooKeeper multi failed after 4 attempts
> > 2023-11-18 13:01:30,815 WARN  
> > [172.XX.XX.XX,,16020,1700289063450_ChoreService_2]
> > cleaner.ReplicationZKNodeCleanerChore: Failed to clean replication zk node
> > java.io.IOException: Failed to delete queue, replicator:
> > 172.XX.XX.XX,,16020,1655822657566, queueId: 3
> > at
> > org.apache.hadoop.hbase.master.cleaner.ReplicationZKNodeCleaner$ReplicationQueueDeletor.
> > removeQueue(ReplicationZKNodeCleaner.java:160)
> > at
> > org.apache.hadoop.hbase.master.cleaner.ReplicationZKNodeCleaner.
> > removeQueues(ReplicationZKNodeCleaner.java:197)
> > at
> > org.apache.hadoop.hbase.master.cleaner.ReplicationZKNodeCleanerChore.chore(ReplicationZKNodeCleanerChore.java:49)
> > at
> > org.apache.hadoop.hbase.ScheduledChore.run(ScheduledChore.java:189)
> > at
> > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> > at
> > java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
> > at
> > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
> > at
> > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
> > at
> > org.apache.hadoop.hbase.JitterScheduledThreadPoolExecutorImpl$JitteredRunnableScheduledFuture.run(JitterScheduledThreadPoolExecutorImpl.java:111)
> > at
> > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> > at
> > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> > at java.lang.Thread.run(Thread.java:748)
>
>
>
> *RegionServer Exception*
>
> 2023-11-18 13:17:52,200 WARN  [main-SendThread(10.XX.XX.XX:2171)]
> > zookeeper.ClientCnxn: Session 0xXXX for server
> > 10.XX.XX.XX/10.XX.XX.XX:2171, unexpected error, closing socket connection
> > and attempting reconnect
> > java.io.IOException: Broken pipe
> > at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
> > at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47)
> > at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93)
> > at sun.nio.ch.IOUtil.write(IOUtil.java:65)
> > at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:471)
> > at
> > org.apache.zookeeper.ClientCnxnSocketNIO.doIO(ClientCnxnSocketNIO.java:117)
> > at
> > org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:366)
> > at
> > org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1141)
> > 2023-11-18 13:17:52,300 ERROR [ReplicationExecutor-0]
> > zookeeper.RecoverableZooKeeper: ZooKeeper multi failed after 4 attempts
> > 2023-11-18 13:17:52,300 WARN  [ReplicationExecutor-0]
> > replication.ReplicationQueuesZKImpl: Got exception in
> > copyQueuesFromRSUsingMulti:
> > org.apache.zookeeper.KeeperException$ConnectionLossException:
> > KeeperErrorCode = ConnectionLoss
> > at
> > org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
> > at
> > org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:992)
> > at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:910)
> > at
> > org.a

Undeleted replication queue for removed peer found

2023-11-18 Thread Manimekalai
Dear Team,

In one of our HBase clusters, some replication queues have not been
properly removed, even though the corresponding peerId has been removed
from list_peers.

Because of this, frequent region server restarts are occurring in the
cluster where replication has to be written.

I have tried using hbase hbck -fixReplication, but it didn't work.
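When hbck cannot repair this, the orphaned queue znodes are sometimes
removed by hand with the ZooKeeper CLI. A rough sketch, assuming the
default znode parent /hbase (the actual path depends on
zookeeper.znode.parent; the server name and queueId below are taken from
the log excerpt and are illustrative, so adjust them to your cluster, and
back up the znodes before deleting anything):

```
# List region server znodes that still hold replication queues
hbase zkcli ls /hbase/replication/rs

# List the queues registered under one (stale) region server znode
hbase zkcli ls /hbase/replication/rs/172.XX.XX.XX,16020,1655822657566

# Remove the orphaned queue for the removed peer (queueId 3 in the log)
hbase zkcli rmr /hbase/replication/rs/172.XX.XX.XX,16020,1655822657566/3
```

Deleting a queue znode that still belongs to a live peer loses replication
state, so only do this for queues whose peerId no longer appears in list_peers.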

The HBase version is 1.4.14.

Below are the exceptions from the Master and the RegionServer, respectively:
*Master Exception*

2023-11-18 13:01:30,815 ERROR
> [172.XX.XX.XX,16020,1700289063450_ChoreService_2]
> zookeeper.RecoverableZooKeeper: ZooKeeper multi failed after 4 attempts
> 2023-11-18 13:01:30,815 WARN  
> [172.XX.XX.XX,,16020,1700289063450_ChoreService_2]
> cleaner.ReplicationZKNodeCleanerChore: Failed to clean replication zk node
> java.io.IOException: Failed to delete queue, replicator:
> 172.XX.XX.XX,,16020,1655822657566, queueId: 3
> at
> org.apache.hadoop.hbase.master.cleaner.ReplicationZKNodeCleaner$ReplicationQueueDeletor.
> removeQueue(ReplicationZKNodeCleaner.java:160)
> at
> org.apache.hadoop.hbase.master.cleaner.ReplicationZKNodeCleaner.
> removeQueues(ReplicationZKNodeCleaner.java:197)
> at
> org.apache.hadoop.hbase.master.cleaner.ReplicationZKNodeCleanerChore.chore(ReplicationZKNodeCleanerChore.java:49)
> at
> org.apache.hadoop.hbase.ScheduledChore.run(ScheduledChore.java:189)
> at
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> at
> java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
> at
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
> at
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
> at
> org.apache.hadoop.hbase.JitterScheduledThreadPoolExecutorImpl$JitteredRunnableScheduledFuture.run(JitterScheduledThreadPoolExecutorImpl.java:111)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)



*RegionServer Exception*

2023-11-18 13:17:52,200 WARN  [main-SendThread(10.XX.XX.XX:2171)]
> zookeeper.ClientCnxn: Session 0xXXX for server
> 10.XX.XX.XX/10.XX.XX.XX:2171, unexpected error, closing socket connection
> and attempting reconnect
> java.io.IOException: Broken pipe
> at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
> at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47)
> at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93)
> at sun.nio.ch.IOUtil.write(IOUtil.java:65)
> at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:471)
> at
> org.apache.zookeeper.ClientCnxnSocketNIO.doIO(ClientCnxnSocketNIO.java:117)
> at
> org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:366)
> at
> org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1141)
> 2023-11-18 13:17:52,300 ERROR [ReplicationExecutor-0]
> zookeeper.RecoverableZooKeeper: ZooKeeper multi failed after 4 attempts
> 2023-11-18 13:17:52,300 WARN  [ReplicationExecutor-0]
> replication.ReplicationQueuesZKImpl: Got exception in
> copyQueuesFromRSUsingMulti:
> org.apache.zookeeper.KeeperException$ConnectionLossException:
> KeeperErrorCode = ConnectionLoss
> at
> org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
> at
> org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:992)
> at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:910)
> at
> org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.multi(RecoverableZooKeeper.java:672)
> at
> org.apache.hadoop.hbase.zookeeper.ZKUtil.multiOrSequential(ZKUtil.java:1685)
> at
> org.apache.hadoop.hbase.replication.ReplicationQueuesZKImpl.moveQueueUsingMulti(ReplicationQueuesZKImpl.java:410)
> at
> org.apache.hadoop.hbase.replication.ReplicationQueuesZKImpl.claimQueue(ReplicationQueuesZKImpl.java:257)
> at
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager$NodeFailoverWorker.run(ReplicationSourceManager.java:700)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)



Please help me resolve this issue.


Regards,
Manimekalai K


Replication queue?

2013-08-20 Thread Jean-Marc Spaggiari
Hi,

If I have master - slave replication and the master goes down, replication
will resume where it left off when the master comes back online. Fine.
If I have master - slave replication and the slave goes down, is the data
queued until the slave comes back online and then sent? If so, how big can
this queue get, and how long can the slave be down?

Same questions for master - master... I guess for that one it's like the
first case above and it's fine, right?

Thanks,

JM


Re: Replication queue?

2013-08-20 Thread Jean-Daniel Cryans
You can find a lot here: http://hbase.apache.org/replication.html

And how many logs you can queue is limited only by how much disk space you
have :)
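To turn that point into numbers, here is a back-of-envelope sketch; the WAL
size, roll rate, and free space are purely illustrative assumptions, not
measurements from any real cluster:

```python
def max_outage_hours(free_disk_gb: float, wal_size_mb: float,
                     wals_rolled_per_hour: float) -> float:
    """Rough hours a slave can stay down before queued WALs fill the disk.

    The master retains every un-replicated WAL, so capacity is simply the
    number of WALs that fit in the free space, divided by the roll rate.
    """
    wals_that_fit = (free_disk_gb * 1024) / wal_size_mb
    return wals_that_fit / wals_rolled_per_hour

# e.g. 500 GB free, 128 MB WALs, 20 WALs rolled per hour
print(max_outage_hours(500, 128, 20))  # -> 200.0 hours, a bit over 8 days
```

In practice you would also budget for normal write growth on the same
volume, so the real window is shorter than this upper bound.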


On Tue, Aug 20, 2013 at 7:23 AM, Jean-Marc Spaggiari 
jean-m...@spaggiari.org wrote:

 Hi,

 If I have master - slave replication and the master goes down, replication
 will resume where it left off when the master comes back online. Fine.
 If I have master - slave replication and the slave goes down, is the data
 queued until the slave comes back online and then sent? If so, how big can
 this queue get, and how long can the slave be down?

 Same questions for master - master... I guess for that one it's like the
 first case above and it's fine, right?

 Thanks,

 JM



Re: Replication queue?

2013-08-20 Thread Jean-Marc Spaggiari
RTFM? ;)

Thanks for pointing me to this link! I have all the responses I need there.

JM

2013/8/20 Jean-Daniel Cryans jdcry...@apache.org

 You can find a lot here: http://hbase.apache.org/replication.html

 And how many logs you can queue is how much disk space you have :)


 On Tue, Aug 20, 2013 at 7:23 AM, Jean-Marc Spaggiari 
 jean-m...@spaggiari.org wrote:

  Hi,
 
  If I have master - slave replication and the master goes down, replication
  will resume where it left off when the master comes back online. Fine.
  If I have master - slave replication and the slave goes down, is the data
  queued until the slave comes back online and then sent? If so, how big can
  this queue get, and how long can the slave be down?
 
  Same questions for master - master... I guess for that one it's like the
  first case above and it's fine, right?
 
  Thanks,
 
  JM