I agree with you on that. I just wanted to highlight that I am experiencing the same behavior.
Regards
Manish

On Tue, Jan 18, 2022, 22:50 Bowen Song <bo...@bso.ng> wrote:

> The link was related to Cassandra 1.2, and it was 9 years ago. Cassandra
> was full of bugs at that time, and it has improved a lot since then. For
> that reason, I would rather not compare the issue you have with a
> nine-year-old issue someone else had.
>
>
> On 18/01/2022 16:11, manish khandelwal wrote:
>
> I am not sure what is happening, but it has happened thrice: merkle
> trees are not received from the nodes of the other data center. The
> issue is along similar lines to the one mentioned here:
> https://user.cassandra.apache.narkive.com/GTbqO6za/repair-hangs-when-merkle-tree-request-is-not-acknowledged
>
> Regards
> Manish
>
> On Tue, Jan 18, 2022, 18:18 Bowen Song <bo...@bso.ng> wrote:
>
>> Keep reading the logs on the initiator and on the node sending the
>> merkle tree: does anything follow that? FYI, not every log line
>> contains the repair ID, so please read the relevant logs in
>> chronological order without filtering (e.g. "grep") on the repair ID.
>>
>> I'm sceptical that a network issue is causing all this. The merkle
>> tree is sent over a TCP connection, so the occasional dropped packet
>> during a few seconds of connectivity trouble should not cause any
>> issue for the repair. You should only start to see network-related
>> issues if the network problem persists for a period of time close to
>> or longer than the timeout values set in the cassandra.yaml file; in
>> the case of repair that is request_timeout_in_ms, which defaults to
>> 10 seconds.
>>
>> Carry on examining the logs, and you may find something useful.
>>
>> BTW, talking about stuck repairs: in my experience this can happen if
>> two or more repairs were run concurrently on the same node (regardless
>> of which node was the initiator) involving the same table. This could
>> happen if you accidentally ran "nodetool repair" on two nodes and both
>> involved the same table, or if you cancelled and then restarted a
>> "nodetool repair" on a node without waiting for, or killing, the
>> remnants of the first repair session on other nodes.
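>>
>> A quick way to check whether such leftover repair work is still
>> running on a node is to look at the relevant thread pools, e.g. (a
>> sketch; the pool names below are from the 3.x line and may differ
>> between versions):
>>
>>     nodetool tpstats | grep -E 'ValidationExecutor|AntiEntropyStage'
>>
>> Non-zero Active or Pending counts on these pools at a time when no
>> repair is expected to run would suggest remnants of an earlier repair
>> session.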
>> On 18/01/2022 11:55, manish khandelwal wrote:
>>
>> In the system logs, on the node where the repair was initiated, I see
>> that the node has requested merkle trees from all nodes, including
>> itself:
>>
>> INFO [Repair#3:1] 2022-01-14 03:32:18,805 RepairJob.java:172 - [repair
>> #6e3385e0-74d1-11ec-8e66-9f084ace9968] Requesting merkle trees for
>> tablename (to [/xyz.abc.def.14, /xyz.abc.def.13, /xyz.abc.def.12,
>> /xyz.mkn.pq.18, /xyz.mkn.pq.16, /xyz.mkn.pq.17])
>> INFO [AntiEntropyStage:1] 2022-01-14 03:32:18,841 RepairSession.java:180
>> - [repair #6e3385e0-74d1-11ec-8e66-9f084ace9968] Received merkle tree
>> for tablename from /xyz.mkn.pq.17
>> INFO [AntiEntropyStage:1] 2022-01-14 03:32:18,847 RepairSession.java:180
>> - [repair #6e3385e0-74d1-11ec-8e66-9f084ace9968] Received merkle tree
>> for tablename from /xyz.mkn.pq.16
>> INFO [AntiEntropyStage:1] 2022-01-14 03:32:18,851 RepairSession.java:180
>> - [repair #6e3385e0-74d1-11ec-8e66-9f084ace9968] Received merkle tree
>> for tablename from /xyz.mkn.pq.18
>> INFO [AntiEntropyStage:1] 2022-01-14 03:32:18,856 RepairSession.java:180
>> - [repair #6e3385e0-74d1-11ec-8e66-9f084ace9968] Received merkle tree
>> for tablename from /xyz.abc.def.14
>> INFO [AntiEntropyStage:1] 2022-01-14 03:32:18,876 RepairSession.java:180
>> - [repair #6e3385e0-74d1-11ec-8e66-9f084ace9968] Received merkle tree
>> for tablename from /xyz.abc.def.12
>>
>> As per the logs, the merkle tree was not received from the node with
>> IP xyz.abc.def.13.
>>
>> In the system logs of the node with IP xyz.abc.def.13, I can see the
>> following:
>>
>> INFO [AntiEntropyStage:1] 2022-01-14 03:32:18,850 Validator.java:281 -
>> [repair #6e3385e0-74d1-11ec-8e66-9f084ace9968] Sending completed
>> merkle tree to /xyz.mkn.pq.17 for keyspace.tablename
>>
>> From the above I inferred that the repair task has become orphaned: it
>> is waiting for a merkle tree from a node, and it is never going to
>> receive it, because the tree was lost somewhere in the network in
>> between.
>>
>> Regards
>> Manish
>>
>> On Tue, Jan 18, 2022 at 4:39 PM Bowen Song <bo...@bso.ng> wrote:
>>
>>> The entry in the debug.log is not specific to a repair session, and
>>> it could also be caused by things other than network connectivity
>>> issues, such as long STW GC pauses. I usually don't start
>>> troubleshooting an issue from the debug log, as it can be rather
>>> noisy. The system.log is a better starting point.
>>>
>>> If I were to troubleshoot the issue, I would start from the system
>>> logs on the node that initiated the repair, i.e. the node you ran the
>>> "nodetool repair" command on. Follow the repair ID (a UUID) in the
>>> logs on all nodes involved in the repair, and read all related logs
>>> in chronological order to find out what exactly happened.
>>>
>>> BTW, if the issue is easily reproducible, I would re-run the repair
>>> with a reduced scope (such as a single table and token range) to get
>>> fewer logs related to the repair session. Fewer logs means less time
>>> spent reading and analysing them.
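>>>
>>> For example (the keyspace, table, and token values below are only
>>> placeholders, to be adapted to your own ring), a single-table,
>>> single-range repair could look like:
>>>
>>>     nodetool repair -st -9223372036854775808 -et -9123372036854775808 keyspace tablename
>>>
>>> which keeps the amount of repair-related logging per run small.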
>>>
>>> Hope this helps.
>>>
>>> On 18/01/2022 10:03, manish khandelwal wrote:
>>>
>>> I have a Cassandra 3.11.2 cluster with two DCs. While running repair,
>>> I am observing the following behavior.
>>>
>>> The node is not able to receive the merkle tree from one or two
>>> nodes, yet I can see that the missing nodes did send their merkle
>>> trees; they were just never received. This makes the repair hang on a
>>> consistent basis.
>>>
>>> In netstats I can see the following output:
>>>
>>> Mode: NORMAL
>>> Not sending any streams.
>>> Attempted: 7858888
>>> Mismatch (Blocking): 2560
>>> Mismatch (Background): 17173
>>> Pool Name         Active  Pending  Completed  Dropped
>>> Large messages    n/a     0        6313       3
>>> Small messages    n/a     0        55978004   3
>>> Gossip messages   n/a     0        93756      125
>>>
>>> Does it represent network issues? In the debug logs I saw this:
>>>
>>> DEBUG [MessagingService-Outgoing-hostname/xxx.yy.zz.kk-Large]
>>> 2022-01-14 05:00:19,031 OutboundTcpConnection.java:349 - Error
>>> writing to hostname/xxx.yy.zz.kk
>>> java.io.IOException: Connection timed out
>>> at sun.nio.ch.FileDispatcherImpl.write0(Native Method) ~[na:1.8.0_221]
>>> at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47)
>>> ~[na:1.8.0_221]
>>> at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93)
>>> ~[na:1.8.0_221]
>>> at sun.nio.ch.IOUtil.write(IOUtil.java:65) ~[na:1.8.0_221]
>>> at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:471)
>>> ~[na:1.8.0_221]
>>> at java.nio.channels.Channels.writeFullyImpl(Channels.java:78)
>>> ~[na:1.8.0_221]
>>> at java.nio.channels.Channels.writeFully(Channels.java:98)
>>> ~[na:1.8.0_221]
>>> at java.nio.channels.Channels.access$000(Channels.java:61)
>>> ~[na:1.8.0_221]
>>> at java.nio.channels.Channels$1.write(Channels.java:174) ~[na:1.8.0_221]
>>> at net.jpountz.lz4.LZ4BlockOutputStream.flushBufferedData(LZ4BlockOutputStream.java:205)
>>> ~[lz4-1.3.0.jar:na]
>>> at net.jpountz.lz4.LZ4BlockOutputStream.write(LZ4BlockOutputStream.java:158)
>>> ~[lz4-1.3.0.jar:na]
>>>
>>> Does this show any network fluctuations?
>>>
>>> Regards
>>> Manish