[ 
https://issues.apache.org/jira/browse/CASSANDRA-14674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jaydeepkumar Chovatia updated CASSANDRA-14674:
----------------------------------------------
    Description: 
Validation request messages are currently sent during repair with 
[sendOneWay|https://github.com/apache/cassandra/blob/cassandra-3.0/src/java/org/apache/cassandra/repair/ValidationTask.java#L56]
 and the sender then waits at 
[Futures.getUnchecked|https://github.com/apache/cassandra/blob/cassandra-3.0/src/java/org/apache/cassandra/repair/RepairJob.java#L160].
 If the sender doesn’t hear back from the receiver for whatever reason, the thread is 
blocked forever. I’ve reproduced the following stack trace on the sender side by 
deliberately ignoring 
[VALIDATION_REQUEST|https://github.com/apache/cassandra/blob/cassandra-3.0/src/java/org/apache/cassandra/repair/RepairMessageVerbHandler.java#L114]
 on the receiver side.
{quote}
"Repair#1:1" #301 daemon prio=5 os_prio=0 tid=0x00007f5a62060800 nid=0x13198 waiting on condition [0x00007f5a5cc6c000]
 java.lang.Thread.State: WAITING (parking)
 at sun.misc.Unsafe.park(Native Method)
 parking to wait for <0x00000005c6ba9630> (a com.google.common.util.concurrent.AbstractFuture$Sync)
 at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
 at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
 at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
 at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
 at com.google.common.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:285)
 at com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:116)
 at com.google.common.util.concurrent.Uninterruptibles.getUninterruptibly(Uninterruptibles.java:137)
 at com.google.common.util.concurrent.Futures.getUnchecked(Futures.java:1509)
 at org.apache.cassandra.repair.RepairJob.run(RepairJob.java:160)
 at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
 at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
 at org.apache.cassandra.concurrent.NamedThreadFactory.lambda$threadLocalDeallocator$0(NamedThreadFactory.java:79)
 at org.apache.cassandra.concurrent.NamedThreadFactory$$Lambda$3/1858015030.run(Unknown Source)
 at java.lang.Thread.run(Thread.java:745)
{quote}
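The hang in the trace above can be illustrated outside Cassandra. The following is only a minimal sketch, not Cassandra code: it uses the JDK's {{CompletableFuture}} in place of the Guava future from the trace, and the class and method names ({{BlockedWaitSketch}}, {{completesWithin}}) are made up for illustration. It shows that a future nobody completes never returns from an untimed wait, while a timed wait reports the missing response.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class BlockedWaitSketch {
    /** Returns true if the future completed within the timeout, false if the wait timed out. */
    public static boolean completesWithin(CompletableFuture<String> response, long timeoutMillis)
            throws InterruptedException, ExecutionException {
        try {
            response.get(timeoutMillis, TimeUnit.MILLISECONDS);
            return true;
        } catch (TimeoutException e) {
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical stand-in for the validation response. Nothing ever completes it,
        // which mimics a receiver that ignores the request.
        CompletableFuture<String> neverAnswered = new CompletableFuture<>();

        // Calling neverAnswered.get() here would park this thread indefinitely.
        System.out.println(completesWithin(neverAnswered, 100)); // prints "false"

        // A response that did arrive is returned immediately.
        CompletableFuture<String> answered = CompletableFuture.completedFuture("tree");
        System.out.println(completesWithin(answered, 100)); // prints "true"
    }
}
```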
AFAIK we should be using {{sendRR}} for this instead of {{sendOneWay}}. Please 
let me know if my understanding is correct or not.

I am working on a fix to make it {{sendRR}}.
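
As a rough illustration of what a request/response style send would provide, here is a minimal sketch of a callback registry whose entries expire. It is not the Cassandra messaging API: {{ExpiringCallbackSketch}}, {{send}} and {{onResponse}} are hypothetical names, and the timeout values are arbitrary. The point is only that when every registered callback is failed after a deadline, the waiting thread is always released, even if the receiver never answers.

```java
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicLong;

public class ExpiringCallbackSketch {
    private final Map<Long, CompletableFuture<String>> pending = new ConcurrentHashMap<>();
    private final AtomicLong nextId = new AtomicLong();
    // Daemon timer thread so the sketch does not keep the JVM alive.
    private final ScheduledExecutorService timer =
            Executors.newSingleThreadScheduledExecutor(r -> {
                Thread t = new Thread(r, "callback-expirer");
                t.setDaemon(true);
                return t;
            });

    /** Registers the callback for a new request and schedules its expiry; returns the request id. */
    public long send(CompletableFuture<String> callback, long timeoutMillis) {
        long id = nextId.incrementAndGet();
        pending.put(id, callback);
        timer.schedule(() -> {
            // If the callback is still registered, no response arrived in time: fail it.
            CompletableFuture<String> expired = pending.remove(id);
            if (expired != null)
                expired.completeExceptionally(new TimeoutException("no response for request " + id));
        }, timeoutMillis, TimeUnit.MILLISECONDS);
        return id;
    }

    /** Delivers a response to the matching callback, if it has not expired yet. */
    public void onResponse(long id, String payload) {
        CompletableFuture<String> callback = pending.remove(id);
        if (callback != null)
            callback.complete(payload);
    }

    public static void main(String[] args) throws Exception {
        ExpiringCallbackSketch messaging = new ExpiringCallbackSketch();
        CompletableFuture<String> ignored = new CompletableFuture<>();
        messaging.send(ignored, 100); // the "receiver" never calls onResponse
        try {
            ignored.get(); // an untimed wait is safe here because expiry guarantees completion
        } catch (ExecutionException e) {
            System.out.println("failed: " + e.getCause().getMessage());
        }
    }
}
```

With this shape the sender can treat the expiry as a failed validation and abort the repair job instead of parking indefinitely.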


> Repair Validation message request could get stuck forever at sender side
> ------------------------------------------------------------------------
>
>                 Key: CASSANDRA-14674
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-14674
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Repair
>            Reporter: Jaydeepkumar Chovatia
>            Assignee: Jaydeepkumar Chovatia
>            Priority: Major


