dcapwell commented on code in PR #1796:
URL: https://github.com/apache/cassandra/pull/1796#discussion_r950462795


##########
src/java/org/apache/cassandra/service/TruncateResponseHandler.java:
##########
@@ -61,24 +63,34 @@ public TruncateResponseHandler(int responseCount)
     public void get() throws TimeoutException
     {
         long timeoutNanos = getTruncateRpcTimeout(NANOSECONDS) - (nanoTime() - 
start);
-        boolean completedInTime;
+        boolean signaled;
         try
         {
-            completedInTime = condition.await(timeoutNanos, NANOSECONDS); // 
TODO truncate needs a much longer timeout
+            signaled = condition.await(timeoutNanos, NANOSECONDS); // TODO 
truncate needs a much longer timeout
         }
         catch (InterruptedException e)
         {
             throw new UncheckedInterruptedException(e);
         }
 
-        if (!completedInTime)
-        {
+        if (!signaled)
             throw new TimeoutException("Truncate timed out - received only " + 
responses.get() + " responses");
-        }
 
-        if (truncateFailingReplica != null)
+        if (!failureReasonByEndpoint.isEmpty())
         {
-            throw new TruncateException("Truncate failed on replica " + 
truncateFailingReplica);
+            // clone to make sure no race condition happens
+            Map<InetAddressAndPort, RequestFailureReason> 
failureReasonByEndpoint = new HashMap<>(this.failureReasonByEndpoint);
+            int size = failureReasonByEndpoint.size();
+            long timeouts = 
failureReasonByEndpoint.values().stream().filter(RequestFailureReason.TIMEOUT::equals).count();
+            long nonTimeout = size - timeouts;
+            if (nonTimeout <= timeouts)
+                throw new TimeoutException("Truncate timed out - received only 
" + responses.get() + " responses");

Review Comment:
   the bug reported was when 100% of the replies are timeout and we return the 
wrong exception type, so I wouldn't want to return a `TrucateException` in this 
case.
   
   Now, if `signal=true && failed`, I guess we have a more fundmental question 
of "how many replicas timeout for us to call it a timeout"?  I would prefer we 
are consistent so would want all read/write/truncate to match the logic...
   
   So, lets say we are dealing with a LOCAL_QUORUM read with rf=6, if 
failures=3 then we will fail, but lets say we have the following
   
   R1: TIMEOUT
   R2: TIMEOUT
   R3: UNKNOWN
   
   *should* this be a timeout, or should this be an error?  *should* this only 
ever be a timeout when 100% are timed out?  Personally I feel that if majority 
are timeout, we actually timed out.  @clohfink thoughts?
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to