[ 
https://issues.apache.org/jira/browse/GEODE-3964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16267534#comment-16267534
 ] 

ASF GitHub Bot commented on GEODE-3964:
---------------------------------------

bschuchardt commented on a change in pull request #1088: GEODE-3964: More logging for suspect processing.
URL: https://github.com/apache/geode/pull/1088#discussion_r153320760
 
 

 ##########
 File path: 
geode-core/src/main/java/org/apache/geode/distributed/internal/ReplyProcessor21.java
 ##########
 @@ -695,25 +698,32 @@ protected boolean basicWait(long msecs, StoppableCountDownLatch latch)
         if (timedOut || !latch.await(timeout - timeSoFar - 1)) {
           this.dmgr.getCancelCriterion().checkCancelInProgress(null);
 
-          // only start SUSPECT processing if severe alerts are enabled
-          timeout(isSevereAlertProcessingEnabled() && (severeAlertTimeout > 0), false);
+          timeout(doSuspectProcessing, false);
 
           // If ack-severe-alert-threshold has been set, we now
           // wait for that period of time and then force the non-responding
           // members from the system. Then we wait indefinitely
-          if (isSevereAlertProcessingEnabled() && severeAlertTimeout > 0) {
-            boolean timedout;
+          if (doSuspectProcessing) {
+            boolean wasNotUnlatched;
             do {
               this.severeAlertTimerReset = false; // retry if this gets set by suspect processing
                                                   // (splitbrain requirement)
-              timedout = !latch.await(severeAlertTimeout);
-            } while (timedout && this.severeAlertTimerReset);
-            if (timedout) {
+              wasNotUnlatched = !latch.await(severeAlertTimeout);
+            } while (wasNotUnlatched && this.severeAlertTimerReset);
+            if (wasNotUnlatched) {
               this.dmgr.getCancelCriterion().checkCancelInProgress(null);
               timeout(false, true);
-              // for consistency, we must now wait for a membership view
-              // that ejects the removed members
-              latch.await();
+
+              long suspectProcessingErrorAlertTimeout = severeAlertTimeout * 3;
+              if (!latch.await(suspectProcessingErrorAlertTimeout)) {
+                long now = System.currentTimeMillis();
+                logger.fatal("Still waiting for suspect processing to complete after"
 
 Review comment:
   The message should say how long the wait has lasted and which nodes haven't
responded. It would also help customer support if the message contained the
word "elapsed", because that's one of the keywords they search for when
assessing problems.
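
   A minimal sketch of what such a message could look like, for illustration
only. The class SuspectWaitMessage, the method format, and the member names are
hypothetical and are not part of Geode or of this pull request; the real code
would take the elapsed time and the list of non-responding members from
ReplyProcessor21's own state.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.Locale;
import java.util.concurrent.TimeUnit;

// Hypothetical helper, not Geode code: builds a log message that reports the
// elapsed wait time and the members that have not yet replied.
class SuspectWaitMessage {

  // elapsedMillis: how long the thread has been waiting so far.
  // waitingOn: identifiers of the members that have not responded (assumed
  // to be plain strings here; Geode uses its own member ID type).
  static String format(long elapsedMillis, Collection<String> waitingOn) {
    long seconds = TimeUnit.MILLISECONDS.toSeconds(elapsedMillis);
    // The word "elapsed" is included deliberately so that the message
    // matches the keyword support staff search for.
    return String.format(Locale.ROOT,
        "%d seconds have elapsed while waiting for suspect processing to complete; "
            + "still waiting for replies from %s",
        seconds, waitingOn);
  }

  public static void main(String[] args) {
    // Pretend the wait has lasted 45 seconds and two members are silent.
    String message = format(45_000L, Arrays.asList("member-a", "member-b"));
    System.out.println(message);
  }
}
```

   In the patch itself the resulting string would be passed to logger.fatal in
place of the fixed "Still waiting..." text.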



> Add another severe-alert option
> -------------------------------
>
>                 Key: GEODE-3964
>                 URL: https://issues.apache.org/jira/browse/GEODE-3964
>             Project: Geode
>          Issue Type: Bug
>          Components: messaging
>            Reporter: Bruce Schuchardt
>
> Since suspect processing only commences when the ack-severe-alert-threshold
> is reached, it would be nice to have yet another alert if that processing
> fails to kick out the slow-to-respond member and a thread remains stuck for a
> long time waiting for a reply.



