GitHub user zcorrea commented on a diff in the pull request:

    https://github.com/apache/incubator-trafodion/pull/958#discussion_r102332356
  
    --- Diff: core/sqf/monitor/linux/zclient.cxx ---
    @@ -472,6 +469,65 @@ void CZClient::CheckCluster( void )
         TRACE_EXIT;
     }
     
    +void CZClient::CheckMyZNode( void )
    +{
    +    const char method_name[] = "CZClient::CheckMyZNode";
    +    TRACE_ENTRY;
    +
    +    int zerr;
    +    struct timespec currentTime;
    +
    +    if ( IsCheckCluster() )
    +    {
    +        if (resetMyZNodeFailedTime_)
    +        {
    +            resetMyZNodeFailedTime_ = false;
    +            clock_gettime(CLOCK_REALTIME, &myZNodeFailedTime_);
    +            myZNodeFailedTime_.tv_sec += (GetSessionTimeout() * 2);
    +            if (trace_settings & (TRACE_INIT | TRACE_RECOVERY))
    +            {
    +                trace_printf( "%s@%d" " - Resetting MyZnode Fail Time %ld(secs)\n"
    +                            , method_name, __LINE__
    +                            , myZNodeFailedTime_.tv_sec );
    +            }
    +        }
    +        if ( ! IsZNodeExpired( Node_name, zerr ) )
    +        {
    +            if ( zerr == ZCONNECTIONLOSS || zerr == ZOPERATIONTIMEOUT )
    +            {
    +                // Ignore transient errors with the quorum.
    +                // However, if longer than the session
    +                // timeout, handle it as a hard error.
    +                clock_gettime(CLOCK_REALTIME, &currentTime);
    +                if (currentTime.tv_sec > myZNodeFailedTime_.tv_sec)
    --- End diff --
    
    The desired behavior is to continually reset the local node's ZNode expiration time while the ZNode has not expired, which is the normal state. The only errors from IsZNodeExpired that are handled at this point are communication errors with the Zookeeper quorum, i.e., connection loss (ZCONNECTIONLOSS) and operation timeout (ZOPERATIONTIMEOUT). These errors can be transient, but if they persist beyond myZNodeFailedTime_, they indicate that the local ZNode has gone past the local monitor's session expiration window, and the local monitor must bring itself down.
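
    A minimal C++ sketch of that decision, for illustration only and not the code in this PR. DecideZNodeAction, ZNodeAction, and the kZ* constants are hypothetical names. The constants are stand-ins whose values match ZOK, ZCONNECTIONLOSS, and ZOPERATIONTIMEOUT in the ZooKeeper C client's zookeeper.h. Treating any other error as fatal is an assumption, since the excerpt ends before that branch.

```cpp
#include <cassert>
#include <ctime>

// Stand-ins for ZooKeeper C client return codes so the sketch builds
// without zookeeper.h (values match ZOK, ZCONNECTIONLOSS, ZOPERATIONTIMEOUT).
constexpr int kZOk = 0;
constexpr int kZConnectionLoss = -4;
constexpr int kZOperationTimeout = -7;

enum class ZNodeAction { Continue, Tolerate, BringDown };

// Hypothetical helper: what the local monitor does after a check that
// reported its ZNode as not expired. failDeadline plays the role of
// myZNodeFailedTime_.tv_sec in the diff.
ZNodeAction DecideZNodeAction(int zerr, std::time_t now, std::time_t failDeadline)
{
    // ZooKeeper C client return codes are zero (ZOK) or negative.
    assert(zerr <= kZOk);

    if (zerr == kZOk)
    {
        return ZNodeAction::Continue;  // normal state: ZNode is alive
    }
    if (zerr == kZConnectionLoss || zerr == kZOperationTimeout)
    {
        // Transient quorum error: ignore it until it has persisted past
        // the fail deadline, then handle it as a hard error.
        return now > failDeadline ? ZNodeAction::BringDown
                                  : ZNodeAction::Tolerate;
    }
    // Assumption: any other error reaching this point is fatal.
    return ZNodeAction::BringDown;
}
```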
    
    So yes, setting resetMyZNodeFailedTime_ to true resets myZNodeFailedTime_ to the current time plus twice the session timeout on each iteration (every zcMonitoringRateValue seconds). This keeps pushing the expiration time out until there is a communication failure with the Zookeeper quorum.
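
    The reset-flag pattern can be sketched the same way. This sketch assumes the current time is passed in rather than read with clock_gettime(), so the logic can be exercised without a real clock. The ZNodeFailDeadline class and its methods are hypothetical. Only the two-session-timeout deadline and the reset-at-the-top ordering follow the diff.

```cpp
#include <cassert>
#include <ctime>

// Hypothetical sketch of the resetMyZNodeFailedTime_ / myZNodeFailedTime_
// pattern; `now` is injected instead of calling clock_gettime().
class ZNodeFailDeadline
{
public:
    explicit ZNodeFailDeadline(std::time_t sessionTimeoutSecs)
        : sessionTimeoutSecs_(sessionTimeoutSecs)
    {
        assert(sessionTimeoutSecs_ > 0);
    }

    // Called once per monitoring iteration. quorumReachable is false when
    // the check hit a transient quorum error. Returns true when the local
    // monitor should bring itself down.
    bool OnIteration(std::time_t now, bool quorumReachable)
    {
        if (resetPending_)
        {
            // Push the deadline out to two session timeouts from now.
            resetPending_ = false;
            failDeadline_ = now + sessionTimeoutSecs_ * 2;
        }
        if (quorumReachable)
        {
            // Healthy check: re-arm the reset so the next iteration
            // moves the deadline forward again.
            resetPending_ = true;
            return false;
        }
        // Transient error: fatal only once the deadline has passed.
        return now > failDeadline_;
    }

    std::time_t Deadline() const { return failDeadline_; }

private:
    std::time_t sessionTimeoutSecs_;
    bool resetPending_ = true;      // first iteration arms the deadline
    std::time_t failDeadline_ = 0;
};
```

    With a 10-second session timeout, healthy iterations at t=0 and t=5 move the deadline to 20 and then 25. The first failing iteration at t=10 still applies the pending reset (deadline 30), so the monitor only brings itself down if failures continue past t=30.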
