[jira] [Commented] (TRAFODION-2235) Enhance node failure detection and coordination

ASF GitHub Bot (JIRA) Tue, 21 Feb 2017 14:15:07 -0800

    [ https://issues.apache.org/jira/browse/TRAFODION-2235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15876842#comment-15876842 ]


ASF GitHub Bot commented on TRAFODION-2235:
-------------------------------------------

Github user zcorrea commented on a diff in the pull request:

    https://github.com/apache/incubator-trafodion/pull/958#discussion_r102332356
  
    --- Diff: core/sqf/monitor/linux/zclient.cxx ---
    @@ -472,6 +469,65 @@ void CZClient::CheckCluster( void )
         TRACE_EXIT;
     }
     
    +void CZClient::CheckMyZNode( void )
    +{
    +    const char method_name[] = "CZClient::CheckMyZNode";
    +    TRACE_ENTRY;
    +
    +    int zerr;
    +    struct timespec currentTime;
    +
    +    if ( IsCheckCluster() )
    +    {
    +        if (resetMyZNodeFailedTime_)
    +        {
    +            resetMyZNodeFailedTime_ = false;
    +            clock_gettime(CLOCK_REALTIME, &myZNodeFailedTime_);
    +            myZNodeFailedTime_.tv_sec += (GetSessionTimeout() * 2);
    +            if (trace_settings & (TRACE_INIT | TRACE_RECOVERY))
    +            {
    +                trace_printf( "%s@%d" " - Resetting MyZnode Fail Time %ld(secs)\n"
    +                            , method_name, __LINE__
    +                            , myZNodeFailedTime_.tv_sec );
    +            }
    +        }
    +        if ( ! IsZNodeExpired( Node_name, zerr ) )
    +        {
    +            if ( zerr == ZCONNECTIONLOSS || zerr == ZOPERATIONTIMEOUT )
    +            {
    +                // Ignore transient errors with the quorum.
    +                // However, if longer than the session
    +                // timeout, handle it as a hard error.
    +                clock_gettime(CLOCK_REALTIME, &currentTime);
    +                if (currentTime.tv_sec > myZNodeFailedTime_.tv_sec)
    --- End diff --
    
    The desired behavior is to keep resetting the local node's ZNode
expiration time for as long as the ZNode has not expired, which is the normal
state. The only errors returned by IsZNodeExpired that are handled at this
point are communication errors with the Zookeeper quorum, i.e., connection
loss (ZCONNECTIONLOSS) and operation timeout (ZOPERATIONTIMEOUT). These errors
can be transient, but if they persist beyond myZNodeFailedTime_ they indicate
that the local ZNode has gone beyond the local monitor's session expiration
time window, and the local monitor must bring itself down.
    
    So yes, setting resetMyZNodeFailedTime_ to true resets myZNodeFailedTime_
on each iteration (every zcMonitoringRateValue seconds), which keeps pushing
the expiration time out until there is a communication failure with the
Zookeeper quorum.
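
    The reset-and-deadline behavior described above can be sketched in
isolation. This is only an illustration, not the code in zclient.cxx:
ZNodeWatchdog, Probe, and Check are hypothetical names, the current time is
passed in as a parameter instead of being read with clock_gettime, and the
only detail taken from the diff is the grace window of twice the session
timeout.

```cpp
#include <ctime>

// Hypothetical outcome of one probe of the local ZNode. TransientError stands
// in for ZCONNECTIONLOSS / ZOPERATIONTIMEOUT; the names are illustrative only.
enum class Probe { Ok, TransientError };

// Hypothetical stand-in for the deadline handling in CZClient::CheckMyZNode.
class ZNodeWatchdog {
public:
    explicit ZNodeWatchdog(long sessionTimeoutSecs)
        : graceSecs_(sessionTimeoutSecs * 2) {}

    // Called once per monitoring iteration.
    // Returns true when a communication error has outlasted the grace window
    // and should be handled as a hard error.
    bool Check(Probe result, time_t nowSecs) {
        if (resetDeadline_) {
            // Normal state: push the deadline out from the current time.
            resetDeadline_ = false;
            deadlineSecs_ = nowSecs + graceSecs_;
        }
        if (result == Probe::Ok) {
            // ZNode is healthy, so ask for a fresh deadline next iteration.
            resetDeadline_ = true;
            return false;
        }
        // Transient error: the deadline stays frozen until a probe succeeds.
        return nowSecs > deadlineSecs_;
    }

private:
    long graceSecs_;
    bool resetDeadline_ = true;
    time_t deadlineSecs_ = 0;
};
```

    With a 30 second session timeout, a run of successful probes never trips
the watchdog, while a run of transient errors trips it only once more than 60
seconds have passed since the iteration in which the errors began.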


> Enhance node failure detection and coordination
> -----------------------------------------------
>
>                 Key: TRAFODION-2235
>                 URL: https://issues.apache.org/jira/browse/TRAFODION-2235
>             Project: Apache Trafodion
>          Issue Type: Bug
>          Components: foundation, installer
>    Affects Versions: 2.1-incubating
>            Reporter: Gonzalo E Correa
>            Assignee: Gonzalo E Correa
>
> Certain server and network failures are not detected by the monitor processes,
> which causes a safety net failure detection mechanism to trigger in all
> Trafodion nodes. This safety net mechanism is controlled by the environment
> variable SQ_MON_SYNC_TIMEOUT, currently set at 15 minutes.
> This JIRA is to enhance the node failure mechanism in the Trafodion 
> foundation components, specifically the monitor process, to detect a 
> non-responsive node and handle it as a node down condition when a 
> configurable timeout event is detected prior to the safety net failure 
> mechanism described above.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
