[ 
https://issues.apache.org/jira/browse/TRAFODION-2235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15876742#comment-15876742
 ] 

ASF GitHub Bot commented on TRAFODION-2235:
-------------------------------------------

Github user trinakrug commented on a diff in the pull request:

    https://github.com/apache/incubator-trafodion/pull/958#discussion_r102274683
  
    --- Diff: core/sqf/monitor/linux/zclient.cxx ---
    @@ -472,6 +469,65 @@ void CZClient::CheckCluster( void )
         TRACE_EXIT;
     }
     
    +void CZClient::CheckMyZNode( void )
    +{
    +    const char method_name[] = "CZClient::CheckMyZNode";
    +    TRACE_ENTRY;
    +
    +    int zerr;
    +    struct timespec currentTime;
    +
    +    if ( IsCheckCluster() )
    +    {
    +        if (resetMyZNodeFailedTime_)
    +        {
    +            resetMyZNodeFailedTime_ = false;
    +            clock_gettime(CLOCK_REALTIME, &myZNodeFailedTime_);
    +            myZNodeFailedTime_.tv_sec += (GetSessionTimeout() * 2);
    +            if (trace_settings & (TRACE_INIT | TRACE_RECOVERY))
    +            {
    +                trace_printf( "%s@%d" " - Resetting MyZnode Fail Time %ld(secs)\n"
    +                            , method_name, __LINE__
    +                            , myZNodeFailedTime_.tv_sec );
    +            }
    +        }
    +        if ( ! IsZNodeExpired( Node_name, zerr ) )
    +        {
    +            if ( zerr == ZCONNECTIONLOSS || zerr == ZOPERATIONTIMEOUT )
    +            {
    +                // Ignore transient errors with the quorum.
    +                // However, if longer than the session
    +                // timeout, handle it as a hard error.
    +                clock_gettime(CLOCK_REALTIME, &currentTime);
    +                if (currentTime.tv_sec > myZNodeFailedTime_.tv_sec)
    --- End diff --
    
    If resetMyZNodeFailedTime_ is true, then this if statement will always 
    evaluate to true. Just verifying that this is the desired outcome.
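
To check how the comparison behaves right after a reset, the deadline logic from the diff can be isolated in a small standalone sketch. The struct and method names below are hypothetical and not part of the monitor code. The sketch assumes the session timeout is a positive number of seconds and compares only tv_sec, as the diff does.

```cpp
#include <ctime>

// Hypothetical stand-in for the deadline state kept by CZClient in the diff.
struct ZNodeDeadline
{
    bool reset = true;              // mirrors resetMyZNodeFailedTime_
    struct timespec failedTime{};   // mirrors myZNodeFailedTime_

    // Re-arm the deadline if requested, then report whether `now` is past it.
    // sessionTimeoutSecs plays the role of GetSessionTimeout() (assumed seconds).
    bool Expired(const struct timespec &now, long sessionTimeoutSecs)
    {
        if (reset)
        {
            reset = false;
            failedTime = now;
            failedTime.tv_sec += sessionTimeoutSecs * 2;
        }
        // Same comparison as in the diff: whole seconds only.
        return now.tv_sec > failedTime.tv_sec;
    }
};
```

In this sketch, a call that has just re-armed the deadline compares the current time against a point two session timeouts in the future, so the result depends on how much time passes before the next check and on whether the flag is set again in between.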


> Enhance node failure detection and coordination
> -----------------------------------------------
>
>                 Key: TRAFODION-2235
>                 URL: https://issues.apache.org/jira/browse/TRAFODION-2235
>             Project: Apache Trafodion
>          Issue Type: Bug
>          Components: foundation, installer
>    Affects Versions: 2.1-incubating
>            Reporter: Gonzalo E Correa
>            Assignee: Gonzalo E Correa
>
> Certain server and network failures are not detected by the monitor processes, 
> which causes a safety-net failure detection mechanism to trigger on all 
> Trafodion nodes. This safety-net mechanism is controlled by the environment 
> variable SQ_MON_SYNC_TIMEOUT, currently set to 15 minutes.
> This JIRA is to enhance the node failure detection mechanism in the Trafodion 
> foundation components, specifically the monitor process, so that it detects a 
> non-responsive node and handles it as a node-down condition when a 
> configurable timeout expires, before the safety-net mechanism described 
> above triggers.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
