[ https://issues.apache.org/jira/browse/TRAFODION-2235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15876742#comment-15876742 ]
ASF GitHub Bot commented on TRAFODION-2235:
-------------------------------------------
Github user trinakrug commented on a diff in the pull request:
https://github.com/apache/incubator-trafodion/pull/958#discussion_r102274683
--- Diff: core/sqf/monitor/linux/zclient.cxx ---
@@ -472,6 +469,65 @@ void CZClient::CheckCluster( void )
TRACE_EXIT;
}
+void CZClient::CheckMyZNode( void )
+{
+ const char method_name[] = "CZClient::CheckMyZNode";
+ TRACE_ENTRY;
+
+ int zerr;
+ struct timespec currentTime;
+
+ if ( IsCheckCluster() )
+ {
+ if (resetMyZNodeFailedTime_)
+ {
+ resetMyZNodeFailedTime_ = false;
+ clock_gettime(CLOCK_REALTIME, &myZNodeFailedTime_);
+ myZNodeFailedTime_.tv_sec += (GetSessionTimeout() * 2);
+ if (trace_settings & (TRACE_INIT | TRACE_RECOVERY))
+ {
+ trace_printf( "%s@%d" " - Resetting MyZnode Fail Time %ld(secs)\n"
+ , method_name, __LINE__
+ , myZNodeFailedTime_.tv_sec );
+ }
+ }
+ if ( ! IsZNodeExpired( Node_name, zerr ) )
+ {
+ if ( zerr == ZCONNECTIONLOSS || zerr == ZOPERATIONTIMEOUT )
+ {
+ // Ignore transient errors with the quorum.
+ // However, if longer than the session
+ // timeout, handle it as a hard error.
+ clock_gettime(CLOCK_REALTIME, &currentTime);
+ if (currentTime.tv_sec > myZNodeFailedTime_.tv_sec)
--- End diff --
If resetMyZNodeFailedTime_ is true, then this if statement will always
evaluate to true. Just verifying that is the desired outcome.
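To make the question above concrete, here is a minimal, self-contained sketch of the deadline arithmetic in the diff. MakeFailDeadline and IsPastDeadline are hypothetical helper names, not part of zclient.cxx, and the session timeout is assumed to be a non-negative number of seconds. Under these assumptions, a deadline that was just reset lies two session timeouts in the future, so the comparison is false on the call that reset it and becomes true only after more than that much wall-clock time has passed.

```cpp
#include <cassert>
#include <time.h>

// Hypothetical stand-in for the reset branch in CheckMyZNode():
// the deadline is "now" plus twice the session timeout (in seconds).
timespec MakeFailDeadline(const timespec &now, int sessionTimeoutSecs)
{
    // Assumption of this sketch: the timeout is never negative.
    assert(sessionTimeoutSecs >= 0);

    timespec deadline = now;
    deadline.tv_sec += sessionTimeoutSecs * 2;
    return deadline;
}

// Mirrors the comparison under review. Only whole seconds are
// compared, as in the diff; tv_nsec is ignored.
bool IsPastDeadline(const timespec &now, const timespec &deadline)
{
    return now.tv_sec > deadline.tv_sec;
}
```

With now.tv_sec = 1000 and a 30 second session timeout, the deadline is 1060, so IsPastDeadline is false at 1000 and at 1060, and true from 1061 onward.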
> Enhance node failure detection and coordination
> -----------------------------------------------
>
> Key: TRAFODION-2235
> URL: https://issues.apache.org/jira/browse/TRAFODION-2235
> Project: Apache Trafodion
> Issue Type: Bug
> Components: foundation, installer
> Affects Versions: 2.1-incubating
> Reporter: Gonzalo E Correa
> Assignee: Gonzalo E Correa
>
> Certain server and network failures are not detected by the monitor processes,
> which causes a safety net failure detection mechanism to trigger in all
> Trafodion nodes. This safety net mechanism is controlled by the environment
> variable SQ_MON_SYNC_TIMEOUT, currently set at 15 minutes.
> This JIRA is to enhance the node failure mechanism in the Trafodion
> foundation components, specifically the monitor process, to detect a
> non-responsive node and handle it as a node down condition when a
> configurable timeout event is detected prior to the safety net failure
> mechanism described above.
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)