[
https://issues.apache.org/jira/browse/TRAFODION-2235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15876842#comment-15876842
]
ASF GitHub Bot commented on TRAFODION-2235:
-------------------------------------------
Github user zcorrea commented on a diff in the pull request:
https://github.com/apache/incubator-trafodion/pull/958#discussion_r102332356
--- Diff: core/sqf/monitor/linux/zclient.cxx ---
@@ -472,6 +469,65 @@ void CZClient::CheckCluster( void )
TRACE_EXIT;
}
+void CZClient::CheckMyZNode( void )
+{
+ const char method_name[] = "CZClient::CheckMyZNode";
+ TRACE_ENTRY;
+
+ int zerr;
+ struct timespec currentTime;
+
+ if ( IsCheckCluster() )
+ {
+ if (resetMyZNodeFailedTime_)
+ {
+ resetMyZNodeFailedTime_ = false;
+ clock_gettime(CLOCK_REALTIME, &myZNodeFailedTime_);
+ myZNodeFailedTime_.tv_sec += (GetSessionTimeout() * 2);
+ if (trace_settings & (TRACE_INIT | TRACE_RECOVERY))
+ {
+ trace_printf( "%s@%d" " - Resetting MyZnode Fail Time %ld(secs)\n"
+ , method_name, __LINE__
+ , myZNodeFailedTime_.tv_sec );
+ }
+ }
+ if ( ! IsZNodeExpired( Node_name, zerr ) )
+ {
+ if ( zerr == ZCONNECTIONLOSS || zerr == ZOPERATIONTIMEOUT )
+ {
+ // Ignore transient errors with the quorum.
+ // However, if longer than the session
+ // timeout, handle it as a hard error.
+ clock_gettime(CLOCK_REALTIME, &currentTime);
+ if (currentTime.tv_sec > myZNodeFailedTime_.tv_sec)
--- End diff --
The desired behavior is to continually reset the local node's ZNode
expiration time while the ZNode has not expired, which is the normal state. The
only errors returned by IsZNodeExpired that are handled at this point are
communication errors with the Zookeeper quorum, i.e., connection loss
(ZCONNECTIONLOSS) and operation timeout (ZOPERATIONTIMEOUT). These errors can
be transient, but if they persist beyond myZNodeFailedTime_, the local ZNode
has gone beyond the local monitor's session expiration time window and the
local monitor must bring itself down.
So yes, resetMyZNodeFailedTime_ is set to true so that myZNodeFailedTime_ is
reset on each iteration, i.e., every zcMonitoringRateValue seconds. This keeps
pushing the expiration time out until there is a communication failure with
the Zookeeper quorum.
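The behavior described above can be sketched in isolation roughly as follows. This is only an illustration, not the monitor's code: ZNodeWatchdog, CheckResult, Action and OnCheck are hypothetical names, time is passed in as plain seconds so the sketch runs without Zookeeper, and the "set the reset flag again on every healthy check" step is an assumption taken from the explanation above, since the diff excerpt stops before that point. The factor of two on the session timeout mirrors the diff.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-ins for the outcome of one IsZNodeExpired() call.
enum class CheckResult { Ok, ConnectionLoss, OperationTimeout };

// What the caller should do after a check.
enum class Action { Continue, BringDown };

// Minimal sketch of the "push the deadline out while healthy" logic.
class ZNodeWatchdog {
public:
    explicit ZNodeWatchdog(std::int64_t sessionTimeoutSecs)
        : graceSecs_(sessionTimeoutSecs * 2)  // same factor as in the diff
    {
        assert(sessionTimeoutSecs > 0);
    }

    // Called once per monitoring interval with the current time in seconds.
    Action OnCheck(CheckResult result, std::int64_t nowSecs)
    {
        if (resetDeadline_) {
            // The previous check was healthy: move the deadline forward.
            resetDeadline_ = false;
            deadlineSecs_ = nowSecs + graceSecs_;
        }
        if (result == CheckResult::Ok) {
            // Normal state (assumed): request a reset on the next iteration.
            resetDeadline_ = true;
            return Action::Continue;
        }
        // Communication error: tolerated as transient until the deadline
        // passes, then treated as a hard error.
        return (nowSecs > deadlineSecs_) ? Action::BringDown
                                         : Action::Continue;
    }

private:
    std::int64_t graceSecs_;
    std::int64_t deadlineSecs_ = 0;
    bool resetDeadline_ = true;
};
```

With a 30 second session timeout, a healthy check at t=100 followed by errors starting at t=105 fixes the deadline at t=165. Errors are ignored up to and including t=165, and the first error after that asks the caller to bring the monitor down.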
> Enhance node failure detection and coordination
> -----------------------------------------------
>
> Key: TRAFODION-2235
> URL: https://issues.apache.org/jira/browse/TRAFODION-2235
> Project: Apache Trafodion
> Issue Type: Bug
> Components: foundation, installer
> Affects Versions: 2.1-incubating
> Reporter: Gonzalo E Correa
> Assignee: Gonzalo E Correa
>
> Certain server and network failures are not detected by the monitor processes,
> which causes a safety net failure detection mechanism to trigger in all
> Trafodion nodes. This safety net mechanism is controlled by the environment
> variable SQ_MON_SYNC_TIMEOUT, currently set at 15 minutes.
> This JIRA is to enhance the node failure mechanism in the Trafodion
> foundation components, specifically the monitor process, to detect a
> non-responsive node and handle it as a node down condition when a
> configurable timeout event is detected prior to the safety net failure
> mechanism described above.
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)