[ 
https://issues.apache.org/jira/browse/TRAFODION-2235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15517422#comment-15517422
 ] 

Gonzalo E Correa commented on TRAFODION-2235:
---------------------------------------------

Monitor changes:
 
• Added monitor sync thread epoll timeout logic
• Added Zookeeper client thread to the monitor
 
Configuration changes in ‘sqenvcom.sh’:
 
# Monitor sync thread epoll wait timeout is in seconds
# Currently 15 seconds x 3 retries = 45 seconds in total
export SQ_MON_EPOLL_WAIT_TIMEOUT=15
export SQ_MON_EPOLL_RETRY_COUNT=3
 
# Monitor Zookeeper client
# - A zero value disables the zclient logic in the monitor process.
# It is enabled by default in a real cluster, disabled otherwise.
# (must be disabled to debug monitor processes in a real cluster)
#export SQ_MON_ZCLIENT_ENABLED=0
# - Session timeout in seconds defines when Zookeeper quorum determines a
# non-responsive monitor zclient which results in a Trafodion node down.
# Default is 60 seconds (1 minute) which is the maximum Zookeeper allows.
# (15 seconds longer than EPOLL timeout above).
#export SQ_MON_ZCLIENT_SESSION_TIMEOUT=60
 
The EPOLL timeout expires at 45 seconds, meaning that the monitor-to-monitor 
communication in the sync thread has detected that one of the monitor processes 
has not completed its IO. This triggers a node down for the corresponding node 
in every monitor that detects the timeout (the assumption is that all monitors 
detect the timeout except the monitor that can’t talk, for whatever reason). 
However, node down processing also removes the znode of the corresponding 
node, so if that monitor is able to continue, it will detect that its session 
has expired and will execute node down processing as well.
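
The 45-second figure appears to be the product of the two EPOLL settings 
(a 15-second wait, retried 3 times). A minimal shell sketch of that 
arithmetic follows. It assumes the sync thread gives up only after 
SQ_MON_EPOLL_RETRY_COUNT consecutive waits have timed out; the function name 
is hypothetical and is not part of sqenvcom.sh.

```shell
#!/bin/sh
# Hypothetical helper, not part of sqenvcom.sh.
# Assumption: effective timeout = wait timeout * retry count.
effective_epoll_timeout() {
    wait_secs=${1:-${SQ_MON_EPOLL_WAIT_TIMEOUT:-15}}
    retries=${2:-${SQ_MON_EPOLL_RETRY_COUNT:-3}}
    echo $((wait_secs * retries))
}

# With the values from this comment: 15 * 3
effective_epoll_timeout 15 3
```

Called without arguments, the helper falls back to the exported variables, 
so it can be run after sourcing the environment file to see the total wait.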
 
The ZCLIENT SESSION TIMEOUT expires at 60 seconds, meaning that the ZClient 
logic in the monitor has not sent an I’m Alive message to the Zookeeper quorum 
in that time, and the quorum declares the corresponding znode expired. This 
also triggers a node down for the corresponding node in all monitors, including 
the monitor that can’t send the I’m Alive message to the quorum (if it is still 
alive).
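
Since the session timeout is meant to be longer than the EPOLL timeout, the 
relationship between the settings can be checked before starting the 
monitors. The sketch below is only an illustration: the function name is 
hypothetical, and it assumes the effective EPOLL timeout is the wait timeout 
multiplied by the retry count.

```shell
#!/bin/sh
# Hypothetical sanity check, not part of sqenvcom.sh.
# Assumption: effective EPOLL timeout = wait timeout * retry count.
# Arguments: epoll wait (s), epoll retry count, zclient session timeout (s)
check_timeout_order() {
    epoll_total=$(($1 * $2))
    zk_session=$3
    if [ "$zk_session" -gt "$epoll_total" ]; then
        echo "ok: epoll=${epoll_total}s zclient=${zk_session}s margin=$((zk_session - epoll_total))s"
        return 0
    fi
    echo "warning: zclient session timeout (${zk_session}s) does not exceed epoll timeout (${epoll_total}s)"
    return 1
}

# Values from this comment: 15s wait, 3 retries, 60s session timeout
check_timeout_order 15 3 60
```

With the values above, the sync thread path would fire first and the 
Zookeeper session expiry acts as the backstop 15 seconds later.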

> Enhance node failure detection and coordination
> -----------------------------------------------
>
>                 Key: TRAFODION-2235
>                 URL: https://issues.apache.org/jira/browse/TRAFODION-2235
>             Project: Apache Trafodion
>          Issue Type: Bug
>          Components: foundation, installer
>    Affects Versions: 2.1-incubating
>            Reporter: Gonzalo E Correa
>            Assignee: Gonzalo E Correa
>
> Certain server and network failures are not detected by the monitor processes, 
> which causes a safety net failure detection mechanism to trigger in all 
> Trafodion nodes. This safety net mechanism is controlled by the environment 
> variable SQ_MON_SYNC_TIMEOUT, currently set at 15 minutes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
