[
https://issues.apache.org/jira/browse/TRAFODION-2245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15518653#comment-15518653
]
Arvind Narain commented on TRAFODION-2245:
------------------------------------------
Currently the DCS Server checks whether Trafodion is up and running by calling
sqcheck before (re)starting mxosrvrs on a particular node. If Trafodion is not
running, the watcher thread sleeps for an increasing duration before retrying
the check:
2016-09-22 13:20:25,178, ERROR, org.trafodion.dcs.server.ServerManager, Node
Number: , CPU: , PID: , Process Name: , , ,Trafodion is not running
2016-09-22 13:20:25,179, INFO, org.trafodion.dcs.util.RetryCounter, Node
Number: , CPU: , PID: , Process Name: , , ,Sleeping 10000ms before retry #1...
…
2016-09-22 13:20:40,250, ERROR, org.trafodion.dcs.server.ServerManager, Node
Number: , CPU: , PID: , Process Name: , , ,Trafodion is not running
2016-09-22 13:20:40,251, INFO, org.trafodion.dcs.util.RetryCounter, Node
Number: , CPU: , PID: , Process Name: , , ,Sleeping 20000ms before retry #2...
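The retry behaviour visible in the log above can be sketched roughly as follows. This is only an illustration, not the actual org.trafodion.dcs.util.RetryCounter code: the class BackoffRetry and its methods are hypothetical names, and the delay rule (base delay multiplied by the retry number) is an assumption that merely fits the two values shown (10000ms, 20000ms); a doubling schedule would fit them equally well.

```java
import java.util.function.BooleanSupplier;

/** Hypothetical sketch of a check-and-retry loop with growing sleeps. */
final class BackoffRetry {

    /** Delay before retry number {@code attempt} (1-based): assumed linear growth. */
    public static long delayMillis(long baseMillis, int attempt) {
        return baseMillis * attempt;
    }

    /**
     * Runs {@code check} until it returns true or {@code maxRetries} retries
     * have been used. Returns true if the check eventually succeeded.
     */
    public static boolean waitUntilUp(BooleanSupplier check, long baseMillis, int maxRetries) {
        for (int attempt = 1; ; attempt++) {
            if (check.getAsBoolean()) {
                return true;                      // the checked service is up
            }
            if (attempt > maxRetries) {
                return false;                     // retries exhausted
            }
            long delay = delayMillis(baseMillis, attempt);
            System.out.println("Sleeping " + delay + "ms before retry #" + attempt + "...");
            try {
                Thread.sleep(delay);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt for the caller
                return false;
            }
        }
    }
}
```

Here the check is passed in as a BooleanSupplier, so the loop itself does not depend on how the "is Trafodion running" test is implemented.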
Selva has proposed a solution for this problem, so I am assigning this Jira to him:
"Add a new option sqcheck -n <nid> - This uses sqshell -c zone nid <nid>
instead of expensive regular sqcheck command.
Change dcsServer to check the status of node when it detects the server is gone
rather than at the time of starting the server."
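Separately from the proposal quoted above, one way to keep status checks from piling up as hung child processes is to bound how long each external command may run. The following is a minimal sketch of that idea, not part of Selva's proposal and not existing DCS code: the class BoundedCommand and the method runWithTimeout are hypothetical, and the sketch does not try to interpret the command's output or exit code. It uses ProcessBuilder.Redirect.DISCARD, which requires Java 9 or later.

```java
import java.io.IOException;
import java.util.List;
import java.util.concurrent.TimeUnit;

/** Hypothetical helper: run an external command with an upper bound on its runtime. */
final class BoundedCommand {

    /**
     * Starts {@code command} and waits at most {@code timeoutSeconds} for it.
     * Returns the exit code, or -1 if the command could not be started,
     * timed out (it is then killed), or the wait was interrupted.
     */
    public static int runWithTimeout(List<String> command, long timeoutSeconds) {
        Process p = null;
        try {
            p = new ProcessBuilder(command)
                    .redirectErrorStream(true)
                    // Discard output so the child never blocks on a full pipe.
                    .redirectOutput(ProcessBuilder.Redirect.DISCARD)
                    .start();
            if (!p.waitFor(timeoutSeconds, TimeUnit.SECONDS)) {
                p.destroyForcibly();              // do not leave a hung child behind
                return -1;
            }
            return p.exitValue();
        } catch (IOException e) {
            return -1;                            // command could not be started
        } catch (InterruptedException e) {
            if (p != null) {
                p.destroyForcibly();
            }
            Thread.currentThread().interrupt();
            return -1;
        }
    }
}
```

A caller could pass something like Arrays.asList("sqshell", "-c", "zone", "nid", "0") with a timeout of its choosing; how the arguments should be split and how the result should be read are assumptions that would need checking against the real tools.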
> Multiple sqcheck and jps processes running when monitor is downed and up as
> dcsserver checks if trafodion is up
> ---------------------------------------------------------------------------------------------------------------
>
> Key: TRAFODION-2245
> URL: https://issues.apache.org/jira/browse/TRAFODION-2245
> Project: Apache Trafodion
> Issue Type: Bug
> Components: dcs
> Affects Versions: 2.1-incubating
> Environment: Testing trafodion when failures occurred. HDP 2.4
> distro contents and a standard installation on CentOS 6
> Reporter: Carol Pearson
>
> Dcsserver checks if Trafodion is running by using sqcheck. That can hang in
> some circumstances.
> In this case we had a DTM failure and recovery took a while. The node went to
> a SoftDown state as the DTM recovered. Meanwhile, dcsserver was looking for
> trafodion to come up so that it could start the mxosrvrs on that node. That
> resulted in many hung sqchecks - the notable symptom is that they all had the
> same ppid.