[jira] [Commented] (TRAFODION-2245) Multiple sqcheck and jps processes running when monitor is downed and up as dcsserver checks if trafodion is up

Arvind Narain (JIRA) Sat, 24 Sep 2016 01:12:00 -0700

    [ 
https://issues.apache.org/jira/browse/TRAFODION-2245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15518653#comment-15518653
 ]


Arvind Narain commented on TRAFODION-2245:
------------------------------------------

Currently the DCS Server calls sqcheck to verify that Trafodion is up and 
running before (re)starting mxosrvrs on a particular node. If Trafodion is not 
running, the watcher thread sleeps for an increasing duration before retrying 
the check.

2016-09-22 13:20:25,178, ERROR, org.trafodion.dcs.server.ServerManager, Node 
Number: , CPU: , PID: , Process Name: , , ,Trafodion is not running
2016-09-22 13:20:25,179, INFO, org.trafodion.dcs.util.RetryCounter, Node 
Number: , CPU: , PID: , Process Name: , , ,Sleeping 10000ms before retry #1...
 
…
2016-09-22 13:20:40,250, ERROR, org.trafodion.dcs.server.ServerManager, Node 
Number: , CPU: , PID: , Process Name: , , ,Trafodion is not running
2016-09-22 13:20:40,251, INFO, org.trafodion.dcs.util.RetryCounter, Node 
Number: , CPU: , PID: , Process Name: , , ,Sleeping 20000ms before retry #2...
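
To illustrate the retry pattern in the log above, here is a minimal sketch of a 
check loop with an increasing sleep. It is not the actual 
org.trafodion.dcs.util.RetryCounter code: the class name, method names and 
parameters are hypothetical, and the linear growth is only an assumption that 
happens to match the two intervals logged (10000ms, then 20000ms).

```java
import java.util.function.BooleanSupplier;

public class BackoffRetry {

    // Sleep before retry #attempt (1-based). Linear growth is an ASSUMPTION;
    // the log above only shows the first two intervals.
    static long sleepMillis(int attempt, long baseMillis, long maxMillis) {
        return Math.min(baseMillis * attempt, maxMillis);
    }

    // Runs the check once, then retries up to maxRetries times, sleeping
    // longer before each retry. Returns true as soon as the check passes.
    static boolean retryUntilUp(BooleanSupplier isUp, int maxRetries,
                                long baseMillis, long maxMillis) {
        if (isUp.getAsBoolean()) {
            return true;
        }
        for (int attempt = 1; attempt <= maxRetries; attempt++) {
            try {
                Thread.sleep(sleepMillis(attempt, baseMillis, maxMillis));
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve interrupt status
                return false;
            }
            if (isUp.getAsBoolean()) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        int[] calls = {0};
        // Stand-in check that succeeds on the third call; short sleeps for the demo.
        boolean up = retryUntilUp(() -> ++calls[0] >= 3, 5, 10, 100);
        System.out.println("up=" + up + " after " + calls[0] + " checks");
    }
}
```

Note that each retry launches a new check, so if the check itself hangs and is 
never reaped, the checks pile up no matter how the sleep grows.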


Selva has proposed a solution for this problem, so I am assigning this Jira to 
him.

"Add a new option, sqcheck -n <nid>. This uses sqshell -c zone nid <nid> 
instead of the expensive regular sqcheck command.
Change dcsServer to check the status of the node when it detects the server is 
gone, rather than at the time of starting the server."
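
Separately from the proposal above, one way to keep hung checks from 
accumulating is to bound each external check with a timeout and kill it when 
the timeout expires. The sketch below is only an illustration, not the DCS 
implementation: the class and method names are hypothetical, the command passed 
in is a placeholder (the demo uses sleep as a stand-in for a hung check), and 
nothing is assumed about the output or exit codes of the proposed per-node 
check. It needs Java 9 or later for Redirect.DISCARD.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.TimeUnit;

public class BoundedCheck {

    // Sentinel returned when the command timed out or could not be started.
    static final int FAILED = -1;

    // Runs the command and waits at most timeoutSeconds for it to exit.
    // Returns the exit code, or FAILED on timeout, start failure or interrupt.
    static int runWithTimeout(List<String> command, long timeoutSeconds) {
        ProcessBuilder pb = new ProcessBuilder(command);
        pb.redirectErrorStream(true);
        // Discard output so the child cannot block on a full pipe.
        pb.redirectOutput(ProcessBuilder.Redirect.DISCARD);

        Process p;
        try {
            p = pb.start();
        } catch (IOException e) {
            return FAILED;
        }
        try {
            if (p.waitFor(timeoutSeconds, TimeUnit.SECONDS)) {
                return p.exitValue();
            }
            // Timed out: kill the direct child (not its descendants) and reap it.
            p.destroyForcibly();
            p.waitFor();
            return FAILED;
        } catch (InterruptedException e) {
            p.destroyForcibly();
            Thread.currentThread().interrupt(); // preserve interrupt status
            return FAILED;
        }
    }

    public static void main(String[] args) {
        // "sleep 5" stands in for a check that hangs; it is cut off after 1 second.
        int rc = runWithTimeout(Arrays.asList("sleep", "5"), 1);
        System.out.println(rc == FAILED ? "check timed out" : "exit code " + rc);
    }
}
```

If the check is a script that spawns further processes, killing only the direct 
child may leave those running, so the script itself would also need to clean up.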




> Multiple sqcheck and jps processes running when monitor is downed and up as 
> dcsserver checks if trafodion is up
> ---------------------------------------------------------------------------------------------------------------
>
>                 Key: TRAFODION-2245
>                 URL: https://issues.apache.org/jira/browse/TRAFODION-2245
>             Project: Apache Trafodion
>          Issue Type: Bug
>          Components: dcs
>    Affects Versions: 2.1-incubating
>         Environment: Testing trafodion when failures occurred.  HDP 2.4 
> distro contents and a standard installation on CentOS 6
>            Reporter: Carol Pearson
>
> Dcsserver checks if Trafodion is running by using sqcheck.  That can hang in 
> some circumstances.
> In this case we had a DTM failure and recovery took a while. The node went to 
> a SoftDown state as the DTM recovered.  Meanwhile, dcsserver was looking for 
> trafodion to come up so that it could start the mxosrvrs on that node.  That 
> resulted in many hung sqchecks - the notable symptom is that they all had the 
> same ppid.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
