[ https://issues.apache.org/jira/browse/TRAFODION-2245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15526900#comment-15526900 ]
ASF GitHub Bot commented on TRAFODION-2245:
-------------------------------------------
Github user DaveBirdsall commented on a diff in the pull request:
https://github.com/apache/incubator-trafodion/pull/726#discussion_r80752396
--- Diff: dcs/src/main/java/org/trafodion/dcs/server/ServerManager.java ---
@@ -367,6 +335,36 @@ public ServerManager(Configuration conf, ZkClient zkc,
Constants.DEFAULT_DCS_SERVER_USER_PROGRAM_RESTART_HANDLER_RETRY_INTERVAL_MILLIS);
this.retryCounterFactory = new RetryCounterFactory(
this.maxRestartAttempts, this.retryIntervalMillis);
+ serverHandlers = new ServerHandler[this.childServers];
+ }
+
+ private static boolean isTrafodionRunning(String nid) {
+
+ // Check if the given Node is up and running
+ // return true else return false.
+ // Invoke sqcheck -n <nid> to check Node status.
+ //
+ // sqcheck returns:
+ // -1 - Not up ($?=255) or node down
+ // 0 - Fully up and operational or node up
+ // 1 - Partially up and operational
--- End diff --
I didn't see a code path in sqcheck that returned 1 or 2?
> Multiple sqcheck and jps processes running when monitor is downed and up as
> dcsserver checks if trafodion is up
> ---------------------------------------------------------------------------------------------------------------
>
> Key: TRAFODION-2245
> URL: https://issues.apache.org/jira/browse/TRAFODION-2245
> Project: Apache Trafodion
> Issue Type: Bug
> Components: dcs
> Affects Versions: 2.1-incubating
> Environment: Testing trafodion when failures occurred. HDP 2.4
> distro contents and a standard installation on CentOS 6
> Reporter: Carol Pearson
> Assignee: Selvaganesan Govindarajan
>
> Dcsserver checks whether Trafodion is running by invoking sqcheck, which can
> hang in some circumstances.
> In this case we had a DTM failure and recovery took a while. The node went to
> a SoftDown state as the DTM recovered. Meanwhile, dcsserver was looking for
> Trafodion to come up so that it could start the mxosrvrs on that node. That
> resulted in many hung sqcheck processes; the notable symptom is that they all
> had the same ppid.
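
One way to keep hung sqcheck invocations from piling up, as described in the issue above, is to bound each check with a timeout and kill the child if it does not return. The following is a minimal sketch under stated assumptions, not the code from the pull request: the class NodeCheck, the methods runWithTimeout and isNodeUp, the NO_RESULT sentinel, and the timeout value are all hypothetical names chosen for illustration. It relies only on the "sqcheck -n <nid>" invocation and the "0 means up" convention quoted in the diff, and it needs Java 9 or later for Redirect.DISCARD.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.TimeUnit;

public class NodeCheck {

    /** Hypothetical sentinel for "no exit code obtained"; POSIX exit codes are 0..255. */
    static final int NO_RESULT = Integer.MIN_VALUE;

    /**
     * Runs a command and returns its exit code, or NO_RESULT if the command
     * could not be started, timed out, or the calling thread was interrupted.
     */
    static int runWithTimeout(List<String> command, long timeoutMillis) {
        Process p = null;
        try {
            ProcessBuilder pb = new ProcessBuilder(command);
            pb.redirectErrorStream(true);
            // Discard output so a full pipe buffer can never block the child.
            pb.redirectOutput(ProcessBuilder.Redirect.DISCARD);
            p = pb.start();
            if (p.waitFor(timeoutMillis, TimeUnit.MILLISECONDS)) {
                return p.exitValue();
            }
        } catch (IOException e) {
            // The command could not be started (for example, not on the PATH).
        } catch (InterruptedException e) {
            // Preserve the interrupt for the caller, then clean up below.
            Thread.currentThread().interrupt();
        }
        if (p != null) {
            // Timed out or interrupted: do not leave the child hanging around.
            p.destroyForcibly();
        }
        return NO_RESULT;
    }

    /**
     * Assumption: "sqcheck -n <nid>" exits with 0 when the node is up, as the
     * comment in the diff states. Anything else, including a timeout, is
     * treated as "not up".
     */
    static boolean isNodeUp(String nid, long timeoutMillis) {
        return runWithTimeout(Arrays.asList("sqcheck", "-n", nid), timeoutMillis) == 0;
    }
}
```

A caller could then poll with something like NodeCheck.isNodeUp("2", 30000), where the node id and the 30 second limit are made-up values. Note that destroyForcibly() only terminates the direct child, so any processes that sqcheck itself spawns would need separate handling.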
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)