[jira] [Commented] (STORM-532) Supervisor should restart worker immediately, if the worker process does not exist any more
[ https://issues.apache.org/jira/browse/STORM-532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14174863#comment-14174863 ]

ASF GitHub Bot commented on STORM-532:
--

Github user caofangkun commented on a diff in the pull request: https://github.com/apache/storm/pull/296#discussion_r19007702

--- Diff: storm-core/src/clj/backtype/storm/util.clj ---
@@ -372,6 +372,13 @@
     (throw (RuntimeException. (str "Got unexpected process name: " name (first split))))
+(defn exists-process?
+  [process-id]
+  (if on-windows?
+    (exec-command! (str "tasklist /v /fi \"PID eq " process-id "\""))
--- End diff --

@itaifrenkel Thank you.

== Check if process id 21320 exists, should return 0 ==
C:\Users\***>tasklist /v /fi "PID eq 21320" | find /i "21320"
cmd.exe    21320    Console    1    3,184 K    Running    ***    0:00:00    C:\windows\system32\cmd.exe
C:\Users\***>echo %errorlevel%
0

== Check if process id 21322 exists, should return 1 ==
C:\Users\***>tasklist /v /fi "PID eq 21320" | find /i "21322"
C:\Users\***>echo %errorlevel%
1

Supervisor should restart worker immediately, if the worker process does not exist any more
Key: STORM-532
URL: https://issues.apache.org/jira/browse/STORM-532
Project: Apache Storm
Issue Type: Improvement
Affects Versions: 0.10.0
Reporter: caofangkun
Priority: Minor

For now, if the worker process no longer exists, the Supervisor has to wait a few seconds for the worker heartbeat to time out before it restarts the worker. If the supervisor knew the worker's process id and checked whether the process exists in the sync-processes thread, it could restart the worker sooner:
1: record the worker process id in the worker's local heartbeat
2: in the supervisor's sync-processes, get the process id from the worker's local heartbeat and check whether the process exists
3: if not, restart it immediately

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
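The PID-existence check under discussion can be sketched cross-platform. This is a hedged Python sketch, not Storm's actual code: `process_exists` is a hypothetical helper, and the Windows branch mirrors the `tasklist | find` approach from the diff above.

```python
import os
import subprocess
import sys

def process_exists(pid: int) -> bool:
    """Check whether a process with the given pid is currently running."""
    if sys.platform == "win32":
        # tasklist exits 0 even when the filter matches nothing, so the
        # output itself must be searched for the pid (hence the `find`
        # pipe in the original diff).
        out = subprocess.run(
            ["tasklist", "/fi", f"PID eq {pid}"],
            capture_output=True, text=True,
        ).stdout
        return str(pid) in out
    try:
        os.kill(pid, 0)          # signal 0: existence check, no signal sent
        return True
    except ProcessLookupError:   # no process with this pid
        return False
    except PermissionError:      # exists, but owned by another user
        return True
```

Note that matching the bare pid string against tasklist output can still false-positive on substrings or memory figures, which is the subtlety raised later in this thread.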
[GitHub] storm pull request: STORM-329 : buffer message in client and recon...
Github user clockfly commented on the pull request: https://github.com/apache/storm/pull/268#issuecomment-59487520

Suppose we are sending data from worker A to worker B. To solve STORM-404 (worker on one machine crashes due to a failure of another worker on another machine), I think we can adopt the following logic:

Case 1: B is down.
1. B is lost, but A still believes B is alive.
2. A tries to send data to B, which triggers a reconnect.
3. Nimbus finds that B is lost and notifies A.
4. A, notified that B is down, interrupts the reconnection of step 2 (by closing the connection).
5. The reconnection of step 2 is interrupted and exits; it does not throw a RuntimeException.

The key change is at step 4: A needs to interrupt the reconnection to an obsolete worker.

Case 2: B is alive, but the connection from A to B is down.
1. A triggers the reconnection logic.
2. The reconnection times out.
3. A cannot handle this failure, so A throws a RuntimeException.

--- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
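The two cases above amount to an interruptible retry loop. Here is a minimal Python sketch of that shape (illustrative only; Storm's actual client is Clojure/Java, and `ReconnectLoop` and `try_connect` are hypothetical names):

```python
import threading

class ReconnectLoop:
    """Sketch of the proposed logic.

    If Nimbus reports the target worker as gone, close() is called and the
    retry loop exits quietly (case 1).  If the target is presumed alive but
    retries are exhausted, we raise (case 2).
    """

    def __init__(self, max_retries: int = 3):
        self.max_retries = max_retries
        self.closed = threading.Event()   # set when Nimbus says B is dead

    def close(self):
        self.closed.set()

    def reconnect(self, try_connect) -> bool:
        for _ in range(self.max_retries):
            if self.closed.is_set():
                return False              # obsolete worker: give up quietly
            if try_connect():
                return True               # connection restored
        raise RuntimeError("reconnection to a live worker timed out")
```

The design point is that the interrupt (case 1) must win over the timeout (case 2), so a stale peer never surfaces as a fatal error.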
[jira] [Commented] (STORM-329) Add Option to Config Message handling strategy when connection timeout
[ https://issues.apache.org/jira/browse/STORM-329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14174874#comment-14174874 ]

ASF GitHub Bot commented on STORM-329:
--

Github user clockfly commented on the pull request: https://github.com/apache/storm/pull/268#issuecomment-59487520 (see the comment above).

Add Option to Config Message handling strategy when connection timeout
Key: STORM-329
URL: https://issues.apache.org/jira/browse/STORM-329
Project: Apache Storm
Issue Type: Improvement
Affects Versions: 0.9.2-incubating
Reporter: Sean Zhong
Priority: Minor
Labels: Netty
Fix For: 0.9.2-incubating

This is to address a [concern brought up|https://github.com/apache/incubator-storm/pull/103#issuecomment-43632986] during the work at STORM-297:
{quote}
[~revans2] wrote: Your logic makes sense to me on why these calls are blocking. My biggest concern around the blocking is the case of a worker crashing. If a single worker crashes, this can block the entire topology from executing until that worker comes back up. In some cases I can see that being something you would want. In other cases I can see speed being the primary concern, and some users would like to get partial data fast rather than accurate data later. Could we make it configurable in a follow-up JIRA, where we can have a max limit to the buffering that is allowed before we block, or throw data away (which is what zeromq does)?
{quote}
If a worker crashes suddenly, how should we handle the messages that were supposed to be delivered to it?
1. Should we buffer all messages indefinitely?
2. Should we block message sending until the connection is resumed?
3. Should we configure a buffer limit: buffer messages first, and block once the limit is reached?
4. Should we neither block nor buffer too much, but instead drop the messages and rely on Storm's built-in failover mechanism?
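Options 3 and 4 above can be sketched with a bounded buffer. This is a minimal Python sketch under assumed semantics (`make_sender` and its strategy names are hypothetical, not Storm's API; `BlockingIOError` stands in for an actual blocking send):

```python
from collections import deque

def make_sender(strategy: str, limit: int = 2):
    """Sketch of strategies 3 and 4 from the discussion.

    'block' -- buffer up to `limit`, then signal that the caller would
               block (strategy 3);
    'drop'  -- buffer up to `limit`, then silently drop and rely on
               Storm's acking/replay failover (strategy 4).
    Returns (enqueue, buffer).
    """
    buf = deque()

    def enqueue(msg) -> bool:
        if len(buf) < limit:
            buf.append(msg)
            return True
        if strategy == "drop":
            return False  # dropped; replay mechanism recovers the tuple
        raise BlockingIOError("buffer full; caller would block here")

    return enqueue, buf
```

Strategy 1 (unbounded buffering) is the `limit = infinity` degenerate case, and strategy 2 (always block) is `limit = 0` with the 'block' behaviour.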
[jira] [Updated] (STORM-178) sending metrics to Ganglia
[ https://issues.apache.org/jira/browse/STORM-178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

caofangkun updated STORM-178:
-
Description: Provide something like ./conf/storm-metrics.properties and GangliaContext.java to collect and send metrics to Ganglia.
(was: Provide something like ./conf/storm-metrics.properties and GangliaContext.java to collect and send metrics to Ganglia.)

sending metrics to Ganglia
Key: STORM-178
URL: https://issues.apache.org/jira/browse/STORM-178
Project: Apache Storm
Issue Type: Improvement
Reporter: caofangkun
Priority: Minor
[jira] [Updated] (STORM-178) sending metrics to Ganglia
[ https://issues.apache.org/jira/browse/STORM-178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

caofangkun updated STORM-178:
-
Attachment: storm-ganglia.png
[GitHub] storm pull request: STORM-532:Supervisor should restart worker imm...
Github user itaifrenkel commented on a diff in the pull request: https://github.com/apache/storm/pull/296#discussion_r19012098

--- Diff: storm-core/src/clj/backtype/storm/util.clj ---
@@ -372,6 +372,13 @@
     (throw (RuntimeException. (str "Got unexpected process name: " name (first split))))
+(defn exists-process?
+  [process-id]
+  (if on-windows?
+    (exec-command! (str "tasklist /v /fi \"PID eq " process-id
+                        "\" | find /i \"" process-id "\""))
--- End diff --

On Windows it's not that simple. See my SO post here: http://stackoverflow.com/a/26423642/985297
[jira] [Commented] (STORM-532) Supervisor should restart worker immediately, if the worker process does not exist any more
[ https://issues.apache.org/jira/browse/STORM-532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14174941#comment-14174941 ]

ASF GitHub Bot commented on STORM-532:
--

Github user itaifrenkel commented on a diff in the pull request: https://github.com/apache/storm/pull/296#discussion_r19012098 (see the comment above).
[GitHub] storm pull request: STORM-532: Supervisor should restart worker imm...
Github user itaifrenkel commented on the pull request: https://github.com/apache/storm/pull/293#issuecomment-59499800

Hi @caofangkun. Could you then close this pull request? Please see my comments in #296.
[jira] [Commented] (STORM-532) Supervisor should restart worker immediately, if the worker process does not exist any more
[ https://issues.apache.org/jira/browse/STORM-532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14174947#comment-14174947 ]

ASF GitHub Bot commented on STORM-532:
--

Github user itaifrenkel commented on the pull request: https://github.com/apache/storm/pull/293#issuecomment-59499800 (see the comment above).
[GitHub] storm pull request: STORM-329 : buffer message in client and recon...
Github user HeartSaVioR commented on the pull request: https://github.com/apache/storm/pull/268#issuecomment-59529389

I have a question (maybe a comment) about your PR. (Since I don't know Storm deeply, I could be wrong. Please correct me if I am!)

When we enqueue tuples in the Client, the queued tuples seem to be discarded when a worker goes down, Nimbus reassigns its task to another worker, and the worker finally updates its task-socket mapping. But if we enqueue tuples in the Drainer, the queued tuples could still be sent to the new worker once the task-socket cache has been updated. If I'm right, it would be better to place the flusher in the TransferDrainer.
[jira] [Commented] (STORM-329) Add Option to Config Message handling strategy when connection timeout
[ https://issues.apache.org/jira/browse/STORM-329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14175144#comment-14175144 ]

ASF GitHub Bot commented on STORM-329:
--

Github user HeartSaVioR commented on the pull request: https://github.com/apache/storm/pull/268#issuecomment-59529389 (see the comment above).