[jira] [Commented] (STORM-532) Supervisor should restart worker immediately, if the worker process does not exist any more

2014-10-17 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/STORM-532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14174863#comment-14174863
 ] 

ASF GitHub Bot commented on STORM-532:
--

Github user caofangkun commented on a diff in the pull request:

https://github.com/apache/storm/pull/296#discussion_r19007702
  
--- Diff: storm-core/src/clj/backtype/storm/util.clj ---
@@ -372,6 +372,13 @@
       (throw (RuntimeException. (str "Got unexpected process name: " name))))
     (first split)))
 
+(defn exists-process?
+  [process-id]
+  (if on-windows?
+    (exec-command! (str "tasklist /v /fi \"PID eq " process-id "\""))
--- End diff --

@itaifrenkel 
Thank you.
== Check if process id 21320 exists, should return 0 ==
C:\Users\***>tasklist /v /fi "PID eq 21320" | find /i "21320"
cmd.exe    21320 Console    1    3,184 K    Running    ***    0:00:00    C:\windows\system32\cmd.exe

C:\Users\***>echo %errorlevel%
0

== Check if process id 21322 exists, should return 1 ==
C:\Users\***>tasklist /v /fi "PID eq 21320" | find /i "21322"

C:\Users\***>echo %errorlevel%
1
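
For reference, a self-contained Clojure sketch of the check demonstrated above. It uses clojure.java.shell/sh instead of the PR's exec-command! so it runs on its own, and the POSIX kill -0 fallback is an assumption for illustration, not part of the pull request.

(require '[clojure.java.shell :as shell])

;; Sketch only. On Windows, tasklist exits 0 even when the filter matches
;; nothing, so the output is piped through `find`, whose exit code signals
;; existence (0 = found, 1 = not found), as the transcript above shows.
(defn exists-process? [process-id]
  (let [windows? (.startsWith (System/getProperty "os.name") "Windows")
        result   (if windows?
                   (shell/sh "cmd" "/c"
                             (str "tasklist /v /fi \"PID eq " process-id "\""
                                  " | find /i \"" process-id "\""))
                   ;; POSIX fallback (an assumption): kill -0 probes without signalling
                   (shell/sh "kill" "-0" (str process-id)))]
    (zero? (:exit result))))

;; (exists-process? 21320) ;=> true while pid 21320 is running, false otherwise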


 Supervisor should restart worker immediately, if the worker process does not 
 exist any more 
 

 Key: STORM-532
 URL: https://issues.apache.org/jira/browse/STORM-532
 Project: Apache Storm
  Issue Type: Improvement
Affects Versions: 0.10.0
Reporter: caofangkun
Priority: Minor

 Currently, if the worker process no longer exists, the Supervisor has to 
 wait a few seconds for the worker heartbeat to time out before it restarts 
 the worker.
 If the Supervisor knew the worker's process id and checked whether the 
 process still exists in the sync-processes thread, it could restart the 
 worker with less delay.
 1: record the worker process id in the worker's local heartbeat 
 2: in the Supervisor's sync-processes, get the process id from the worker's 
 local heartbeat and check whether the process exists 
 3: if it does not, restart the worker immediately
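
A minimal Clojure sketch of steps 2 and 3 above, under stated assumptions: read-worker-heartbeats, the :process-id key, and kill-and-restart-worker! are hypothetical stand-ins rather than the supervisor's real functions, and exists-process? is the helper proposed in the pull request.

;; Hypothetical names throughout; collaborators are passed in to keep the
;; sketch self-contained.
(defn check-dead-workers!
  [conf read-worker-heartbeats kill-and-restart-worker!]
  (doseq [[worker-id hb] (read-worker-heartbeats conf)]
    (when-let [pid (:process-id hb)]        ; pid recorded in the local heartbeat (step 1)
      (when-not (exists-process? pid)       ; the check proposed in the PR (step 2)
        ;; the process is already gone, so restart without waiting for the
        ;; heartbeat timeout (step 3)
        (kill-and-restart-worker! worker-id)))))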



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[GitHub] storm pull request: STORM-329 : buffer message in client and recon...

2014-10-17 Thread clockfly
Github user clockfly commented on the pull request:

https://github.com/apache/storm/pull/268#issuecomment-59487520
  
Suppose we are sending data from worker A to worker B. To solve 
STORM-404 (a worker on one machine crashes due to the failure of another 
worker on another machine), I think we can adopt the following logic:

case 1: when B is down:
1. B is lost, but A still believes B is alive. 
2. A tries to send data to B, which triggers a reconnect.
3. Nimbus finds that B is lost and notifies A.
4. A gets the notification that B is down, so it needs to interrupt the 
reconnection of step 2 (by closing the connection).
5. The reconnection of step 2 is interrupted and exits; it will not throw a 
RuntimeException. 

The key change is at step 4: A needs to interrupt the reconnection to an 
obsolete worker. 

case 2: when B is alive, but the connection from A to B is down:
1. A triggers the reconnection logic.
2. The reconnection times out.
3. A cannot handle this failure, so A throws a RuntimeException.
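
A rough Clojure sketch of that control flow (illustrative only; Storm's Netty client is Java, and connect!, close!, and the retry parameters here are made-up names):

;; The point is that closing the client flips a flag the reconnect loop
;; observes, so a reconnect to an obsolete worker exits quietly (case 1),
;; while a genuine timeout still surfaces as a RuntimeException (case 2).
(defn make-client [connect! max-retries retry-interval-ms]
  (let [closed? (atom false)]
    {:close!     (fn [] (reset! closed? true))
     :reconnect! (fn []
                   (loop [attempt 1]
                     (cond
                       @closed?   :aborted      ; steps 4/5: target is obsolete, stop quietly
                       (connect!) :connected    ; connect! is a stand-in, truthy on success
                       (>= attempt max-retries)
                       (throw (RuntimeException. "reconnection timed out"))
                       :else (do (Thread/sleep retry-interval-ms)
                                 (recur (inc attempt))))))}))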


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[jira] [Commented] (STORM-329) Add Option to Config Message handling strategy when connection timeout

2014-10-17 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/STORM-329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14174874#comment-14174874
 ] 

ASF GitHub Bot commented on STORM-329:
--

Github user clockfly commented on the pull request:

https://github.com/apache/storm/pull/268#issuecomment-59487520
  
Suppose we are sending data from worker A to worker B. To solve 
STORM-404 (a worker on one machine crashes due to the failure of another 
worker on another machine), I think we can adopt the following logic:

case 1: when B is down:
1. B is lost, but A still believes B is alive. 
2. A tries to send data to B, which triggers a reconnect.
3. Nimbus finds that B is lost and notifies A.
4. A gets the notification that B is down, so it needs to interrupt the 
reconnection of step 2 (by closing the connection).
5. The reconnection of step 2 is interrupted and exits; it will not throw a 
RuntimeException. 

The key change is at step 4: A needs to interrupt the reconnection to an 
obsolete worker. 

case 2: when B is alive, but the connection from A to B is down:
1. A triggers the reconnection logic.
2. The reconnection times out.
3. A cannot handle this failure, so A throws a RuntimeException.


 Add Option to Config Message handling strategy when connection timeout
 --

 Key: STORM-329
 URL: https://issues.apache.org/jira/browse/STORM-329
 Project: Apache Storm
  Issue Type: Improvement
Affects Versions: 0.9.2-incubating
Reporter: Sean Zhong
Priority: Minor
  Labels: Netty
 Fix For: 0.9.2-incubating


 This is to address a [concern brought 
 up|https://github.com/apache/incubator-storm/pull/103#issuecomment-43632986] 
 during the work on STORM-297:
 {quote}
 [~revans2] wrote: Your logic makes sense to me on why these calls are 
 blocking. My biggest concern around the blocking is the case of a worker 
 crashing. If a single worker crashes, this can block the entire topology from 
 executing until that worker comes back up. In some cases I can see that being 
 something that you would want. In other cases I can see speed being the 
 primary concern, and some users would like to get partial data fast rather 
 than accurate data later.
 Could we make it configurable in a follow-up JIRA where we can have a max 
 limit on the buffering that is allowed before we block or throw data away 
 (which is what zeromq does)?
 {quote}
 If some worker crashes suddenly, how should the messages that were supposed 
 to be delivered to that worker be handled?
 1. Should we buffer all messages indefinitely?
 2. Should we block message sending until the connection is restored?
 3. Should we configure a buffer limit, buffer messages up to that limit, and 
 block once it is reached?
 4. Should we neither block nor buffer too much, but instead drop the 
 messages and rely on Storm's built-in failover mechanism? 
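
A minimal Clojure sketch of options 3 and 4 above (not Storm's transport code; the queue, the policy keywords, and the helper names are illustrative):

(import '(java.util.concurrent ArrayBlockingQueue TimeUnit))

;; Option 3: block the sender once the bounded buffer is full.
;; Option 4: drop the message instead and rely on Storm's ack/fail replay.
(defn make-pending-buffer [capacity policy]
  (let [q (ArrayBlockingQueue. capacity)]
    {:offer! (fn [msg]
               (case policy
                 :block (do (.put q msg) true)                   ; waits for free space
                 :drop  (.offer q msg 0 TimeUnit/MILLISECONDS))) ; false => dropped
     :drain! (fn []
               (let [out (java.util.ArrayList.)]
                 (.drainTo q out)
                 (vec out)))}))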



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (STORM-178) sending metrics to Ganglia

2014-10-17 Thread caofangkun (JIRA)

 [ 
https://issues.apache.org/jira/browse/STORM-178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

caofangkun updated STORM-178:
-
Description: 
provide something like ./conf/storm-metrics.properties 
and GangliaContext.java 

Collect and send metrics to Ganglia


  was:
provide something like ./conf/storm-metrics.properties 
and GangliaContext.java 

Collect and send metrics to Ganglia


 sending metrics to Ganglia
 --

 Key: STORM-178
 URL: https://issues.apache.org/jira/browse/STORM-178
 Project: Apache Storm
  Issue Type: Improvement
Reporter: caofangkun
Priority: Minor

 provide something like ./conf/storm-metrics.properties 
 and GangliaContext.java 
 Collect and send metrics to Ganglia



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (STORM-178) sending metrics to Ganglia

2014-10-17 Thread caofangkun (JIRA)

 [ 
https://issues.apache.org/jira/browse/STORM-178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

caofangkun updated STORM-178:
-
Attachment: storm-ganglia.png

 sending metrics to Ganglia
 --

 Key: STORM-178
 URL: https://issues.apache.org/jira/browse/STORM-178
 Project: Apache Storm
  Issue Type: Improvement
Reporter: caofangkun
Priority: Minor
 Attachments: storm-ganglia.png


 provide something like ./conf/storm-metrics.properties 
 and GangliaContext.java 
 Collect and send metrics to Ganglia



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[GitHub] storm pull request: STORM-532:Supervisor should restart worker imm...

2014-10-17 Thread itaifrenkel
Github user itaifrenkel commented on a diff in the pull request:

https://github.com/apache/storm/pull/296#discussion_r19012098
  
--- Diff: storm-core/src/clj/backtype/storm/util.clj ---
@@ -372,6 +372,13 @@
       (throw (RuntimeException. (str "Got unexpected process name: " name))))
     (first split)))
 
+(defn exists-process?
+  [process-id]
+  (if on-windows?
+    (exec-command! (str "tasklist /v /fi \"PID eq " process-id "\" | find /i \"" process-id "\""))
--- End diff --

On Windows it's not that simple. See my SO post here:
http://stackoverflow.com/a/26423642/985297


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[jira] [Commented] (STORM-532) Supervisor should restart worker immediately, if the worker process does not exist any more

2014-10-17 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/STORM-532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14174941#comment-14174941
 ] 

ASF GitHub Bot commented on STORM-532:
--

Github user itaifrenkel commented on a diff in the pull request:

https://github.com/apache/storm/pull/296#discussion_r19012098
  
--- Diff: storm-core/src/clj/backtype/storm/util.clj ---
@@ -372,6 +372,13 @@
       (throw (RuntimeException. (str "Got unexpected process name: " name))))
     (first split)))
 
+(defn exists-process?
+  [process-id]
+  (if on-windows?
+    (exec-command! (str "tasklist /v /fi \"PID eq " process-id "\" | find /i \"" process-id "\""))
--- End diff --

On Windows it's not that simple. See my SO post here:
http://stackoverflow.com/a/26423642/985297


 Supervisor should restart worker immediately, if the worker process does not 
 exist any more 
 

 Key: STORM-532
 URL: https://issues.apache.org/jira/browse/STORM-532
 Project: Apache Storm
  Issue Type: Improvement
Affects Versions: 0.10.0
Reporter: caofangkun
Priority: Minor

 Currently, if the worker process no longer exists, the Supervisor has to 
 wait a few seconds for the worker heartbeat to time out before it restarts 
 the worker.
 If the Supervisor knew the worker's process id and checked whether the 
 process still exists in the sync-processes thread, it could restart the 
 worker with less delay.
 1: record the worker process id in the worker's local heartbeat 
 2: in the Supervisor's sync-processes, get the process id from the worker's 
 local heartbeat and check whether the process exists 
 3: if it does not, restart the worker immediately



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[GitHub] storm pull request: STORM-532,Supervisor should restart worker imm...

2014-10-17 Thread itaifrenkel
Github user itaifrenkel commented on the pull request:

https://github.com/apache/storm/pull/293#issuecomment-59499800
  
Hi @caofangkun. Could you then close this pull request? 
Please see my comments in #296 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[jira] [Commented] (STORM-532) Supervisor should restart worker immediately, if the worker process does not exist any more

2014-10-17 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/STORM-532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14174947#comment-14174947
 ] 

ASF GitHub Bot commented on STORM-532:
--

Github user itaifrenkel commented on the pull request:

https://github.com/apache/storm/pull/293#issuecomment-59499800
  
Hi @caofangkun. Could you then close this pull request? 
Please see my comments in #296 


 Supervisor should restart worker immediately, if the worker process does not 
 exist any more 
 

 Key: STORM-532
 URL: https://issues.apache.org/jira/browse/STORM-532
 Project: Apache Storm
  Issue Type: Improvement
Affects Versions: 0.10.0
Reporter: caofangkun
Priority: Minor

 Currently, if the worker process no longer exists, the Supervisor has to 
 wait a few seconds for the worker heartbeat to time out before it restarts 
 the worker.
 If the Supervisor knew the worker's process id and checked whether the 
 process still exists in the sync-processes thread, it could restart the 
 worker with less delay.
 1: record the worker process id in the worker's local heartbeat 
 2: in the Supervisor's sync-processes, get the process id from the worker's 
 local heartbeat and check whether the process exists 
 3: if it does not, restart the worker immediately



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[GitHub] storm pull request: STORM-329 : buffer message in client and recon...

2014-10-17 Thread HeartSaVioR
Github user HeartSaVioR commented on the pull request:

https://github.com/apache/storm/pull/268#issuecomment-59529389
  
I have a question (maybe a comment) about your PR.
(Since I don't know Storm deeply, it could be wrong. Please correct me 
if I'm wrong!)

When we enqueue tuples in the Client, the queued tuples seem to be discarded 
when a worker goes down, Nimbus reassigns its tasks to another worker, and 
the worker finally updates its task-socket mapping.
But if we enqueue tuples in the Drainer, the queued tuples could still be 
sent to the new worker once the task -> socket cache is refreshed.
If I'm right, it would be better to place the flusher in the TransferDrainer.
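
If it helps, a small Clojure sketch of the idea (illustrative only; TransferDrainer's real API differs, and the map shapes here are assumptions): buffer tuples per task and resolve task -> node -> connection only at flush time, so tuples queued before a reassignment still go to whichever worker now owns the task.

;; Assumed shapes: pending-by-task {task-id [tuples]}, task->node {task-id node},
;; node->connection {node connection}; send-fn stands in for the real transport send.
(defn flush-pending! [pending-by-task task->node node->connection send-fn]
  (doseq [[task msgs] pending-by-task]
    (when-let [conn (node->connection (task->node task))]
      (send-fn conn msgs))))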


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[jira] [Commented] (STORM-329) Add Option to Config Message handling strategy when connection timeout

2014-10-17 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/STORM-329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14175144#comment-14175144
 ] 

ASF GitHub Bot commented on STORM-329:
--

Github user HeartSaVioR commented on the pull request:

https://github.com/apache/storm/pull/268#issuecomment-59529389
  
I have a question (maybe a comment) about your PR.
(Since I don't know Storm deeply, it could be wrong. Please correct me 
if I'm wrong!)

When we enqueue tuples in the Client, the queued tuples seem to be discarded 
when a worker goes down, Nimbus reassigns its tasks to another worker, and 
the worker finally updates its task-socket mapping.
But if we enqueue tuples in the Drainer, the queued tuples could still be 
sent to the new worker once the task -> socket cache is refreshed.
If I'm right, it would be better to place the flusher in the TransferDrainer.


 Add Option to Config Message handling strategy when connection timeout
 --

 Key: STORM-329
 URL: https://issues.apache.org/jira/browse/STORM-329
 Project: Apache Storm
  Issue Type: Improvement
Affects Versions: 0.9.2-incubating
Reporter: Sean Zhong
Priority: Minor
  Labels: Netty
 Fix For: 0.9.2-incubating


 This is to address a [concern brought 
 up|https://github.com/apache/incubator-storm/pull/103#issuecomment-43632986] 
 during the work on STORM-297:
 {quote}
 [~revans2] wrote: Your logic makes sense to me on why these calls are 
 blocking. My biggest concern around the blocking is the case of a worker 
 crashing. If a single worker crashes, this can block the entire topology from 
 executing until that worker comes back up. In some cases I can see that being 
 something that you would want. In other cases I can see speed being the 
 primary concern, and some users would like to get partial data fast rather 
 than accurate data later.
 Could we make it configurable in a follow-up JIRA where we can have a max 
 limit on the buffering that is allowed before we block or throw data away 
 (which is what zeromq does)?
 {quote}
 If some worker crashes suddenly, how should the messages that were supposed 
 to be delivered to that worker be handled?
 1. Should we buffer all messages indefinitely?
 2. Should we block message sending until the connection is restored?
 3. Should we configure a buffer limit, buffer messages up to that limit, and 
 block once it is reached?
 4. Should we neither block nor buffer too much, but instead drop the 
 messages and rely on Storm's built-in failover mechanism? 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)