[ https://issues.apache.org/jira/browse/STORM-946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15220008#comment-15220008 ]

ASF GitHub Bot commented on STORM-946:
--------------------------------------

Github user nicom commented on the pull request:

    https://github.com/apache/storm/pull/639#issuecomment-203982738
  
    @caofangkun 
    
    Sorry for commenting on such an old pull request, but we recently also 
encountered the problem this patch is trying to address.
    
    > Is there any situation where (.close socket) is called but the socket is 
not removed from cached-node+port->socket?
    
    Indeed there is one other situation where the close() method is called:
    
https://github.com/apache/storm/blob/v0.9.6/storm-core/src/jvm/backtype/storm/messaging/netty/Client.java#L505
    
    This only happens if the number of reconnect attempts reaches the limit set 
in the `storm.messaging.netty.max_retries` config setting (I believe the default 
is 300). But the problem is aggravated by the fact that the `connectionAttempts` 
counter is never reset, so if a topology runs for long enough this will 
eventually become a problem.
    Actually, this particular issue has already been fixed in 0.10.0.
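    To make the failure mode concrete, here is a minimal sketch of the 
behaviour described above (the names are hypothetical, not the actual 
Client.java code):

```java
// Hypothetical sketch of the pre-0.10.0 reconnect accounting; field and
// method names are illustrative, not the real Client.java ones.
class ReconnectSketch {
    private final int maxRetries = 300;   // storm.messaging.netty.max_retries
    private int connectionAttempts = 0;   // never reset on success (the bug)

    void onConnectionLost() {
        if (++connectionAttempts > maxRetries) {
            close();              // client is closed for good, never retried
        } else {
            scheduleReconnect();
        }
    }

    void onConnectionEstablished() {
        // The 0.10.0 fix, as I understand it, amounts to resetting the
        // counter here: connectionAttempts = 0;
    }

    void close() { /* tear down the channel */ }
    void scheduleReconnect() { /* retry with backoff */ }
}
```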
    
    There is another, more serious issue that led to a huge problem in one of 
our topologies whenever a worker crashed due to some exception.
    If worker A successfully connects to worker B for the first time during 
startup, but worker B closes the connection for some reason before the 
`:worker-active-flag` is set to true (here 
https://github.com/apache/storm/blob/v0.9.6/storm-core/src/clj/backtype/storm/daemon/worker.clj#L356),
 there will be no further reconnect attempts, since no messages will be 
processed and neither `send()` nor `flushMessages()` will ever be called.
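    Sketched in code (hypothetical names; the real logic lives in Client.java), 
the failure mode looks roughly like this:

```java
// Illustrative sketch: reconnects are only ever triggered from the send path,
// so a connection dropped before the worker becomes active is never repaired.
class SendPathOnlyReconnect {
    private volatile boolean connected = false;

    void send(Object message) {
        if (!connected) {
            reconnect();   // the only place a reconnect is ever initiated
        }
        // ... enqueue and flush the message ...
    }

    // If the remote worker closes the connection before :worker-active-flag
    // is set, send() is never invoked, so reconnect() never runs and this
    // client stays dead forever.
    void reconnect() { /* re-establish the channel */ }
}
```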
    
    In our case this frequently happened because nimbus often rescheduled the 
whole topology right around the time the supervisor restarted the failed 
worker, leading to a race condition. This might have to do with the fact that 
`nimbus.task.timeout.secs` and `supervisor.worker.timeout.secs` have the same 
value by default.
    
    Tedxia's patch won't fix this problem, I believe, since the status returned 
by `status()` will be `ConnectionWithStatus$Status/Connecting` in this 
situation.
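    Roughly speaking (a hypothetical sketch mirroring the 
`ConnectionWithStatus$Status` values, not the actual patch code):

```java
// A client that lost its connection before ever becoming Ready still reports
// Connecting, so a cleanup that only removes Closed clients never fires.
enum Status { Connecting, Ready, Closed }  // mirrors ConnectionWithStatus$Status

class StuckClient {
    private boolean everConnected = false;  // never became Ready
    private boolean closed = false;         // close() was never called either

    Status status() {
        if (closed) return Status.Closed;
        return everConnected ? Status.Ready : Status.Connecting;  // stuck here
    }
}
```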
    



> We should remove Closed Client from cached-node+port->socket in worker
> ----------------------------------------------------------------------
>
>                 Key: STORM-946
>                 URL: https://issues.apache.org/jira/browse/STORM-946
>             Project: Apache Storm
>          Issue Type: Bug
>          Components: storm-core
>    Affects Versions: 0.10.0, 1.0.0
>            Reporter: xiajun
>
> The client may end up in Closed status after a reconnect fails, and we remove 
> the closed client from Context to avoid a memory leak.
> But there is also a reference to the closed Client in cached-node+port->socket 
> in the worker, so for the same reason we should also remove the closed Client 
> from cached-node+port->socket.
> There is another reason to do so. Consider this situation: worker A connects 
> to workers B1 and B2, but for some reason B1 and B2 die at the same time, and 
> nimbus reschedules workers B1 and B2. The new B1 and B2 may be partly 
> rescheduled to the same host:port as the old B1 and B2, e.g. (old B1: 
> host1+port1, old B2: host2+port2, new B1: host2+port2, new B2: host3+port3). 
> Worker A realizes that B1 and B2 died and starts reconnecting to them, but 
> because new B1 and old B2 have the same host+port, the current logic removes 
> the old B1 Client and creates a new Client for new B2, while doing nothing 
> about old B2 and new B1 since they share the same host+port. This causes the 
> topology to stop processing tuples. Once we remove closed Clients from 
> cached-node+port->socket before refresh-connections, this will not happen 
> again.
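The proposed fix, sketched in Java for illustration (the real cache lives in 
the worker's Clojure code; `isClosed` here is a hypothetical stand-in for a 
`ConnectionWithStatus` check):

```java
import java.util.Map;

// Hypothetical sketch: prune closed clients from the worker's
// node+port -> socket cache before refresh-connections runs, so a reused
// host:port gets a fresh connection instead of a stale, closed one.
class CacheCleanupSketch {
    interface IConnection {
        boolean isClosed();   // stand-in for a ConnectionWithStatus$Status check
    }

    static void removeClosedClients(Map<String, IConnection> cachedNodePortToSocket) {
        cachedNodePortToSocket.entrySet()
                .removeIf(e -> e.getValue().isClosed());
    }
}
```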


