[ 
https://issues.apache.org/jira/browse/FLUME-706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13083001#comment-13083001
 ] 

Jonathan Hsieh commented on FLUME-706:
--------------------------------------

Ok, I think I have a sequence of events that creates the bug.  An interaction 
between three kinds of threads causes it.  CCT is the Check Config Thread (gets 
config data from a queue), HBT is the heartbeat thread (checks for spawning, 
checks for configs and enqueues them for CCT, and other stuff not relevant to 
this), and PT1 and PT2 are pumper/driver threads.  In this situation only one 
FlumeConfigData (FCD) is passed; it contains the ThriftSource spec and some 
arbitrary Sink spec.

CCT starts, blocks on empty queue
HBT makes rpcs to master to heartbeat
HBT calls checkLogicalNode, learns about new logicalnode
 HBT realizes logicalnode is new, 
 HBT gets FCD (thriftSource, sink), 
 HBT "spawns" the node by starting PT1 with FCD info.  // (this does not update 
the config version number)
PT1 starts, calling thriftSource.open, sink.open
PT1 enters thriftSource.append loop (shipping data from source to sink)
HBT checkLogicalNodeConfigs 
 HBT notices that it needs a new config 
 HBT fetches and enqueues FCD // (this actually has already been fetched by 
checkLogicalNode step) 
CCT unblocks dequeuing a flumeConfigData(FCD)
CCT calls logicalnode's checkConfig(FCD)
 CCT believes the last good FCD is (nullSource, nullSink at unixtime 0)  // 
(this is because we didn't update the version number earlier)
 CCT attempts to load the FCD because it thinks it is new.
  CCT instantiates new instances of source and sink.
  CCT attempts to nicely shut down the previous PT (PT1) via a stop call on the 
driver thread.
  CCT attempts to join on PT
  CCT times out on join and then issues a thread cancel (depending on where PT 
is, it may not catch this interruption)
  CCT attempts to start new direct driver thread (PT2)
PT2's open attempt fails because network port already bound (doesn't throw 
exception)
PT1 reaches a close call (doesn't close right away because queue is full of 
stuff)
  CCT finally sets last good config.

ThriftSource is in a closing state, and neither PT1 nor PT2 is functioning 
properly.
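
The bind failure that PT2 runs into can be reproduced in isolation.  This is 
only a minimal sketch using plain java.net.ServerSocket rather than Thrift, and 
the class and method names (PortConflictDemo, secondBindFails) are made up for 
illustration.  It shows that a second listener on a port that is still held by 
the first one fails with a BindException, which is presumably the exception 
that thrift swallows here.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.BindException;
import java.net.ServerSocket;

public class PortConflictDemo {

    /** Returns true if binding a second listener to an already-bound port fails. */
    static boolean secondBindFails() {
        // Port 0 asks the OS for any free port, so the demo does not depend on 12345.
        try (ServerSocket first = new ServerSocket(0)) {
            int port = first.getLocalPort();
            // Analogous to PT2 opening the source while PT1 still holds the port.
            try (ServerSocket second = new ServerSocket(port)) {
                return false; // bind unexpectedly succeeded
            } catch (BindException e) {
                return true;  // "Address already in use"
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("second bind fails: " + secondBindFails());
    }
}
```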

There are a few paper cuts, but a quick fix seems to be to set the config 
version properly during the spawn so that the second load of the same FCD 
isn't initiated.
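
A rough sketch of that quick fix, to make the idea concrete.  This is not the 
Flume code: Config and Node are hypothetical stand-ins for FlumeConfigData and 
the logical node, and the recordVersion flag exists only to show the buggy and 
fixed behavior side by side.

```java
public class SpawnVersionSketch {

    /** Hypothetical stand-in for FlumeConfigData: a version stamp plus specs. */
    static final class Config {
        final long version;
        final String source;
        final String sink;

        Config(long version, String source, String sink) {
            this.version = version;
            this.source = source;
            this.sink = sink;
        }
    }

    /** Hypothetical stand-in for a logical node. */
    static final class Node {
        // Starts at 0, like the "nullSource, nullSink at unixtime 0" default.
        private long lastGoodVersion = 0;
        private int driversStarted = 0;

        /** Spawn path (HBT). With recordVersion == false this mimics the bug. */
        synchronized void spawn(Config c, boolean recordVersion) {
            driversStarted++;
            if (recordVersion) {
                lastGoodVersion = c.version; // the proposed quick fix
            }
        }

        /** checkConfig path (CCT): reload only if the config looks newer. */
        synchronized boolean checkConfig(Config c) {
            if (c.version <= lastGoodVersion) {
                return false; // same or older config, nothing to do
            }
            driversStarted++; // would stop the old driver and start a new one
            lastGoodVersion = c.version;
            return true;
        }

        synchronized int driversStarted() {
            return driversStarted;
        }
    }

    /** Spawns a node and then hands the same config to checkConfig. */
    static int run(boolean recordVersion) {
        Node node = new Node();
        Config fcd = new Config(42L, "thriftSource", "sink");
        node.spawn(fcd, recordVersion);
        node.checkConfig(fcd);
        return node.driversStarted();
    }

    public static void main(String[] args) {
        System.out.println("drivers started without fix: " + run(false));
        System.out.println("drivers started with fix:    " + run(true));
    }
}
```

Without recording the version at spawn time, the same FCD starts two drivers; 
with it, checkConfig treats the second delivery as a no-op.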

Another, likely more robust, approach is to make spawn happen in the single 
CCT thread instead of the HBT.
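
A sketch of what that could look like, again with made-up names 
(SingleThreadSpawnSketch, Config, a negative version as a shutdown marker) and 
not the actual Flume classes.  The heartbeat side only enqueues configs; one 
consumer thread owns the version check and the spawn, so a duplicate FCD is 
compared against state that only that thread updates.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.atomic.AtomicInteger;

public class SingleThreadSpawnSketch {

    /** Hypothetical stand-in for FlumeConfigData; only the version matters here. */
    static final class Config {
        final long version;

        Config(long version) {
            this.version = version;
        }
    }

    /** Enqueues the same config twice and returns how many drivers were started. */
    static int run() {
        BlockingQueue<Config> queue = new LinkedBlockingQueue<>();
        AtomicInteger driversStarted = new AtomicInteger();

        // The only thread allowed to spawn or reconfigure (plays the CCT role).
        Thread cct = new Thread(() -> {
            long lastGoodVersion = 0;
            try {
                while (true) {
                    Config c = queue.take();
                    if (c.version < 0) {
                        return; // shutdown marker, an assumption of this sketch
                    }
                    if (c.version > lastGoodVersion) {
                        driversStarted.incrementAndGet(); // spawn happens here
                        lastGoodVersion = c.version;
                    }
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "check-config");
        cct.start();

        // Heartbeat side: both checks just enqueue, even if they see the same FCD.
        Config fcd = new Config(42L);
        queue.add(fcd);
        queue.add(fcd);
        queue.add(new Config(-1L));

        try {
            cct.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return driversStarted.get();
    }

    public static void main(String[] args) {
        System.out.println("drivers started: " + run());
    }
}
```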


  

> Flume nodes launch duplicate logical nodes
> ------------------------------------------
>
>                 Key: FLUME-706
>                 URL: https://issues.apache.org/jira/browse/FLUME-706
>             Project: Flume
>          Issue Type: Bug
>          Components: Master, Node
>    Affects Versions: v0.9.5
>            Reporter: E. Sammer
>            Assignee: E. Sammer
>            Priority: Critical
>             Fix For: v0.9.5
>
>         Attachments: FLUME-706.log
>
>
> When submitting a config command to the flume master, it seems as if the 
> downstream node attempts to load the config twice.
> In a test case, starting a single master and a single node, I submitted a 
> "config node rpcSource(12345) console". The node sees the config change on 
> the next heartbeat and updates its config and starts the thrift source on 
> port 12345. Immediately after, it logs "Taking another heartbeat" (DEBUG) and 
> attempts to create another logical node with the same config. This leads to 
> thrift errors in bind() and "Could not create ServerSocket on address ...". 
> Looking at the root cause in a debugger (thrift swallows the original 
> exception) I can see it's an "Address already in use" IOException.


        
