[
https://issues.apache.org/jira/browse/FLUME-706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13083001#comment-13083001
]
Jonathan Hsieh commented on FLUME-706:
--------------------------------------
Ok, I think I have a sequence of events that creates the bug. An interaction
among three kinds of threads causes it. CCT is the Check Config Thread (gets
config data from a queue), HBT is the heartbeat thread (checks for spawning,
checks for configs and enqueues them for CCT, plus other work not relevant
here), and PT1 and PT2 are pumper/driver threads. In this situation only one
FlumeConfigData (FCD) is passed; it contains the ThriftSource spec and some
arbitrary sink spec.
CCT starts, blocks on empty queue
HBT makes rpcs to master to heartbeat
HBT calls checkLogicalNode, learns about new logicalnode
HBT realizes logicalnode is new,
HBT gets FCD (thriftSource, sink),
HBT "spawns" the node by starting PT1 with FCD info. // (this does not update
the config version number)
PT1 starts, calling thriftSource.open, sink.open
PT1 enters thriftSource.append loop (shipping data from source to sink)
HBT checkLogicalNodeConfigs
HBT notices that it needs a new config
HBT fetches and enqueues FCD // (this has actually already been fetched by
the checkLogicalNode step)
CCT unblocks, dequeuing a FlumeConfigData (FCD)
CCT calls logicalnode's checkConfig(FCD)
CCT believes the last good FCD is (nullSource, nullSink at unixtime 0) //
(this is because we didn't update the version number earlier)
CCT attempts to load the FCD because it thinks it is new.
CCT instantiates new instances of source and sink.
CCT attempts to nicely shut down the previous PT, PT1 (via stop call on
driver thread)
CCT attempts to join on PT
CCT times out on join and then issues a thread cancel (depending on where PT
is, it may not catch this interruption)
CCT attempts to start new direct driver thread (PT2)
PT2's open attempt fails because the network port is already bound (it doesn't
throw an exception)
PT1 reaches a close call (doesn't close right away because queue is full of
stuff)
CCT finally sets last good config.
ThriftSource is in a closing state, and neither PT1 nor PT2 is functioning
properly.
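The port conflict at the PT2 open step can be reproduced outside of Flume. This
is only a minimal sketch using plain java.net sockets, not Thrift or Flume
code; the class and method names are made up for illustration, and port 0 (any
free port) stands in for the ThriftSource port.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.BindException;
import java.net.InetAddress;
import java.net.ServerSocket;

// Hypothetical demo class; not part of Flume or Thrift.
class DoubleBindDemo {
    /** Returns true if binding a second listener to an already-bound port is rejected. */
    static boolean secondBindFails() {
        InetAddress loopback = InetAddress.getLoopbackAddress();
        // "first" plays the role of the source opened by PT1 and still holding the port.
        try (ServerSocket first = new ServerSocket(0, 50, loopback)) {
            int port = first.getLocalPort();
            // "second" plays the role of PT2 opening a new source on the same port.
            try (ServerSocket second = new ServerSocket(port, 50, loopback)) {
                return false; // the bind unexpectedly succeeded
            } catch (BindException e) {
                return true;  // typically reported as "Address already in use"
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("second bind rejected: " + secondBindFails());
    }
}
```

Unlike this sketch, the failure in the sequence above does not surface as an
exception to the caller, which is part of why PT2 ends up in a bad state.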
There are a few paper cuts, but a quick fix seems to be to set the version
properly during the spawn so that the second load, which gets the same FCD,
isn't initiated.
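A rough sketch of that quick fix, under the assumption that the logical node
tracks a "last good" version and that checkConfig skips any FCD that is not
newer. ConfigData and LogicalNodeSketch are hypothetical stand-ins, not the
real Flume classes, and load() only counts calls instead of opening a source
and sink.

```java
// Hypothetical sketch; names and fields are assumptions, not Flume's API.
class VersionGuardSketch {
    /** Stand-in for FlumeConfigData: a version plus source and sink specs. */
    static final class ConfigData {
        final long version;
        final String sourceSpec;
        final String sinkSpec;

        ConfigData(long version, String sourceSpec, String sinkSpec) {
            this.version = version;
            this.sourceSpec = sourceSpec;
            this.sinkSpec = sinkSpec;
        }
    }

    static final class LogicalNodeSketch {
        // Version 0 models the initial (nullSource, nullSink at unixtime 0) config.
        private long lastGoodVersion = 0L;
        private int loads = 0;

        /** Called from the heartbeat thread when the node is first seen. */
        synchronized void spawn(ConfigData fcd) {
            load(fcd);
            lastGoodVersion = fcd.version; // the update that is missing in the buggy sequence
        }

        /** Called from the check-config thread. Returns true if a reload happened. */
        synchronized boolean checkConfig(ConfigData fcd) {
            if (fcd.version <= lastGoodVersion) {
                return false; // same or older config: leave the running driver alone
            }
            load(fcd);
            lastGoodVersion = fcd.version;
            return true;
        }

        private void load(ConfigData fcd) {
            loads++; // a real node would open the source and sink and start a driver
        }

        synchronized int loadCount() {
            return loads;
        }
    }

    public static void main(String[] args) {
        ConfigData fcd = new ConfigData(42L, "rpcSource(12345)", "console");
        LogicalNodeSketch node = new LogicalNodeSketch();
        node.spawn(fcd);
        System.out.println("reloaded on duplicate FCD: " + node.checkConfig(fcd));
    }
}
```

With the version recorded in spawn, the duplicate FCD dequeued by CCT is
ignored, so no second driver is started against the same port.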
Another, likely more robust, approach is to make spawn happen in the single
CCT thread instead of the HBT.
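One way to picture the single-thread approach, as a sketch only: the heartbeat
thread never spawns or reloads anything itself and just hands every FCD it
sees to one worker thread, which decides between spawn, reload, and ignore.
The class, the method names, and the use of a single-thread executor are
assumptions for illustration, not the actual Flume design.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch; not Flume's real classes.
class SingleThreadConfigSketch {
    // The only thread allowed to spawn or reconfigure logical nodes.
    private final ExecutorService configThread = Executors.newSingleThreadExecutor();
    // Read and written only on configThread, so no extra locking is needed.
    private final Map<String, Long> loadedVersion = new HashMap<>();
    // Records what the config thread did, for inspection after shutdown.
    private final List<String> actions = Collections.synchronizedList(new ArrayList<String>());

    /** Called by the heartbeat thread for both "new node" and "new config". */
    void submit(String node, long version) {
        configThread.execute(() -> {
            Long current = loadedVersion.get(node);
            if (current == null) {
                loadedVersion.put(node, version);
                actions.add("spawn " + node + "@" + version);
            } else if (version > current) {
                loadedVersion.put(node, version);
                actions.add("reload " + node + "@" + version);
            }
            // Otherwise this version is already running: nothing to do.
        });
    }

    /** Stops the config thread and returns the actions it performed, in order. */
    List<String> shutdownAndGetActions() {
        configThread.shutdown();
        try {
            configThread.awaitTermination(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for config thread", e);
        }
        return new ArrayList<>(actions);
    }

    public static void main(String[] args) {
        SingleThreadConfigSketch sketch = new SingleThreadConfigSketch();
        sketch.submit("node1", 1L); // discovered via checkLogicalNode
        sketch.submit("node1", 1L); // same FCD seen again via checkLogicalNodeConfigs
        System.out.println(sketch.shutdownAndGetActions());
    }
}
```

Because spawn and reload are serialized on one thread and compared against the
same version map, the second submission of the same FCD cannot race with the
first.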
> Flume nodes launch duplicate logical nodes
> ------------------------------------------
>
> Key: FLUME-706
> URL: https://issues.apache.org/jira/browse/FLUME-706
> Project: Flume
> Issue Type: Bug
> Components: Master, Node
> Affects Versions: v0.9.5
> Reporter: E. Sammer
> Assignee: E. Sammer
> Priority: Critical
> Fix For: v0.9.5
>
> Attachments: FLUME-706.log
>
>
> When submitting a config command to the flume master, it seems as if the
> downstream node attempts to load the config twice.
> In a test case, starting a single master and a single node, I submitted a
> "config node rpcSource(12345) console". The node sees the config change on
> the next heartbeat and updates its config and starts the thrift source on
> port 12345. Immediately after, it logs "Taking another heartbeat" (DEBUG) and
> attempts to create another logical node with the same config. This leads to
> thrift errors in bind() and "Could not create ServerSocket on address ...".
> Looking at the root cause in a debugger (thrift swallows the original
> exception) I can see it's an "Address already in use" IOException.