[
https://issues.apache.org/jira/browse/FLUME-808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13132149#comment-13132149
]
Mingjie Lai commented on FLUME-808:
-----------------------------------
After looked at the source code, I found that the DirectDriver's lifecycle
management is within one thread. It indicate that DirectDriver cannot close
source/sink gracefully if the next() or append() is blocking. And most of
source/sinks are blocking.
The user guide's explanation makes more sense:
...close() This call is usually made from a separate thread than the open() or
next() calls...
An easy fix is to move source/sink.append() to main thread, to give
source/sink.close() a chance to exit next() gracefully. However it seems brings
impact to some existing source/sink which rely on interruption exception to
close it.
> Source/sink.close() is not invoked by DirectDriver if source.next() is
> blocking.
> --------------------------------------------------------------------------------
>
> Key: FLUME-808
> URL: https://issues.apache.org/jira/browse/FLUME-808
> Project: Flume
> Issue Type: Bug
> Components: Node, Sinks+Sources
> Affects Versions: v0.9.4
> Reporter: Mingjie Lai
>
> I just noticed an issue that DirectDriver doesn't really call the
> sink/source.close() at all, if flume node gets reconfigured.
> Very easy to reproduce:
> 1. start a master, and node at the same machine
> 2. configure the node thru master, with any configure, e.g.:
> tail("/tmp/aaa"), null
> 3. reconfigure the node thru master: collectorSource(12345), null
> The source.close() didn't get called at all for reconfigure, or rolled. And
> later on, Driver will decide to interrupt the thread since it seems not be
> able to be closed gracefully.
> {code}
> 2011-10-19 22:46:03,646 [Heartbeat] WARN agent.LivenessManager: Heartbeats
> are backing up, currently behind by 1 heartbeats
> 2011-10-19 22:46:08,648 [Heartbeat] WARN agent.LivenessManager: Heartbeats
> are backing up, currently behind by 2 heartbeats
> 2011-10-19 22:46:13,650 [Heartbeat] WARN agent.LivenessManager: Heartbeats
> are backing up, currently behind by 3 heartbeats
> 2011-10-19 22:46:18,652 [Heartbeat] WARN agent.LivenessManager: Heartbeats
> are backing up, currently behind by 4 heartbeats
> 2011-10-19 22:46:23,648 [Check config] ERROR agent.LogicalNode: Forcing
> driver to exit uncleanly
> 2011-10-19 22:46:23,648 [logicalNode c1-19] ERROR connector.DirectDriver:
> Closing down due to exception during append calls
> 2011-10-19 22:46:23,648 [Check config] INFO agent.LogicalNode: Node config
> successfully set to com.cloudera.flume.conf.FlumeConfigData@39a2f02e
> 2011-10-19 22:46:23,648 [logicalNode c1-19] INFO connector.DirectDriver:
> Connector logicalNode c1-19 exited with error: Waiting for queue element was
> interrupted! null
> {code}
> It may be brought by Flume-596 which is a big refactor of Driver. I tried to
> git-reset to 21b74010c34cef9a977c75ab5dec4dc747d8f5aa, and cannot reproduce
> the problem.
> Expected result -- source should be close()'ed
> {code}
> 2011-10-19 23:20:45,811 [SpawningLogicalNode c1] INFO
> collector.CollectorSource: closed
> 2011-10-19 23:20:45,812 [SpawningLogicalNode c1] INFO
> thrift.ThriftEventSource: Closed server on port 35853...
> 2011-10-19 23:20:45,817 [SpawningLogicalNode c1] INFO
> thrift.ThriftEventSource: Queue still has 0 elements ...
> 2011-10-19 23:20:45,852 [logicalNode c1-20] INFO collector.CollectorSource:
> closed
> 2011-10-19 23:20:45,852 [logicalNode c1-20] INFO thrift.ThriftEventSource:
> Closed server on port 35853...
> 2011-10-19 23:20:45,853 [logicalNode c1-20] INFO thrift.ThriftEventSource:
> Queue still has 0 elements ...
> 2011-10-19 23:20:45,853 [logicalNode c1-20] INFO collector.CollectorSource:
> closed
> 2011-10-19 23:20:45,853 [logicalNode c1-20] INFO thrift.ThriftEventSource:
> Closed server on port 35853...
> 2011-10-19 23:20:45,853 [logicalNode c1-20] INFO thrift.ThriftEventSource:
> Queue still has 0 elements ...
> 2011-10-19 23:20:45,853 [logicalNode c1-20] INFO agent.LogicalNode: c1:
> Connector stopped: CollectorSource | NullSink
> 2011-10-19 23:20:45,853 [SpawningLogicalNode c1] INFO agent.LogicalNode: Node
> config successfully set to com.cloudera.flume.conf.FlumeConfigData@2682d210
> 2011-10-19 23:20:45,864 [logicalNode c1-23] INFO agent.LogicalNode: Connector
> started: TailSource | NullSink
> {code}
> It might be the root cause of FLUME-798, and related to all the recent
> Interrupted exception discussion on user@.
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators:
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira