[ 
https://issues.apache.org/jira/browse/FLUME-1685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13556253#comment-13556253
 ] 

Steve Hoffman commented on FLUME-1685:
--------------------------------------

New review created.  Old one discarded.

WRT the configurable flag, I also came across this issue: FLUME-979
I believe this is the more common case of a user deciding if they want the 
restart under "failure" conditions.  The difference being 979 deals with the 
exec'ed process dying and if flume should re-exec the command.  FLUME-1685 is 
about flume killing the exec'ed process because of a channel append error 
(full, broken, etc) leaving a restart of flume (and other channels it may be 
dealing with as well) the only recovery option for the channel.  As you said 
ExecSource can't roll back - but you have the data and nowhere to put it.  
Killing the source results in identical behavior - throwing the event away - 
except the channel will never recover w/o a restart (again effecting other 
channels).  I maintain this is undesirable behavior under any situation (hence 
my opposition to a configuration flag on this matter).

Here we plugged into metrics collection to alarm on when a channel fills (or 
better before it is full so we can maybe do something about it) rather than 
trying to watch the forked process.  We've had many occasion where the forked 
process appears to be running but not actually doing anything (aka generating 
events).  We use graphite not ganglia but the idea is the same.  Like us, they 
are probably watching the channel depth, rate data is being read at the source 
end and rate at which it comes out the sink end for an indicator of health.  
Beyond that, it'll be specific to the use case.

Yes, it doesn't hurt to add the flag, but the default you are suggesting seems 
non-obvious.
                
> ExecSource shouldn't die if the channel is full
> -----------------------------------------------
>
>                 Key: FLUME-1685
>                 URL: https://issues.apache.org/jira/browse/FLUME-1685
>             Project: Flume
>          Issue Type: Bug
>          Components: Sinks+Sources
>    Affects Versions: v1.2.0, v1.3.0, v1.4.0
>            Reporter: Steve Hoffman
>         Attachments: 
> 0001-FLUME-1685-don-t-kill-ExecSource-if-channel-is-full-branch-v1.2.0.patch, 
> 0001-FLUME-1685-don-t-kill-ExecSource-if-channel-is-full-trunk.patch
>
>
> Imagine this scenario.  You are using the ExecSource to tail a file and send 
> to a file channel.  When the file channel fills due to a temporary issue 
> downstream, the source gets a ChannelException which kills the source.
> {code}
> 2012-10-31 20:45:57,872 ERROR source.ExecSource: Failed while running 
> command: tail -F /tmp/test.log
> org.apache.flume.ChannelException: Unable to put batch on required channel: 
> FileChannel test { dataDirs: [/tmp/test/data] }
>         at 
> org.apache.flume.channel.ChannelProcessor.processEventBatch(ChannelProcessor.java:195)
>         at 
> org.apache.flume.source.ExecSource$ExecRunnable.run(ExecSource.java:275)
>         at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
>         at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:138)
>         at 
> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>         at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>         at java.lang.Thread.run(Thread.java:662)
> Caused by: org.apache.flume.ChannelException: Cannot acquire capacity. 
> [channel=hbasejson]
>         at 
> org.apache.flume.channel.file.FileChannel$FileBackedTransaction.doPut(FileChannel.java:346)
>         at 
> org.apache.flume.channel.BasicTransactionSemantics.put(BasicTransactionSemantics.java:93)
>         at 
> org.apache.flume.channel.BasicChannelSemantics.put(BasicChannelSemantics.java:76)
>         at 
> org.apache.flume.channel.ChannelProcessor.processEventBatch(ChannelProcessor.java:184)
>         ... 7 more
> {code}
> The situation where the command being 'exec'ed fails/exits is already handled 
> with the existing retry logic.
> I suggest that when the source gets a ChannelException it throw the event 
> away (since there is nowhere to put it) and instead sleep for second and loop 
> again for another event.  If the channel is still throwing an exception 
> (still full), the event dropped and the sleep time doubled and we repeat 
> again.  There should be an upper bound on the retry time (say 128 seconds -- 
> about 2 minutes) for the next attempt.  When the putEvent no longer throws a 
> ChannelException, the "fallback" mode is reset and we read records at full 
> speed again.
> Clearly in a situation where the channel is full, data loss will happen.  But 
> in this case, we wouldn't have to restart the agent.  At scale this is an 
> administrative pain.  Even detecting this is difficult as the flume agent 
> itself is still running.  In this case (running a 'tail'), the tail will 
> eventually result in data loss should the file being tailed rotate.  
> Something has to give somewhere.
> I've got a patch I'm working on for this, but wanted to get the JIRA rolling 
> first.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

Reply via email to