[
https://issues.apache.org/jira/browse/FLUME-2055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13947060#comment-13947060
]
Mike Percy commented on FLUME-2055:
-----------------------------------
Hmm, we should consider making retries go on forever, with exponential backoff
between attempts. That would do a little more to solve the dead-NameNode
problem... although if the NN outage is brief (e.g. an HA failover), if there
are intermittent network timeouts, or if a DN goes down, then FLUME-2007
addresses the issue.
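The retry-forever-with-backoff idea could be sketched as a generic shell wrapper like this; this is illustrative only (not Flume code), and the function names and the 300-second cap are assumptions:

```shell
#!/bin/sh
# Illustrative sketch only: retry a command until it succeeds, doubling the
# wait between attempts and capping it, per the suggestion above.

backoff_delay() {
    # Delay in seconds before retry number $1: 1, 2, 4, ... capped at 300.
    if [ "$1" -ge 10 ]; then
        echo 300
    else
        echo $(( 1 << ($1 - 1) ))
    fi
}

retry_forever() {
    n=1
    until "$@"; do
        sleep "$(backoff_delay "$n")"
        n=$((n + 1))
    done
}

# Example usage: retry_forever hdfs dfs -test -d /flume/events
```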
Mark, as you say, in some cases having orphaned .tmp files is inevitable... for
example, if you kill -9 the Flume process it will leave .tmp files behind,
since it never gets a chance to roll them. Those files contain valid data, and
Flume will not retry the same data later, so they should be manually renamed to
non-.tmp files in that case. It may make sense to set up a cron job to trawl
for very old .tmp files (say, 12 or 24 hours old) and rename them... it depends
on the use case.
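A cron-driven clean-up like that could be sketched as below. The directory `/flume/events`, the 24-hour cutoff, and the function name are assumptions for illustration; the `hdfs dfs -ls -R` column layout (date in field 6, time in field 7, path last) is the standard Hadoop FS shell output:

```shell
#!/bin/sh
# Hypothetical sketch of the cron job suggested above: find *.tmp files in
# HDFS older than a cutoff and rename them to their final names.

# Reads `hdfs dfs -ls -R` output on stdin and prints the paths of *.tmp
# files whose modification stamp ("YYYY-MM-DD HH:MM") sorts before the
# cutoff given as $1 (same format; ISO dates compare correctly as strings).
old_tmp_files() {
    awk -v cutoff="$1" '$NF ~ /\.tmp$/ && ($6 " " $7) < cutoff { print $NF }'
}

# Intended daily cron usage (requires GNU date and an HDFS client):
# cutoff=$(date -d '24 hours ago' '+%Y-%m-%d %H:%M')
# hdfs dfs -ls -R /flume/events | old_tmp_files "$cutoff" | \
#     while read -r f; do hdfs dfs -mv "$f" "${f%.tmp}"; done
```

Whether renaming (versus alerting a human) is safe depends on being sure the writing Flume agent is really gone, which is why a one-size-fits-all default is hard.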
It's pretty hard to know what the right thing to do is in all cases, so Flume
doesn't currently attempt to do that automatically.
> Flume leaves .tmp files in HDFS (unclosed?) after NameNode goes down
> --------------------------------------------------------------------
>
> Key: FLUME-2055
> URL: https://issues.apache.org/jira/browse/FLUME-2055
> Project: Flume
> Issue Type: Bug
> Components: Sinks+Sources
> Affects Versions: v1.3.0
> Reporter: Hari Sekhon
> Assignee: Ted Malaska
>
> NameNode was restarted while Flume was still running, resulting in .tmp files
> left in HDFS that weren't cleaned up that subsequently broke MapReduce with
> an error that implied the file wasn't closed properly:
> ERROR org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException as:hdfs (auth:SIMPLE) cause:java.io.IOException: Cannot obtain block length for LocatedBlock{BP-1974494376-X.X.X.X
> Here is the Flume exception:
> ERROR [pool-9-thread-2] (org.apache.flume.source.AvroSource.appendBatch:302) - Avro source avro_source: Unable to process event batch. Exception follows.
> org.apache.flume.ChannelException: Unable to put batch on required channel: org.apache.flume.channel.MemoryChannel{name: hdfs_channel}
> at org.apache.flume.channel.ChannelProcessor.processEventBatch(ChannelProcessor.java:200)
> at org.apache.flume.source.AvroSource.appendBatch(AvroSource.java:300)
> at sun.reflect.GeneratedMethodAccessor7.invoke(Unknown Source)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> at java.lang.reflect.Method.invoke(Method.java:597)
> at org.apache.avro.ipc.specific.SpecificResponder.respond(SpecificResponder.java:88)
> at org.apache.avro.ipc.Responder.respond(Responder.java:149)
> at org.apache.avro.ipc.NettyServer$NettyServerAvroHandler.messageReceived(NettyServer.java:188)
> at org.jboss.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:75)
> at org.apache.avro.ipc.NettyServer$NettyServerAvroHandler.handleUpstream(NettyServer.java:173)
> at org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
> at org.jboss.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:792)
> at org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:296)
> at org.jboss.netty.handler.codec.frame.FrameDecoder.unfoldAndFireMessageReceived(FrameDecoder.java:321)
> at org.jboss.netty.handler.codec.frame.FrameDecoder.callDecode(FrameDecoder.java:303)
> at org.jboss.netty.handler.codec.frame.FrameDecoder.messageReceived(FrameDecoder.java:220)
> at org.jboss.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:75)
> at org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
> at org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:559)
> at org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:268)
> at org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:255)
> at org.jboss.netty.channel.socket.nio.NioWorker.read(NioWorker.java:94)
> at org.jboss.netty.channel.socket.nio.AbstractNioWorker.processSelectedKeys(AbstractNioWorker.java:364)
> at org.jboss.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:238)
> at org.jboss.netty.channel.socket.nio.NioWorker.run(NioWorker.java:38)
> at org.jboss.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42)
> at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
> at java.lang.Thread.run(Thread.java:662)
> Caused by: org.apache.flume.ChannelException: Space for commit to queue couldn't be acquired Sinks are likely not keeping up with sources, or the buffer size is too tight
> at org.apache.flume.channel.MemoryChannel$MemoryTransaction.doCommit(MemoryChannel.java:128)
> at org.apache.flume.channel.BasicTransactionSemantics.commit(BasicTransactionSemantics.java:151)
> at org.apache.flume.channel.ChannelProcessor.processEventBatch(ChannelProcessor.java:192)
> ... 28 more
--
This message was sent by Atlassian JIRA
(v6.2#6252)