[
https://issues.apache.org/jira/browse/FLUME-2118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13841375#comment-13841375
]
Brock Noland commented on FLUME-2118:
-------------------------------------
Never mind my last comment. I do think this scenario occurs most often when
dual checkpoint is not enabled, because the slow remove() code path is hit
much more often during a full replay.
We'll take this forward in FLUME-2155.
TL;DR: Enable dual checkpoint and you'll see this less often.
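For anyone who wants to try that, here is a minimal sketch of a file channel
with dual checkpointing turned on. The agent name (agent1), channel name
(ch1), and both checkpoint paths are placeholders, not taken from the
reporter's setup; only the data directory matches the path seen in the quoted
log. The Flume user guide says backupCheckpointDir must differ from the
checkpoint directory and the data directories, so check the guide for your
version before relying on this.

```properties
# Hypothetical agent and channel names; adjust to your own configuration.
agent1.channels = ch1
agent1.channels.ch1.type = file
# Placeholder checkpoint location.
agent1.channels.ch1.checkpointDir = /data1/flume-checkpoint
# Data directory as it appears in the quoted replay log.
agent1.channels.ch1.dataDirs = /data2/flume-data
# Keep a backup copy of the checkpoint, so a damaged primary checkpoint is
# less likely to force a full replay of the data logs.
agent1.channels.ch1.useDualCheckpoints = true
# Placeholder backup location; must not be the checkpoint or a data directory.
agent1.channels.ch1.backupCheckpointDir = /data1/flume-checkpoint-backup
```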
> Occasional multi-hour pauses in file channel replay
> ---------------------------------------------------
>
> Key: FLUME-2118
> URL: https://issues.apache.org/jira/browse/FLUME-2118
> Project: Flume
> Issue Type: Bug
> Components: File Channel
> Affects Versions: v1.5.0
> Reporter: Juhani Connolly
> Attachments: flume-log, flume-thread-dump, gc-flume.log.20130702
>
>
> Sometimes during replay, immediately after an EOF of one log, the replay will
> pause for a long time.
> Here are two samples from this morning when we restarted our 3 aggregators
> and 2 of them hit this issue.
> 02 7 2013 03:06:30,089 INFO [lifecycleSupervisor-1-0]
> (org.apache.flume.channel.file.ReplayHandler.replayLog:292) - Read 2200000
> records
> 02 7 2013 03:06:30,179 INFO [lifecycleSupervisor-1-0]
> (org.apache.flume.channel.file.ReplayHandler.replayLog:292) - Read 2210000
> records
> 02 7 2013 03:06:30,241 INFO [lifecycleSupervisor-1-0]
> (org.apache.flume.channel.file.LogFile$SequentialReader.next:505) -
> Encountered EOF at 1623195625 in /data2/flume-data/log-1184
> 02 7 2013 06:23:27,629 INFO [lifecycleSupervisor-1-0]
> (org.apache.flume.channel.file.ReplayHandler.replayLog:292) - Read 2220000
> records
> 02 7 2013 06:23:28,641 INFO [lifecycleSupervisor-1-0]
> (org.apache.flume.channel.file.ReplayHandler.replayLog:292) - Read 2230000
> records
> 02 7 2013 06:23:29,162 INFO [lifecycleSupervisor-1-0]
> (org.apache.flume.channel.file.ReplayHandler.replayLog:292) - Read 2240000
> records
> 02 7 2013 06:23:30,118 INFO [lifecycleSupervisor-1-0]
> (org.apache.flume.channel.file.ReplayHandler.replayLog:292) - Read 2250000
> records
> 02 7 2013 06:23:30,750 INFO [lifecycleSupervisor-1-0]
> (org.apache.flume.channel.file.ReplayHandler.replayLog:292) - Read 2260000
> records
> 02 7 2013 08:03:00,942 INFO [lifecycleSupervisor-1-0]
> (org.apache.flume.channel.file.ReplayHandler.replayLog:292) - Read 2160000
> records
> 02 7 2013 08:03:01,055 INFO [lifecycleSupervisor-1-0]
> (org.apache.flume.channel.file.ReplayHandler.replayLog:292) - Read 2170000
> records
> 02 7 2013 08:03:01,168 INFO [lifecycleSupervisor-1-0]
> (org.apache.flume.channel.file.ReplayHandler.replayLog:292) - Read 2180000
> records
> 02 7 2013 08:03:01,181 INFO [lifecycleSupervisor-1-0]
> (org.apache.flume.channel.file.LogFile$SequentialReader.next:505) -
> Encountered EOF at 1623195640 in /data2/flume-data/log-1182
> 02 7 2013 14:45:55,302 INFO [lifecycleSupervisor-1-0]
> (org.apache.flume.channel.file.ReplayHandler.replayLog:292) - Read 2190000
> records
> 02 7 2013 14:45:56,282 INFO [lifecycleSupervisor-1-0]
> (org.apache.flume.channel.file.ReplayHandler.replayLog:292) - Read 2200000
> records
> 02 7 2013 14:45:57,084 INFO [lifecycleSupervisor-1-0]
> (org.apache.flume.channel.file.ReplayHandler.replayLog:292) - Read 2210000
> records
> 02 7 2013 14:45:59,043 INFO [lifecycleSupervisor-1-0]
> (org.apache.flume.channel.file.ReplayHandler.replayLog:292) - Read 2220000
> records
> I spent a little over an hour trying to track down the cause of this.
> Nothing suspicious shows up in Ganglia, and a cursory review of the code
> didn't turn up anything overly suspicious either. Owing to time limitations
> I can't dig further at this time.
> We run a version of Flume from somewhat before the current 1.4 release
> candidate (hash is eefefa941a60c0982f0957804be0cafb4d83e46e). There don't
> appear to have been any replay patches since then.
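Pauses like the ones in the quoted log can be found mechanically by comparing
consecutive timestamps. The following is only a rough sketch: the function
name find_pauses and the five-minute threshold are made up for illustration,
and the timestamp layout (day, month, year, then time with milliseconds) is
inferred from the excerpt above rather than from the agent's log4j
configuration.

```python
import re
from datetime import datetime, timedelta

# Timestamp layout assumed from the quoted excerpt, e.g. "02 7 2013 03:06:30,089"
# (day, month, year, then time with milliseconds).
TIMESTAMP = re.compile(r"(\d{1,2} \d{1,2} \d{4} \d{2}:\d{2}:\d{2},\d{3})")


def find_pauses(lines, threshold=timedelta(minutes=5)):
    """Return (previous, current, gap) for consecutive timestamps further apart than threshold."""
    pauses = []
    previous = None
    for line in lines:
        match = TIMESTAMP.search(line)
        if match is None:
            continue  # wrapped continuation lines carry no timestamp
        current = datetime.strptime(match.group(1), "%d %m %Y %H:%M:%S,%f")
        if previous is not None and current - previous > threshold:
            pauses.append((previous, current, current - previous))
        previous = current
    return pauses


if __name__ == "__main__":
    # Lines taken from the first sample in the report.
    sample = [
        "02 7 2013 03:06:30,241 INFO [lifecycleSupervisor-1-0]",
        "(org.apache.flume.channel.file.LogFile$SequentialReader.next:505) -",
        "Encountered EOF at 1623195625 in /data2/flume-data/log-1184",
        "02 7 2013 06:23:27,629 INFO [lifecycleSupervisor-1-0]",
    ]
    for start, end, gap in find_pauses(sample):
        print(f"{start} -> {end}: paused for {gap}")
```

Run it on one agent's log at a time. The excerpt above concatenates samples
from two machines, so the jump between them would otherwise be reported as a
pause.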