[
https://issues.apache.org/jira/browse/FLUME-2118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13697627#comment-13697627
]
Juhani Connolly commented on FLUME-2118:
----------------------------------------
It appears that the BadCheckpointException was only in the logs from a
different attempt at restarting(colleague moved the checkpoint thinking it
might fix the freeze and tried running without it). On the actual run that
worked all the way through, there was no Exception
Further, the "freeze" would occur every time after the EOF of the same
files(log-1184 and log-1182) so either relates to them or the following files.
> Occasional multi-hour pauses in file channel replay
> ---------------------------------------------------
>
> Key: FLUME-2118
> URL: https://issues.apache.org/jira/browse/FLUME-2118
> Project: Flume
> Issue Type: Bug
> Components: File Channel
> Affects Versions: v1.5.0
> Reporter: Juhani Connolly
> Attachments: gc-flume.log.20130702
>
>
> Sometimes during replay, immediately after an EOF of one log, the replay will
> pause for a long time.
> Here are two samples from this morning when we restarted our 3 aggregators
> and 2 of them hit this issue.
> 02 7 2013 03:06:30,089 INFO [lifecycleSupervisor-1-0]
> (org.apache.flume.channel.file.ReplayHandler.replayLog:292) - Read 2200000
> records
> 02 7 2013 03:06:30,179 INFO [lifecycleSupervisor-1-0]
> (org.apache.flume.channel.file.ReplayHandler.replayLog:292) - Read 2210000
> records
> 02 7 2013 03:06:30,241 INFO [lifecycleSupervisor-1-0]
> (org.apache.flume.channel.file.LogFile$SequentialReader.next:505) -
> Encountered EOF at 1623195625 in /data2/flume-data/log-1184
> 02 7 2013 06:23:27,629 INFO [lifecycleSupervisor-1-0]
> (org.apache.flume.channel.file.ReplayHandler.replayLog:292) - Read 2220000
> records
> 02 7 2013 06:23:28,641 INFO [lifecycleSupervisor-1-0]
> (org.apache.flume.channel.file.ReplayHandler.replayLog:292) - Read 2230000
> records
> 02 7 2013 06:23:29,162 INFO [lifecycleSupervisor-1-0]
> (org.apache.flume.channel.file.ReplayHandler.replayLog:292) - Read 2240000
> records
> 02 7 2013 06:23:30,118 INFO [lifecycleSupervisor-1-0]
> (org.apache.flume.channel.file.ReplayHandler.replayLog:292) - Read 2250000
> records
> 02 7 2013 06:23:30,750 INFO [lifecycleSupervisor-1-0]
> (org.apache.flume.channel.file.ReplayHandler.replayLog:292) - Read 2260000
> records
> 02 7 2013 08:03:00,942 INFO [lifecycleSupervisor-1-0]
> (org.apache.flume.channel.file.ReplayHandler.replayLog:292) - Read 2160000
> records
> 02 7 2013 08:03:01,055 INFO [lifecycleSupervisor-1-0]
> (org.apache.flume.channel.file.ReplayHandler.replayLog:292) - Read 2170000
> records
> 02 7 2013 08:03:01,168 INFO [lifecycleSupervisor-1-0]
> (org.apache.flume.channel.file.ReplayHandler.replayLog:292) - Read 2180000
> records
> 02 7 2013 08:03:01,181 INFO [lifecycleSupervisor-1-0]
> (org.apache.flume.channel.file.LogFile$SequentialReader.next:505) -
> Encountered EOF at 1623195640 in /data2/flume-data/log-1182
> 02 7 2013 14:45:55,302 INFO [lifecycleSupervisor-1-0]
> (org.apache.flume.channel.file.ReplayHandler.replayLog:292) - Read 2190000
> records
> 02 7 2013 14:45:56,282 INFO [lifecycleSupervisor-1-0]
> (org.apache.flume.channel.file.ReplayHandler.replayLog:292) - Read 2200000
> records
> 02 7 2013 14:45:57,084 INFO [lifecycleSupervisor-1-0]
> (org.apache.flume.channel.file.ReplayHandler.replayLog:292) - Read 2210000
> records
> 02 7 2013 14:45:59,043 INFO [lifecycleSupervisor-1-0]
> (org.apache.flume.channel.file.ReplayHandler.replayLog:292) - Read 2220000
> records
> I've tried for an hour and some to track down the cause of this. There's
> nothing suspicious turning up on ganglia, and a cursory review of the code
> didn't turn up anything overly suspicious. Owing to time limitations I can't
> dig further at this time.
> We run a version of flume from somewhat before the current 1.4 release
> candidate(hash is eefefa941a60c0982f0957804be0cafb4d83e46e) there doesn't
> appear to be any replay patches since then.
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira