[ 
https://issues.apache.org/jira/browse/FLUME-2118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13698082#comment-13698082
 ] 

Hari Shreedharan commented on FLUME-2118:
-----------------------------------------

Also, another experiment you could do is to use fast replay. To do that, first 
copy your checkpoint and data files, then delete the original checkpoint and 
the  backup (the backup that the channel creates, not the one you just copied 
off). Then set use-fast-replay=true in the channel configuration, and start the 
agent. You should give the agent as large a heap as you can - of the order of a 
20G or so. Let me know if this is faster.


We have a known issue where full replays can be painfully slow if the channel 
is full. Fast replay is meant to work around that issue, but at the cost of 
using more heap. 

Also, could you post the logs where replay was just starting. I am curious to 
know why the channel did not start up from the checkpoint (or if it did). It 
would be great if you could post the entire replay log, so we can see if we 
skipped the data files to the checkpoint location when it restarted (you can 
strip the "Read * records" events though).
                
> Occasional multi-hour pauses in file channel replay
> ---------------------------------------------------
>
>                 Key: FLUME-2118
>                 URL: https://issues.apache.org/jira/browse/FLUME-2118
>             Project: Flume
>          Issue Type: Bug
>          Components: File Channel
>    Affects Versions: v1.5.0
>            Reporter: Juhani Connolly
>         Attachments: gc-flume.log.20130702
>
>
> Sometimes during replay, immediately after an EOF of one log, the replay will 
> pause for a long time.
> Here are two samples from this morning when we restarted our 3 aggregators 
> and 2 of them hit this issue.
> 02 7 2013 03:06:30,089 INFO  [lifecycleSupervisor-1-0] 
> (org.apache.flume.channel.file.ReplayHandler.replayLog:292)  - Read 2200000 
> records
> 02 7 2013 03:06:30,179 INFO  [lifecycleSupervisor-1-0] 
> (org.apache.flume.channel.file.ReplayHandler.replayLog:292)  - Read 2210000 
> records
> 02 7 2013 03:06:30,241 INFO  [lifecycleSupervisor-1-0] 
> (org.apache.flume.channel.file.LogFile$SequentialReader.next:505)  - 
> Encountered EOF at 1623195625 in /data2/flume-data/log-1184
> 02 7 2013 06:23:27,629 INFO  [lifecycleSupervisor-1-0] 
> (org.apache.flume.channel.file.ReplayHandler.replayLog:292)  - Read 2220000 
> records
> 02 7 2013 06:23:28,641 INFO  [lifecycleSupervisor-1-0] 
> (org.apache.flume.channel.file.ReplayHandler.replayLog:292)  - Read 2230000 
> records
> 02 7 2013 06:23:29,162 INFO  [lifecycleSupervisor-1-0] 
> (org.apache.flume.channel.file.ReplayHandler.replayLog:292)  - Read 2240000 
> records
> 02 7 2013 06:23:30,118 INFO  [lifecycleSupervisor-1-0] 
> (org.apache.flume.channel.file.ReplayHandler.replayLog:292)  - Read 2250000 
> records
> 02 7 2013 06:23:30,750 INFO  [lifecycleSupervisor-1-0] 
> (org.apache.flume.channel.file.ReplayHandler.replayLog:292)  - Read 2260000 
> records
> 02 7 2013 08:03:00,942 INFO  [lifecycleSupervisor-1-0] 
> (org.apache.flume.channel.file.ReplayHandler.replayLog:292)  - Read 2160000 
> records
> 02 7 2013 08:03:01,055 INFO  [lifecycleSupervisor-1-0] 
> (org.apache.flume.channel.file.ReplayHandler.replayLog:292)  - Read 2170000 
> records
> 02 7 2013 08:03:01,168 INFO  [lifecycleSupervisor-1-0] 
> (org.apache.flume.channel.file.ReplayHandler.replayLog:292)  - Read 2180000 
> records
> 02 7 2013 08:03:01,181 INFO  [lifecycleSupervisor-1-0] 
> (org.apache.flume.channel.file.LogFile$SequentialReader.next:505)  - 
> Encountered EOF at 1623195640 in /data2/flume-data/log-1182
> 02 7 2013 14:45:55,302 INFO  [lifecycleSupervisor-1-0] 
> (org.apache.flume.channel.file.ReplayHandler.replayLog:292)  - Read 2190000 
> records
> 02 7 2013 14:45:56,282 INFO  [lifecycleSupervisor-1-0] 
> (org.apache.flume.channel.file.ReplayHandler.replayLog:292)  - Read 2200000 
> records
> 02 7 2013 14:45:57,084 INFO  [lifecycleSupervisor-1-0] 
> (org.apache.flume.channel.file.ReplayHandler.replayLog:292)  - Read 2210000 
> records
> 02 7 2013 14:45:59,043 INFO  [lifecycleSupervisor-1-0] 
> (org.apache.flume.channel.file.ReplayHandler.replayLog:292)  - Read 2220000 
> records
> I've tried for an hour and some to track down the cause of this. There's 
> nothing suspicious turning up on ganglia, and a cursory review of the code 
> didn't turn up anything overly suspicious. Owing to time limitations I can't 
> dig further at this time.
> We run a version of flume from somewhat before the current 1.4 release 
> candidate(hash is eefefa941a60c0982f0957804be0cafb4d83e46e) there doesn't 
> appear to be any replay patches since then.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

Reply via email to