[
https://issues.apache.org/jira/browse/FLUME-2118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13697620#comment-13697620
]
Juhani Connolly commented on FLUME-2118:
----------------------------------------
Hari:
My colleague apparently restarted this several times. From the look of things,
one of the checkpoints was corrupted and the channel was able to fail over to
the backup checkpoint. The actual capacity of the channel is 100 million, and
the reason it apparently stopped in the first place was that it got full (there
were some issues with our HDFS cluster). New data is written at a rate of
approx. 20k entries per second.
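For a sense of scale, the two figures above can be combined into a rough fill time. This is only a back-of-envelope sketch: it assumes the sink is fully stalled and the channel starts empty, and the function and constant names are made up for illustration.

```python
# Back-of-envelope: how long a stalled sink takes to fill the channel.
# Both figures come from the comment above; a fully stalled sink and an
# initially empty channel are simplifying assumptions.
CHANNEL_CAPACITY = 100_000_000   # events
INGEST_RATE = 20_000             # events per second (approximate)


def seconds_to_fill(capacity: int, rate: int, already_queued: int = 0) -> float:
    """Seconds until the channel is full if nothing is being taken."""
    return (capacity - already_queued) / rate


fill_seconds = seconds_to_fill(CHANNEL_CAPACITY, INGEST_RATE)
print(f"{fill_seconds:.0f} s (~{fill_seconds / 60:.0f} min)")
```

Under those assumptions, an HDFS outage of roughly an hour and a half would be enough to fill the channel.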
Here are the relevant stats from after the replay completed:
02 7 2013 06:23:52,661 INFO [lifecycleSupervisor-1-0]
(org.apache.flume.channel.file.ReplayHandler.replayLog:292) - Read 2510000
records
02 7 2013 06:23:53,622 INFO [lifecycleSupervisor-1-0]
(org.apache.flume.channel.file.ReplayHandler.replayLog:292) - Read 2520000
records
02 7 2013 06:23:53,677 INFO [lifecycleSupervisor-1-0]
(org.apache.flume.channel.file.LogFile$SequentialReader.next:505) -
Encountered EOF at 708801004 in /data2/flume-data/log-1185
02 7 2013 06:23:53,677 INFO [lifecycleSupervisor-1-0]
(org.apache.flume.channel.file.ReplayHandler.replayLog:338) - read: 2525314,
put: 154000, take: 154000, rollback: 0, commit: 728, skip: 2216586,
eventCount:307161
02 7 2013 06:23:53,678 ERROR [lifecycleSupervisor-1-0]
(org.apache.flume.channel.file.ReplayHandler.replayLog:372) - Pending takes
3000 exist after the end of replay. Duplicate messages will exist in
destination.
02 7 2013 06:23:53,679 INFO [lifecycleSupervisor-1-0]
(org.apache.flume.channel.file.Log.replay:465) - Rolling /data2/flume-data
02 7 2013 06:23:53,679 INFO [lifecycleSupervisor-1-0]
(org.apache.flume.channel.file.Log.roll:931) - Roll start /data2/flume-data
02 7 2013 06:23:53,692 INFO [lifecycleSupervisor-1-0]
(org.apache.flume.channel.file.LogFile$Writer.<init>:169) - Opened
/data2/flume-data/log-1186
02 7 2013 06:23:53,714 INFO [lifecycleSupervisor-1-0]
(org.apache.flume.channel.file.Log.roll:946) - Roll end
02 7 2013 06:23:53,714 INFO [lifecycleSupervisor-1-0]
(org.apache.flume.channel.file.EventQueueBackingStoreFile.beginCheckpoint:214)
- Start checkpoint for /data1/flume-check/checkpoint, elements to sync = 155610
02 7 2013 06:23:53,773 INFO [lifecycleSupervisor-1-0]
(org.apache.flume.channel.file.EventQueueBackingStoreFile.checkpoint:239) -
Updating checkpoint metadata: logWriteOrderID: 1378348203273, queueSize:
99997551, queueHead: 50234652
02 7 2013 06:23:53,869 INFO [lifecycleSupervisor-1-0]
(org.apache.flume.channel.file.EventQueueBackingStoreFile.startBackupThread:275)
- Attempting to back up checkpoint.
02 7 2013 06:23:53,933 INFO [lifecycleSupervisor-1-0]
(org.apache.flume.channel.file.Log.writeCheckpoint:1006) - Updated checkpoint
for file: /data2/flume-data/log-1186 position: 0 logWriteOrderID: 1378348203273
02 7 2013 06:23:53,933 INFO [lifecycleSupervisor-1-0]
(org.apache.flume.channel.file.LogFile$RandomReader.close:354) - Closing
RandomReader /data2/flume-data/log-1157
Right now I'm swamped with other work, so getting a profile is difficult, but
the GC does appear to have been busy during that timeframe. I'm attaching the
logs from around it.
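As a quick sanity check on the replay summary line in the log above, the counters can be parsed and compared. This sketch assumes that "read" is meant to equal the sum of put, take, rollback, commit and skip; that is an inference from the numbers, not something confirmed against the Flume source, and the helper name is hypothetical.

```python
import re

# Summary line copied from the replay log above (re-joined onto one line).
SUMMARY = ("read: 2525314, put: 154000, take: 154000, rollback: 0, "
           "commit: 728, skip: 2216586, eventCount:307161")


def parse_counters(line: str) -> dict:
    """Extract 'name: value' pairs; tolerates a missing space after the colon."""
    return {name: int(value) for name, value in re.findall(r"(\w+):\s*(\d+)", line)}


counters = parse_counters(SUMMARY)
# Assumption: every record read falls into exactly one of these buckets.
accounted = sum(counters[k] for k in ("put", "take", "rollback", "commit", "skip"))
print(accounted == counters["read"])
```

If the check holds, no records went unaccounted for during this replay, which points away from the replay bookkeeping itself.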
> Occasional multi-hour pauses in file channel replay
> ---------------------------------------------------
>
> Key: FLUME-2118
> URL: https://issues.apache.org/jira/browse/FLUME-2118
> Project: Flume
> Issue Type: Bug
> Components: File Channel
> Affects Versions: v1.5.0
> Reporter: Juhani Connolly
>
> Sometimes during replay, immediately after an EOF of one log, the replay will
> pause for a long time.
> Here are two samples from this morning when we restarted our 3 aggregators
> and 2 of them hit this issue.
> 02 7 2013 03:06:30,089 INFO [lifecycleSupervisor-1-0]
> (org.apache.flume.channel.file.ReplayHandler.replayLog:292) - Read 2200000
> records
> 02 7 2013 03:06:30,179 INFO [lifecycleSupervisor-1-0]
> (org.apache.flume.channel.file.ReplayHandler.replayLog:292) - Read 2210000
> records
> 02 7 2013 03:06:30,241 INFO [lifecycleSupervisor-1-0]
> (org.apache.flume.channel.file.LogFile$SequentialReader.next:505) -
> Encountered EOF at 1623195625 in /data2/flume-data/log-1184
> 02 7 2013 06:23:27,629 INFO [lifecycleSupervisor-1-0]
> (org.apache.flume.channel.file.ReplayHandler.replayLog:292) - Read 2220000
> records
> 02 7 2013 06:23:28,641 INFO [lifecycleSupervisor-1-0]
> (org.apache.flume.channel.file.ReplayHandler.replayLog:292) - Read 2230000
> records
> 02 7 2013 06:23:29,162 INFO [lifecycleSupervisor-1-0]
> (org.apache.flume.channel.file.ReplayHandler.replayLog:292) - Read 2240000
> records
> 02 7 2013 06:23:30,118 INFO [lifecycleSupervisor-1-0]
> (org.apache.flume.channel.file.ReplayHandler.replayLog:292) - Read 2250000
> records
> 02 7 2013 06:23:30,750 INFO [lifecycleSupervisor-1-0]
> (org.apache.flume.channel.file.ReplayHandler.replayLog:292) - Read 2260000
> records
> 02 7 2013 08:03:00,942 INFO [lifecycleSupervisor-1-0]
> (org.apache.flume.channel.file.ReplayHandler.replayLog:292) - Read 2160000
> records
> 02 7 2013 08:03:01,055 INFO [lifecycleSupervisor-1-0]
> (org.apache.flume.channel.file.ReplayHandler.replayLog:292) - Read 2170000
> records
> 02 7 2013 08:03:01,168 INFO [lifecycleSupervisor-1-0]
> (org.apache.flume.channel.file.ReplayHandler.replayLog:292) - Read 2180000
> records
> 02 7 2013 08:03:01,181 INFO [lifecycleSupervisor-1-0]
> (org.apache.flume.channel.file.LogFile$SequentialReader.next:505) -
> Encountered EOF at 1623195640 in /data2/flume-data/log-1182
> 02 7 2013 14:45:55,302 INFO [lifecycleSupervisor-1-0]
> (org.apache.flume.channel.file.ReplayHandler.replayLog:292) - Read 2190000
> records
> 02 7 2013 14:45:56,282 INFO [lifecycleSupervisor-1-0]
> (org.apache.flume.channel.file.ReplayHandler.replayLog:292) - Read 2200000
> records
> 02 7 2013 14:45:57,084 INFO [lifecycleSupervisor-1-0]
> (org.apache.flume.channel.file.ReplayHandler.replayLog:292) - Read 2210000
> records
> 02 7 2013 14:45:59,043 INFO [lifecycleSupervisor-1-0]
> (org.apache.flume.channel.file.ReplayHandler.replayLog:292) - Read 2220000
> records
> I've spent a bit over an hour trying to track down the cause of this. Nothing
> suspicious is turning up on Ganglia, and a cursory review of the code didn't
> turn up anything overly suspicious either. Owing to time limitations I can't
> dig further at this time.
> We run a version of Flume from somewhat before the current 1.4 release
> candidate (hash is eefefa941a60c0982f0957804be0cafb4d83e46e); there don't
> appear to be any replay patches since then.
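The pauses in the quoted log excerpts can also be located mechanically by scanning for large jumps between consecutive timestamps. The following is a minimal sketch, not part of Flume: the function name and the five-minute threshold are arbitrary choices, and the timestamp layout is assumed to be day, month, year as it appears in the lines above. Each aggregator's log should be scanned separately, since mixing hosts would produce spurious gaps.

```python
import re
from datetime import datetime, timedelta

# Matches timestamps such as "02 7 2013 03:06:30,241" anywhere in a line,
# so a leading "> " quote marker does not matter.
TS = re.compile(r"(\d{2} \d{1,2} \d{4} \d{2}:\d{2}:\d{2},\d{3})")


def find_pauses(lines, threshold=timedelta(minutes=5)):
    """Return (previous, current, gap) for consecutive timestamps more than threshold apart."""
    pauses, prev = [], None
    for line in lines:
        match = TS.search(line)
        if not match:
            continue  # wrapped continuation lines carry no timestamp
        ts = datetime.strptime(match.group(1), "%d %m %Y %H:%M:%S,%f")
        if prev is not None and ts - prev > threshold:
            pauses.append((prev, ts, ts - prev))
        prev = ts
    return pauses


# Three timestamped lines taken from the first quoted sample.
sample = [
    "> 02 7 2013 03:06:30,179 INFO [lifecycleSupervisor-1-0]",
    "> 02 7 2013 03:06:30,241 INFO [lifecycleSupervisor-1-0]",
    "> 02 7 2013 06:23:27,629 INFO [lifecycleSupervisor-1-0]",
]
for before, after, gap in find_pauses(sample):
    print(f"{before.time()} -> {after.time()}: paused {gap}")
```

Run against a full agent log, this lists every stall longer than the threshold, which makes it easier to line the pauses up with GC logs from the same period.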