[ 
https://issues.apache.org/jira/browse/FLUME-2118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13697620#comment-13697620
 ] 

Juhani Connolly commented on FLUME-2118:
----------------------------------------

Hari:

My colleague apparently restarted this several times, but from the look of 
things one of the checkpoints was corrupted and it was able to fail over to the 
backup checkpoint. The actual capacity of the channel is 100 million, and the 
reason it apparently stopped in the first place was that it got full (there 
were some issues with our HDFS cluster). New data is written at a rate of 
approx. 20k entries per second.

Here are the relevant stats from after the replay completed:

02 7 2013 06:23:52,661 INFO  [lifecycleSupervisor-1-0] 
(org.apache.flume.channel.file.ReplayHandler.replayLog:292)  - Read 2510000 
records
02 7 2013 06:23:53,622 INFO  [lifecycleSupervisor-1-0] 
(org.apache.flume.channel.file.ReplayHandler.replayLog:292)  - Read 2520000 
records
02 7 2013 06:23:53,677 INFO  [lifecycleSupervisor-1-0] 
(org.apache.flume.channel.file.LogFile$SequentialReader.next:505)  - 
Encountered EOF at 708801004 in /data2/flume-data/log-1185
02 7 2013 06:23:53,677 INFO  [lifecycleSupervisor-1-0] 
(org.apache.flume.channel.file.ReplayHandler.replayLog:338)  - read: 2525314, 
put: 154000, take: 154000, rollback: 0, commit: 728, skip: 2216586, 
eventCount:307161
02 7 2013 06:23:53,678 ERROR [lifecycleSupervisor-1-0] 
(org.apache.flume.channel.file.ReplayHandler.replayLog:372)  - Pending takes 
3000 exist after the end of replay. Duplicate messages will exist in 
destination.
02 7 2013 06:23:53,679 INFO  [lifecycleSupervisor-1-0] 
(org.apache.flume.channel.file.Log.replay:465)  - Rolling /data2/flume-data
02 7 2013 06:23:53,679 INFO  [lifecycleSupervisor-1-0] 
(org.apache.flume.channel.file.Log.roll:931)  - Roll start /data2/flume-data
02 7 2013 06:23:53,692 INFO  [lifecycleSupervisor-1-0] 
(org.apache.flume.channel.file.LogFile$Writer.<init>:169)  - Opened 
/data2/flume-data/log-1186
02 7 2013 06:23:53,714 INFO  [lifecycleSupervisor-1-0] 
(org.apache.flume.channel.file.Log.roll:946)  - Roll end
02 7 2013 06:23:53,714 INFO  [lifecycleSupervisor-1-0] 
(org.apache.flume.channel.file.EventQueueBackingStoreFile.beginCheckpoint:214)  
- Start checkpoint for /data1/flume-check/checkpoint, elements to sync = 155610
02 7 2013 06:23:53,773 INFO  [lifecycleSupervisor-1-0] 
(org.apache.flume.channel.file.EventQueueBackingStoreFile.checkpoint:239)  - 
Updating checkpoint metadata: logWriteOrderID: 1378348203273, queueSize: 
99997551, queueHead: 50234652
02 7 2013 06:23:53,869 INFO  [lifecycleSupervisor-1-0] 
(org.apache.flume.channel.file.EventQueueBackingStoreFile.startBackupThread:275)
  - Attempting to back up checkpoint.
02 7 2013 06:23:53,933 INFO  [lifecycleSupervisor-1-0] 
(org.apache.flume.channel.file.Log.writeCheckpoint:1006)  - Updated checkpoint 
for file: /data2/flume-data/log-1186 position: 0 logWriteOrderID: 1378348203273
02 7 2013 06:23:53,933 INFO  [lifecycleSupervisor-1-0] 
(org.apache.flume.channel.file.LogFile$RandomReader.close:354)  - Closing 
RandomReader /data2/flume-data/log-1157


Right now I'm swamped with other work, so getting a profile is difficult, 
but it does appear the GC was busy during that timeframe. I'm attaching logs 
from around it.
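
For anyone who wants to check their own agent logs for the same stalls, here is 
a minimal sketch that flags large gaps between consecutive log timestamps. It is 
not part of Flume: the function name find_gaps, the regular expression, and the 
five-minute threshold are my own assumptions, and it assumes the timestamp 
layout seen in the logs here (day, numeric month, year, time with 
milliseconds), optionally preceded by a "> " quote marker.

```python
import re
from datetime import datetime, timedelta

# Assumed layout, e.g. "02 7 2013 06:23:52,661", optionally quoted with "> ".
TS_RE = re.compile(r"^(?:> )?(\d{1,2} \d{1,2} \d{4} \d{2}:\d{2}:\d{2},\d{3})")


def find_gaps(lines, threshold=timedelta(minutes=5)):
    """Return (previous, current, gap) tuples for consecutive timestamps
    that are further apart than `threshold` (hypothetical helper)."""
    gaps = []
    prev = None
    for line in lines:
        match = TS_RE.match(line)
        if not match:
            continue  # continuation of a wrapped log line, or unrelated text
        ts = datetime.strptime(match.group(1), "%d %m %Y %H:%M:%S,%f")
        if prev is not None and ts - prev > threshold:
            gaps.append((prev, ts, ts - prev))
        prev = ts
    return gaps


if __name__ == "__main__":
    # Timestamps taken from the first sample in the issue description.
    sample = [
        "02 7 2013 03:06:30,179 INFO  [lifecycleSupervisor-1-0] ",
        "02 7 2013 03:06:30,241 INFO  [lifecycleSupervisor-1-0] ",
        "02 7 2013 06:23:27,629 INFO  [lifecycleSupervisor-1-0] ",
        "02 7 2013 06:23:28,641 INFO  [lifecycleSupervisor-1-0] ",
    ]
    for prev, cur, gap in find_gaps(sample):
        print(f"{prev} -> {cur}: no log output for {gap}")
```

To run it against a real file, pass the open file object instead of the sample 
list, e.g. find_gaps(open("flume.log")), where flume.log is a placeholder path. 
Note that it only looks at timestamps, so logs from several hosts pasted 
together will show spurious gaps.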
                
> Occasional multi-hour pauses in file channel replay
> ---------------------------------------------------
>
>                 Key: FLUME-2118
>                 URL: https://issues.apache.org/jira/browse/FLUME-2118
>             Project: Flume
>          Issue Type: Bug
>          Components: File Channel
>    Affects Versions: v1.5.0
>            Reporter: Juhani Connolly
>
> Sometimes during replay, immediately after an EOF of one log, the replay will 
> pause for a long time.
> Here are two samples from this morning when we restarted our 3 aggregators 
> and 2 of them hit this issue.
> 02 7 2013 03:06:30,089 INFO  [lifecycleSupervisor-1-0] 
> (org.apache.flume.channel.file.ReplayHandler.replayLog:292)  - Read 2200000 
> records
> 02 7 2013 03:06:30,179 INFO  [lifecycleSupervisor-1-0] 
> (org.apache.flume.channel.file.ReplayHandler.replayLog:292)  - Read 2210000 
> records
> 02 7 2013 03:06:30,241 INFO  [lifecycleSupervisor-1-0] 
> (org.apache.flume.channel.file.LogFile$SequentialReader.next:505)  - 
> Encountered EOF at 1623195625 in /data2/flume-data/log-1184
> 02 7 2013 06:23:27,629 INFO  [lifecycleSupervisor-1-0] 
> (org.apache.flume.channel.file.ReplayHandler.replayLog:292)  - Read 2220000 
> records
> 02 7 2013 06:23:28,641 INFO  [lifecycleSupervisor-1-0] 
> (org.apache.flume.channel.file.ReplayHandler.replayLog:292)  - Read 2230000 
> records
> 02 7 2013 06:23:29,162 INFO  [lifecycleSupervisor-1-0] 
> (org.apache.flume.channel.file.ReplayHandler.replayLog:292)  - Read 2240000 
> records
> 02 7 2013 06:23:30,118 INFO  [lifecycleSupervisor-1-0] 
> (org.apache.flume.channel.file.ReplayHandler.replayLog:292)  - Read 2250000 
> records
> 02 7 2013 06:23:30,750 INFO  [lifecycleSupervisor-1-0] 
> (org.apache.flume.channel.file.ReplayHandler.replayLog:292)  - Read 2260000 
> records
> 02 7 2013 08:03:00,942 INFO  [lifecycleSupervisor-1-0] 
> (org.apache.flume.channel.file.ReplayHandler.replayLog:292)  - Read 2160000 
> records
> 02 7 2013 08:03:01,055 INFO  [lifecycleSupervisor-1-0] 
> (org.apache.flume.channel.file.ReplayHandler.replayLog:292)  - Read 2170000 
> records
> 02 7 2013 08:03:01,168 INFO  [lifecycleSupervisor-1-0] 
> (org.apache.flume.channel.file.ReplayHandler.replayLog:292)  - Read 2180000 
> records
> 02 7 2013 08:03:01,181 INFO  [lifecycleSupervisor-1-0] 
> (org.apache.flume.channel.file.LogFile$SequentialReader.next:505)  - 
> Encountered EOF at 1623195640 in /data2/flume-data/log-1182
> 02 7 2013 14:45:55,302 INFO  [lifecycleSupervisor-1-0] 
> (org.apache.flume.channel.file.ReplayHandler.replayLog:292)  - Read 2190000 
> records
> 02 7 2013 14:45:56,282 INFO  [lifecycleSupervisor-1-0] 
> (org.apache.flume.channel.file.ReplayHandler.replayLog:292)  - Read 2200000 
> records
> 02 7 2013 14:45:57,084 INFO  [lifecycleSupervisor-1-0] 
> (org.apache.flume.channel.file.ReplayHandler.replayLog:292)  - Read 2210000 
> records
> 02 7 2013 14:45:59,043 INFO  [lifecycleSupervisor-1-0] 
> (org.apache.flume.channel.file.ReplayHandler.replayLog:292)  - Read 2220000 
> records
> I spent a little over an hour trying to track down the cause of this. Nothing 
> suspicious turned up in Ganglia, and a cursory review of the code didn't 
> turn up anything overly suspicious either. Owing to time limitations I can't 
> dig further at this time.
> We run a version of Flume from somewhat before the current 1.4 release 
> candidate (hash is eefefa941a60c0982f0957804be0cafb4d83e46e); there don't 
> appear to be any replay patches since then.
