[ https://issues.apache.org/jira/browse/FLUME-2118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13699831#comment-13699831 ]
Hari Shreedharan commented on FLUME-2118:
-----------------------------------------
I am fairly sure this is the thread that is causing issues:
"lifecycleSupervisor-1-0" prio=10 tid=0x00007fea505f7000 nid=0x279e runnable
[0x00007fe84240d000]
java.lang.Thread.State: RUNNABLE
at
org.apache.flume.channel.file.FlumeEventQueue.remove(FlumeEventQueue.java:195)
- locked <0x00007fe84d0007b8> (a
org.apache.flume.channel.file.FlumeEventQueue)
at
org.apache.flume.channel.file.ReplayHandler.processCommit(ReplayHandler.java:404)
at
org.apache.flume.channel.file.ReplayHandler.replayLog(ReplayHandler.java:327)
at org.apache.flume.channel.file.Log.doReplay(Log.java:503)
at org.apache.flume.channel.file.Log.replay(Log.java:430)
at org.apache.flume.channel.file.FileChannel.start(FileChannel.java:301)
- locked <0x00007fe84d000940> (a
org.apache.flume.channel.file.FileChannel)
at
org.apache.flume.lifecycle.LifecycleSupervisor$MonitorRunnable.run(LifecycleSupervisor.java:251)
- locked <0x00007fe84d000940> (a
org.apache.flume.channel.file.FileChannel)
at
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
at
java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:317)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:150)
at
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$101(ScheduledThreadPoolExecutor.java:98)
at
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.runPeriodic(ScheduledThreadPoolExecutor.java:181)
at
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:205)
at
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:619)
It is a known fact that FlumeEventQueue#remove() can be terribly slow when the channel holds too many events. Note that the thread is not actually stuck or waiting; it is still working. Each remove() performs a linear scan of a memory-mapped buffer, removes the matching element, and then shifts every subsequent element forward by one slot.
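To make the cost concrete, here is a simplified model of that access pattern. This is not the actual FlumeEventQueue code (the real queue lives in a memory-mapped buffer), but the cost profile is the same: every remove() is O(n), so a replay that removes n committed takes degrades to roughly O(n^2) overall.
{code:java}
// Simplified model only -- not the real FlumeEventQueue implementation.
final class NaiveQueueModel {
    private final long[] slots; // stand-in for the mmapped backing buffer
    private int size;

    NaiveQueueModel(int capacity) {
        slots = new long[capacity];
    }

    void add(long pointer) {
        slots[size++] = pointer;
    }

    // Linear scan for the pointer, then shift every later element one
    // slot forward -- the O(n)-per-call behavior described above.
    boolean remove(long pointer) {
        for (int i = 0; i < size; i++) {
            if (slots[i] == pointer) {
                System.arraycopy(slots, i + 1, slots, i, size - i - 1);
                size--;
                return true;
            }
        }
        return false;
    }
}
{code}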
I believe a good alternative would be a reverse index (or one built at startup, which should pay for itself in most cases), plus a "replay" mode in which the readjustment (shifting the remaining elements forward) is deferred until after the replay is done. I can take a look at this if I get some time, but for now I am completely swamped and won't be able to get to it for a while.
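A hedged sketch of that alternative follows; the names (IndexedReplayQueue, removeDuringReplay, finishReplay) are illustrative, not Flume APIs. A HashMap from event pointer to slot makes each removal during replay O(1) by tombstoning the slot instead of shifting the tail, and a single linear compaction pass runs once the replay completes.
{code:java}
import java.util.HashMap;
import java.util.Map;

// Sketch of the proposed reverse-index + deferred-compaction approach.
final class IndexedReplayQueue {
    private static final long TOMBSTONE = -1L;
    private final long[] slots;
    private final Map<Long, Integer> index = new HashMap<Long, Integer>();
    private int size;

    // Build the reverse index once at startup: O(n), paid a single time.
    IndexedReplayQueue(long[] initial) {
        slots = initial.clone();
        size = initial.length;
        for (int i = 0; i < size; i++) {
            index.put(slots[i], i);
        }
    }

    // O(1) remove during replay: look the slot up in the index and
    // tombstone it instead of shifting everything after it.
    boolean removeDuringReplay(long pointer) {
        Integer slot = index.remove(pointer);
        if (slot == null) {
            return false;
        }
        slots[slot] = TOMBSTONE;
        return true;
    }

    // After replay completes, compact the tombstones out in one pass.
    void finishReplay() {
        int write = 0;
        for (int read = 0; read < size; read++) {
            if (slots[read] != TOMBSTONE) {
                slots[write++] = slots[read];
            }
        }
        size = write;
    }
}
{code}
The trade-off is the extra memory for the index and one full compaction pass at the end, which is linear rather than quadratic in the number of replayed removals.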
> Occasional multi-hour pauses in file channel replay
> ---------------------------------------------------
>
> Key: FLUME-2118
> URL: https://issues.apache.org/jira/browse/FLUME-2118
> Project: Flume
> Issue Type: Bug
> Components: File Channel
> Affects Versions: v1.5.0
> Reporter: Juhani Connolly
> Attachments: flume-log, flume-thread-dump, gc-flume.log.20130702
>
>
> Sometimes during replay, immediately after an EOF of one log, the replay will pause for a long time.
> Here are two samples from this morning when we restarted our 3 aggregators and 2 of them hit this issue.
> 02 7 2013 03:06:30,089 INFO [lifecycleSupervisor-1-0] (org.apache.flume.channel.file.ReplayHandler.replayLog:292) - Read 2200000 records
> 02 7 2013 03:06:30,179 INFO [lifecycleSupervisor-1-0] (org.apache.flume.channel.file.ReplayHandler.replayLog:292) - Read 2210000 records
> 02 7 2013 03:06:30,241 INFO [lifecycleSupervisor-1-0] (org.apache.flume.channel.file.LogFile$SequentialReader.next:505) - Encountered EOF at 1623195625 in /data2/flume-data/log-1184
> 02 7 2013 06:23:27,629 INFO [lifecycleSupervisor-1-0] (org.apache.flume.channel.file.ReplayHandler.replayLog:292) - Read 2220000 records
> 02 7 2013 06:23:28,641 INFO [lifecycleSupervisor-1-0] (org.apache.flume.channel.file.ReplayHandler.replayLog:292) - Read 2230000 records
> 02 7 2013 06:23:29,162 INFO [lifecycleSupervisor-1-0] (org.apache.flume.channel.file.ReplayHandler.replayLog:292) - Read 2240000 records
> 02 7 2013 06:23:30,118 INFO [lifecycleSupervisor-1-0] (org.apache.flume.channel.file.ReplayHandler.replayLog:292) - Read 2250000 records
> 02 7 2013 06:23:30,750 INFO [lifecycleSupervisor-1-0] (org.apache.flume.channel.file.ReplayHandler.replayLog:292) - Read 2260000 records
> 02 7 2013 08:03:00,942 INFO [lifecycleSupervisor-1-0] (org.apache.flume.channel.file.ReplayHandler.replayLog:292) - Read 2160000 records
> 02 7 2013 08:03:01,055 INFO [lifecycleSupervisor-1-0] (org.apache.flume.channel.file.ReplayHandler.replayLog:292) - Read 2170000 records
> 02 7 2013 08:03:01,168 INFO [lifecycleSupervisor-1-0] (org.apache.flume.channel.file.ReplayHandler.replayLog:292) - Read 2180000 records
> 02 7 2013 08:03:01,181 INFO [lifecycleSupervisor-1-0] (org.apache.flume.channel.file.LogFile$SequentialReader.next:505) - Encountered EOF at 1623195640 in /data2/flume-data/log-1182
> 02 7 2013 14:45:55,302 INFO [lifecycleSupervisor-1-0] (org.apache.flume.channel.file.ReplayHandler.replayLog:292) - Read 2190000 records
> 02 7 2013 14:45:56,282 INFO [lifecycleSupervisor-1-0] (org.apache.flume.channel.file.ReplayHandler.replayLog:292) - Read 2200000 records
> 02 7 2013 14:45:57,084 INFO [lifecycleSupervisor-1-0] (org.apache.flume.channel.file.ReplayHandler.replayLog:292) - Read 2210000 records
> 02 7 2013 14:45:59,043 INFO [lifecycleSupervisor-1-0] (org.apache.flume.channel.file.ReplayHandler.replayLog:292) - Read 2220000 records
> I've spent an hour or so trying to track down the cause of this. Nothing suspicious turned up in ganglia, and a cursory review of the code didn't reveal anything obvious either. Owing to time limitations I can't dig further at this time.
> We run a version of Flume from somewhat before the current 1.4 release candidate (hash eefefa941a60c0982f0957804be0cafb4d83e46e); there don't appear to have been any replay-related patches since then.