Hey

We were having problems with our collectors dying (we always had errors in
the logs). We recently applied the patch from
https://issues.apache.org/jira/browse/Flume-808 and modified the RollSink
TriggerThread so that it does not interrupt the append job when acquiring its
lock. Our collectors have now been up for a few days without problems.
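
Roughly, the change looks something like the sketch below. This is not the
actual RollSink source; class, field, and method names are illustrative only,
but it shows the idea: the roll trigger thread now waits for the lock instead
of interrupting a blocked append.

    import java.util.concurrent.locks.ReentrantReadWriteLock;

    // Sketch: the appending thread and the roll TriggerThread share a lock.
    class RollSinkSketch {
      private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();

      // Called by the driver thread for every event.
      void append(Object event) throws InterruptedException {
        lock.readLock().lockInterruptibly();
        try {
          // write the event to the current sub-sink
        } finally {
          lock.readLock().unlock();
        }
      }

      // Called by the roll TriggerThread when the roll interval expires.
      void rotate() {
        // Before: the trigger thread interrupted a thread blocked in append()
        // so the roll could proceed, which surfaced as
        // "Blocked append interrupted by rotation event".
        // After the change: it simply waits for the write lock instead of
        // interrupting the in-flight append.
        lock.writeLock().lock();
        try {
          // close the old output file and open the next one
        } finally {
          lock.writeLock().unlock();
        }
      }
    }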

On Thu, Oct 27, 2011 at 8:10 AM, Eran Kutner <e...@gigya.com> wrote:

> I just grepped a few days of logs and I don't see this error. It seems to
> be correlated with higher load on the HDFS servers (like when map/reduce
> jobs are running).
> When it happens, the agents fail to connect to the collectors, but I don't
> see any errors in the collector logs. They just hang, while other virtual
> collectors on the same server continue to work.
>
> -eran
>
>
>
> On Thu, Oct 27, 2011 at 06:39, Eric Sammer <esam...@cloudera.com> wrote:
>
>> It's almost certainly the issue Mingjie mentioned. There's a race
>> condition in the rolling that's plagued a few people. I'm heads down
>> on NG but I think someone (probably Mingjie :)) was working on this.
>>
>>
>>
>> On Oct 26, 2011, at 1:59 PM, Mingjie Lai <mjla...@gmail.com> wrote:
>>
>> >
>> > Quite a few people have mentioned on the list recently that the
>> > combination of RollSink + escapedCustomDfs causes issues. You may have
>> > seen logs like these:
>> >
>> > 2011-10-17 17:30:07,190 [logicalNode collector0_log_dir-19] INFO com.cloudera.flume.core.connector.DirectDriver - Connector logicalNode collector0_log_dir-19 exited with error: Blocked append interrupted by rotation event
>> > java.lang.InterruptedException: Blocked append interrupted by rotation event
>> >        at com.cloudera.flume.handlers.rolling.RollSink.append(RollSink.java:209)
>> >
>> >
>> > > 1500-2000 events per second
>> >
>> > It's not really a huge amount of data. Flume is expected to be able to
>> handle it.
>> >
>> > Not sure anyone is looking at it. Sorry.
>> >
>> > Mingjie
>> >
>> > On 10/23/2011 09:07 AM, Eran Kutner wrote:
>> >> Hi,
>> >> I'm having a problem where Flume collectors occasionally stop working
>> >> under heavy load.
>> >> I'm writing something like 1500-2000 events per second to my collectors,
>> >> and occasionally they will just stop working. Nothing is written to the
>> >> log; the only indication that this is happening is that I see 0 messages
>> >> being delivered on the Flume stats web page and events start piling up
>> >> in the agents. Restarting the service solves the problem for a while
>> >> (anything from a few minutes to a few days).
>> >> An interesting thing to note is that this seems to be load-related. It
>> >> used to happen a lot more, but then I split the collector into three
>> >> virtual nodes and balanced the traffic across them, and now it happens a
>> >> lot less. Also, when one virtual collector stops working, the others on
>> >> the same machine continue to work fine.
>> >>
>> >> My collector configuration looks like this:
>> >>
>> >> collectorSource(54001) | collector(600000) {
>> >>   escapedFormatDfs("hdfs://hadoop1-m1:8020/raw-events/%Y-%m-%d/",
>> >>     "events-%{rolltag}-f01-c1.snappy", seqfile("SnappyCodec")) };
>> >>
>> >> I'm using a 0.9.5 build from a few weeks ago.
>> >>
>> >> Any ideas what could be causing it?
>> >>
>> >> -eran
>> >>
>>
>
>


-- 
Thanks

Cameron Gandevia
