Hey,

We were having problems with our collectors dying (we always had errors in the logs). We recently applied the patch from https://issues.apache.org/jira/browse/Flume-808 and modified the RollSink TriggerThread so that it does not interrupt the append thread when acquiring its lock. Our collectors have now been up for a few days without problems.
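Conceptually, the change works like the following sketch (this is illustrative code, not the actual Flume patch; the class name, counters, and lock layout are all made up for the example): appenders take a read lock, and the roll trigger waits on the write lock until in-flight appends finish, instead of interrupting them.

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Illustrative sketch only -- not Flume code. Appends share a read lock;
// the roll trigger takes the write lock, so it waits for in-flight appends
// rather than interrupting them.
public class RollSketch {
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    private int appends = 0;
    private int rolls = 0;

    public void append(String event) {
        lock.readLock().lock();      // many appenders may hold this at once
        try {
            appends++;               // stand-in for writing to the open file
        } finally {
            lock.readLock().unlock();
        }
    }

    public void roll() {
        lock.writeLock().lock();     // blocks until in-flight appends finish;
        try {                        // never interrupts them
            rolls++;                 // stand-in for closing/reopening the file
        } finally {
            lock.writeLock().unlock();
        }
    }

    public int getAppends() { return appends; }
    public int getRolls()   { return rolls; }

    public static void main(String[] args) throws InterruptedException {
        RollSketch sink = new RollSketch();
        Thread appender = new Thread(() -> {
            for (int i = 0; i < 1000; i++) sink.append("e" + i);
        });
        Thread roller = new Thread(() -> {
            for (int i = 0; i < 10; i++) sink.roll();
        });
        appender.start();
        roller.start();
        appender.join();
        roller.join();
        // prints "1000 appends, 10 rolls"
        System.out.println(sink.getAppends() + " appends, " + sink.getRolls() + " rolls");
    }
}
```

The trade-off is that a roll can be delayed by a slow append, but no append is ever killed mid-write.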
On Thu, Oct 27, 2011 at 8:10 AM, Eran Kutner <e...@gigya.com> wrote:

> Just grepped a few days of logs and I don't see this error. It seems to be
> correlated with higher load on the HDFS servers (like when map/reduce jobs
> are running).
> When it is happening the agents fail to connect to the collectors, but I
> don't see any errors in the collectors' logs. They just hang, while other
> virtual collectors on the same server continue to work.
>
> -eran
>
> On Thu, Oct 27, 2011 at 06:39, Eric Sammer <esam...@cloudera.com> wrote:
>
>> It's almost certainly the issue Mingjie mentioned. There's a race
>> condition in the rolling that's plagued a few people. I'm heads down
>> on NG, but I think someone (probably Mingjie :)) was working on this.
>>
>> On Oct 26, 2011, at 1:59 PM, Mingjie Lai <mjla...@gmail.com> wrote:
>>
>>> Quite a few people have mentioned on the list recently that the
>>> combination of RollSink + escapedCustomDfs causes issues. You may have
>>> seen logs like these:
>>>
>>> 2011-10-17 17:30:07,190 [logicalNode collector0_log_dir-19] INFO
>>> com.cloudera.flume.core.connector.DirectDriver - Connector logicalNode
>>> collector0_log_dir-19 exited with error: Blocked append interrupted by rotation event
>>> java.lang.InterruptedException: Blocked append interrupted by rotation event
>>>     at com.cloudera.flume.handlers.rolling.RollSink.append(RollSink.java:209)
>>>
>>>> 1500-2000 events per second
>>>
>>> It's not really a huge amount of data. Flume is expected to be able to
>>> handle it.
>>>
>>> Not sure anyone is looking at it. Sorry.
>>>
>>> Mingjie
>>>
>>> On 10/23/2011 09:07 AM, Eran Kutner wrote:
>>>> Hi,
>>>> I'm having a problem where Flume collectors occasionally stop working
>>>> under heavy load.
>>>> I'm writing something like 1500-2000 events per second to my collectors,
>>>> and occasionally they will just stop working. Nothing is written to the
>>>> log; the only indication that this is happening is that I see 0 messages
>>>> being delivered when looking at the Flume stats web page, and events
>>>> start piling up in the agents. Restarting the service solves the problem
>>>> for a while (anything from a few minutes to a few days).
>>>> An interesting thing to note is that this seems to be load related. It
>>>> used to happen a lot more, but then I split the collector into three
>>>> virtual nodes and balanced the traffic across them, and now it happens a
>>>> lot less. Also, while one virtual collector stops working, the others on
>>>> the same machine continue to work fine.
>>>>
>>>> My collector configuration looks like this:
>>>>
>>>> collectorSource(54001) | collector(600000) {
>>>>   escapedFormatDfs("hdfs://hadoop1-m1:8020/raw-events/%Y-%m-%d/",
>>>>     "events-%{rolltag}-f01-c1.snappy", seqfile("SnappyCodec"))
>>>> };
>>>>
>>>> I'm using 0.9.5, which I built a few weeks ago.
>>>>
>>>> Any ideas what can be causing it?
>>>>
>>>> -eran

--
Thanks

Cameron Gandevia
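[Editorial note: the InterruptedException in the quoted stack trace can be reproduced in miniature. The sketch below is not Flume code; the class and method names are invented for illustration. A thread blocked mid-append (here simulated with a sleep) is interrupted by another thread, exactly the pattern the old roll trigger exhibited.]

```java
// Illustrative repro of the failure mode -- not Flume code. The old trigger
// thread interrupted the appender when it wanted to rotate, which surfaced
// as InterruptedException ("Blocked append interrupted by rotation event").
public class InterruptDemo {

    // Returns true if the simulated append was cut short by an interrupt.
    static boolean rollInterruptsAppend() throws InterruptedException {
        final boolean[] interrupted = {false};
        Thread appender = new Thread(() -> {
            try {
                Thread.sleep(60_000);   // stand-in for an append blocked on a slow HDFS write
            } catch (InterruptedException e) {
                interrupted[0] = true;  // the append dies instead of completing
            }
        });
        appender.start();
        Thread.sleep(200);              // let the append start and block
        appender.interrupt();           // what the roll trigger effectively did
        appender.join();
        return interrupted[0];
    }

    public static void main(String[] args) throws InterruptedException {
        // prints "append interrupted: true"
        System.out.println("append interrupted: " + rollInterruptsAppend());
    }
}
```

This is why the fix described at the top of the thread (waiting for the lock instead of interrupting) makes the error disappear.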