[Please subscribe to new flume-user@incubator.apache.org list, bcc flume-u...@cloudera.org, cc flume-user@incubator.apache.org]
I've created a new FAQ entry about the causes of duplicates. Here's a link:
https://cwiki.apache.org/confluence/display/FLUME/Operations+FAQ#OperationsFAQ-Iamgettingalotofduplicatedeventdata.WhyisthishappeningandwhatcanIdotomakethisgoaway%3F

Some questions:

At the hour change, do the file names change? If an old file gets a new name, tailDir thinks it is a new file and rereads it.

If they do not, it sounds like we should add something to make it easier to tell whether a dupe was due to e2e retries or due to tail. Maybe a retry counter on an event or event group. I'll file an issue on this.

Jon.

On Mon, Jul 11, 2011 at 10:42 PM, Michael Jiang <it.mjji...@gmail.com> wrote:

> We have a web server which writes a new log file to a folder every 5
> minutes. taildir is used to upload logs into hdfs based on date and hour. We
> found strange behavior. At hour T, we found new data not just in the folder
> for T, but also in the folders before T, such as T-1, T-2, ... This seems to be
> some duplication of previous logs? So, my question is, how does taildir work? I
> guess a file in the folder should be tailed only once and the agent is able to
> detect a new file in the folder. Then how did this happen? I know that if not
> properly configured, ACK is lost or received before timeout, then the agent will
> request to resend data. Is there some good approach to tell if that is the
> case? Thanks!
>
> --Michael
>

--
// Jonathan Hsieh (shay)
// Software Engineer, Cloudera
// j...@cloudera.com
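
To make the rename scenario from the reply concrete, here is a minimal sketch of a directory watcher that identifies files by name only. It is not Flume's tailDir implementation; the function names (`scan_by_name`, `demo`) and the file name `access.log` are hypothetical, chosen only for illustration. It assumes that "new file" simply means "a name not seen before", which is the behavior the reply describes.

```python
import os
import tempfile


def scan_by_name(directory, seen_names):
    """Return names in `directory` not seen before (identity = file name only)."""
    new_files = []
    for name in sorted(os.listdir(directory)):
        if name not in seen_names:
            seen_names.add(name)
            new_files.append(name)
    return new_files


def read_new_files(directory, seen_names, events):
    """Read every newly detected file from the start and append its lines."""
    for name in scan_by_name(directory, seen_names):
        with open(os.path.join(directory, name)) as f:
            events.extend(f.read().splitlines())


def demo():
    events = []   # stands in for events delivered downstream
    seen = set()  # names the watcher has already picked up
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "access.log")
        with open(path, "w") as f:
            f.write("line1\nline2\n")

        # First scan: the file is new, so it is read once.
        read_new_files(d, seen, events)

        # Hypothetical rotation at the hour change: same content, new name.
        os.rename(path, os.path.join(d, "access.log.1"))

        # Second scan: the new name is unknown, so the old content is reread.
        read_new_files(d, seen, events)
    return events


if __name__ == "__main__":
    print(demo())
```

In this sketch the two lines are delivered twice, once under each name. If the web server renames finished log files at the hour boundary, that alone could account for old data showing up again, independent of any e2e retry.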