[Please subscribe to new flume-user@incubator.apache.org list, bcc
flume-u...@cloudera.org, cc flume-user@incubator.apache.org]

I've created a new FAQ entry that covers the causes of duplicates.
Here's a link:

https://cwiki.apache.org/confluence/display/FLUME/Operations+FAQ#OperationsFAQ-Iamgettingalotofduplicatedeventdata.WhyisthishappeningandwhatcanIdotomakethisgoaway%3F

Some questions:

At the hour change, do the file names change?  If an old file gets a new
name, tailDir thinks it is a new file and rereads it.
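
To illustrate the rename case, here is a minimal Python sketch. It is not
Flume's tailDir code; NameKeyedTailer and demo are hypothetical names, and
the only assumption is a watcher that remembers files by name alone. Under
that assumption, a renamed file looks new and every line is emitted again.

```python
import os
import tempfile


class NameKeyedTailer:
    """Hypothetical directory watcher that tracks files by name only."""

    def __init__(self, directory):
        self.directory = directory
        self.offsets = {}  # file name -> number of lines already emitted

    def poll(self):
        """Return the lines not yet emitted for each file name present."""
        emitted = []
        for name in sorted(os.listdir(self.directory)):
            path = os.path.join(self.directory, name)
            with open(path) as f:
                lines = f.read().splitlines()
            # An unknown name starts at offset 0, so a renamed file is reread.
            start = self.offsets.get(name, 0)
            emitted.extend(lines[start:])
            self.offsets[name] = len(lines)
        return emitted


def demo():
    """Read a file, rename it as a rotation would, then poll again."""
    with tempfile.TemporaryDirectory() as d:
        with open(os.path.join(d, "access.log"), "w") as f:
            f.write("a\nb\n")
        tailer = NameKeyedTailer(d)
        first = tailer.poll()
        # Simulate an hourly rotation that renames the already-read file.
        os.rename(os.path.join(d, "access.log"),
                  os.path.join(d, "access.log.1"))
        second = tailer.poll()
        return first, second
```

In this sketch both polls return the same two lines, which is the kind of
duplication a rename at the hour boundary would produce.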

If they do not, it sounds like we should add something to make it easier to
tell whether a dupe was due to e2e retries or due to tail.  Maybe a retry
counter on an event or event group.  I'll file an issue on this.
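
A rough Python sketch of the retry counter idea, to show how it would
separate the two causes. Everything here is hypothetical: Event, its retries
field, resend, and classify_duplicates are illustration names, not an
existing Flume API.

```python
from dataclasses import dataclass


@dataclass
class Event:
    body: str
    retries: int = 0  # hypothetical field: how many times this was resent


def resend(event):
    """Return a copy of the event marked as one more retry."""
    return Event(event.body, event.retries + 1)


def classify_duplicates(events):
    """Split repeated bodies into retry dupes and other (e.g. reread) dupes."""
    seen = set()
    retry_dupes, other_dupes = [], []
    for e in events:
        if e.body in seen:
            # A nonzero counter means the sender knowingly resent the event.
            (retry_dupes if e.retries > 0 else other_dupes).append(e.body)
        seen.add(e.body)
    return retry_dupes, other_dupes
```

A duplicate that arrives with a zero counter was not a resend, so it would
point at the source (for example tail rereading a file) instead of e2e.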

Jon.

On Mon, Jul 11, 2011 at 10:42 PM, Michael Jiang <it.mjji...@gmail.com> wrote:

> We have a web server which writes a new log file to a folder every 5
> minutes. taildir is used to upload logs into hdfs based on date and hour. We
> found strange behavior. At hour T, we found new data not just in the folder
> for T, but also in the folders before T, such as T-1, T-2, ... This seems to
> be some duplication of previous logs? So, my question is, how does taildir
> work? I guess a file in the folder should be tailed only once and the agent
> is able to detect a new file in the folder. Then how did this happen? I know
> that if it is not properly configured and an ACK is lost or not received
> before the timeout, the agent will resend the data. Is there some good
> approach to tell if that is the case? Thanks!
>
> --Michael
>



-- 
// Jonathan Hsieh (shay)
// Software Engineer, Cloudera
// j...@cloudera.com
