Re: duplicate data

Ariel Rabkin Thu, 18 Mar 2010 21:55:18 -0700

The sequence ID of a chunk is, by default, the offset in the file of
its first byte.  We do some fairly complex hacks for file rotation, to
make sure that the IDs continue growing monotonically in that case.
If you start a tailer on a file, and leave it running, each line will
get numbered uniquely. If you stop it, and then start a new one at the
beginning of the file, you'll get duplicate data.
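
As a rough illustration of the numbering described above, here is a
minimal Python sketch, not Chukwa's actual code. It assumes one chunk
per line and uses the byte offset of the line's first byte as the
sequence ID; the name tail_chunks is hypothetical, and file rotation
is not modeled.

```python
def tail_chunks(data: bytes, start: int = 0):
    """Yield (seq_id, line) pairs for a simulated tailer.

    seq_id is the byte offset of the line's first byte, which is the
    default described in this message. One chunk per line is an
    assumption made for the sketch.
    """
    offset = start
    for line in data[start:].splitlines(keepends=True):
        yield offset, line
        offset += len(line)


log = b"alpha\nbeta\ngamma\n"

# A tailer left running numbers every line uniquely.
first_run = list(tail_chunks(log))

# A new tailer started at the beginning of the same file assigns the
# same offsets again, so its chunks repeat the IDs of the first run.
second_run = list(tail_chunks(log))
```

Both runs produce the same IDs, which is why the restarted tailer's
output shows up as duplicate data.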


If you start a tailer, stop it, modify or overwrite the file, and then
start a new tailer, you'll get spurious duplicates.
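
The overwrite case can be sketched the same way. This is an
illustrative Python model only, assuming duplicate detection keys on
(stream name, sequence ID) and never looks at content, as described
earlier in the thread. The helpers chunk_lines and dedupe and the
path /var/log/app.log are hypothetical.

```python
def chunk_lines(stream: str, data: bytes):
    """Build (stream, seq_id, payload) chunks, one per line.

    seq_id is the byte offset of the line's first byte (assumed default).
    """
    chunks, offset = [], 0
    for line in data.splitlines(keepends=True):
        chunks.append((stream, offset, line))
        offset += len(line)
    return chunks


def dedupe(chunks):
    """Keep the first chunk seen for each (stream, seq_id) pair.

    The payload is deliberately ignored, mirroring the rule that the
    content isn't inspected.
    """
    seen = set()
    kept = []
    for stream, seq_id, payload in chunks:
        key = (stream, seq_id)
        if key in seen:
            continue
        seen.add(key)
        kept.append((stream, seq_id, payload))
    return kept


stream = "/var/log/app.log"  # hypothetical stream name
original = chunk_lines(stream, b"one\ntwo\n")
overwritten = chunk_lines(stream, b"uno\ndos\n")  # same file, new content

# The overwritten content reuses offsets 0 and 4, so it is dropped.
kept = dedupe(original + overwritten)

# The same line twice in one file has two offsets, so both are kept.
repeated = dedupe(chunk_lines(stream, b"same\nsame\n"))
```

Here the second tailer's chunks carry new content but old IDs, so they
are discarded as duplicates even though the data differs.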

--Ari

On Thu, Mar 18, 2010 at 9:50 PM, Corbin Hoenes <cor...@tynt.com> wrote:
> So in that scenario the stream name should be the same but how do sequence IDs get 
> generated?  If I tried to tail the same log file 24 hours after doing it the 
> first time would they have the same seq id?
>
> On Mar 18, 2010, at 11:24 AM, Ariel Rabkin wrote:
>
>> Howdy,
>>
>> Chukwa does duplicate detection as follows: Each Chunk of data comes
>> with a stream name (such as the name of a log file) and a sequence ID.
>> If two chunks have the same name and ID, they're duplicates.  The
>> content isn't inspected.
>>
>> So in your example, the former will be treated as a duplicate, not the 
>> latter.
>>
>> --Ari
>>
>> On Thu, Mar 18, 2010 at 8:59 AM, Corbin Hoenes <cor...@tynt.com> wrote:
>>> Does anyone have more information about how chukwa removes duplicates 
>>> during demux? How does it decide what is a duplicate?  There are two cases 
>>> I am thinking of...
>>>
>>> 1 - we send the same log file to chukwa 2x
>>> 2 - we have the exact same line in a log file 2x
>>
>>
>>
>> --
>> Ari Rabkin asrab...@gmail.com
>> UC Berkeley Computer Science Department
>
>



-- 
Ari Rabkin asrab...@gmail.com
UC Berkeley Computer Science Department
