"Omitted chunks" is an error. By definition, if chunks are omitted they won't be there. Duplicates and other peculiarities will happen in the event of failures. As you say, it's a consequence of the distributed environment.
SimpleArchiver should do the cleanup you want.

--Ari

On Mon, Nov 22, 2010 at 11:39 PM, Ying Tang <[email protected]> wrote:
> Hi all,
> After reading the Chukwa docs, my understanding is that the log data flow
> is:
> adaptor --> agent --> collector --> sink file --> ....
> The doc says, "Data in the sink may include duplicate and omitted
> chunks," and it is not recommended to write MapReduce jobs that directly
> examine the data sink, "because jobs will likely discard most of their input".
>
> Here are my questions:
> 1. Why does the data in the sink file include duplicate and omitted chunks?
> Because of the distributed environment?
> 2. How can the problem above be solved? The Simple Archiver generates the
> archive file, and duplicates have been removed. So the Simple Archiver can
> only solve the duplicate data, right?
>
> --
> Best regards,
> Ivy Tang



--
Ari Rabkin [email protected]
UC Berkeley Computer Science Department
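To make the dedup idea concrete, here is a rough sketch (plain Java, not Chukwa's actual SimpleArchiver code) of what that cleanup amounts to: treat each chunk as identified by the stream it came from plus its position in that stream, and keep only the first copy of each identity. The field names (source, streamName, seqID) are assumptions for illustration, not the real Chunk API.

import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class SinkDedupSketch {

    // Illustrative chunk record; field names are assumptions, not Chukwa's API.
    static class SinkChunk {
        final String source;     // host the adaptor ran on
        final String streamName; // e.g. the log file being tailed
        final long seqID;        // position of this chunk within the stream
        final String data;

        SinkChunk(String source, String streamName, long seqID, String data) {
            this.source = source;
            this.streamName = streamName;
            this.seqID = seqID;
            this.data = data;
        }

        // Two chunks are "the same" if they cover the same slice of the same
        // stream, regardless of which collector happened to write them out.
        String identity() {
            return source + "/" + streamName + "@" + seqID;
        }
    }

    // One pass over the sink: keep the first copy of each chunk identity and
    // drop later duplicates (e.g. retransmissions after a collector failure).
    static List<SinkChunk> dedup(List<SinkChunk> sinkChunks) {
        Set<String> seen = new HashSet<>();
        List<SinkChunk> deduped = new ArrayList<>();
        for (SinkChunk c : sinkChunks) {
            if (seen.add(c.identity())) {
                deduped.add(c);
            }
        }
        return deduped;
    }

    public static void main(String[] args) {
        List<SinkChunk> sink = new ArrayList<>();
        sink.add(new SinkChunk("host1", "/var/log/app.log", 0, "line 1"));
        sink.add(new SinkChunk("host1", "/var/log/app.log", 0, "line 1")); // resent after a failure
        sink.add(new SinkChunk("host1", "/var/log/app.log", 100, "line 2"));
        System.out.println(dedup(sink).size() + " unique chunks"); // prints "2 unique chunks"
    }
}

Note that omitted chunks can't be recovered this way: a missing chunk simply never shows up in the sink, which is why duplicate removal is the only cleanup the archiver can do.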
