On Thu, Mar 15, 2012 at 12:36 AM, IvyTang <ivytang0...@gmail.com> wrote:
> As the wiki says, data in the sink may include duplicate and omitted
> chunks, so we need to demux and archive the raw data sink files.
>
> The start-data-processors.sh script runs three processes: ChukwaArchiveManager,
> PostProcessorManager, and DemuxManager.
>
> This page http://incubator.apache.org/chukwa/docs/r0.4.0/dataflow.html
> explains the data workflow.
>
> First, DemuxManager moves the raw *.done files to
> dataSinkArchives/[yyyyMMdd]/*/*.done.
>
> Then, every half hour or so, ChukwaArchiveManager aggregates and removes the
> dataSinkArchives data using M/R, from dataSinkArchives/[yyyyMMdd]/*/*.done
> to finalArchives/.
>
> The complete logflow is logs/*.done
> ==> dataSinkArchives/[yyyyMMdd]/*/*.done ==> finalArchives.
>
> 1. Here I have a question. According to
> http://incubator.apache.org/chukwa/docs/r0.4.0/programming.html#Using+MapReduce
> (Simple Archiver & Demux), the simple archiver removes the duplicates.
> Does "the simple archiver" refer to the ChukwaArchiveManager?
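[Editorial note: the logflow quoted above can be sketched on a local filesystem as follows. This is purely illustrative; the real DemuxManager and ChukwaArchiveManager operate on HDFS, and the archive step is an M/R job, not a copy. All file names below are hypothetical stand-ins.]

```shell
#!/bin/sh
# Illustrative sketch of the Chukwa logflow described above, simulated with
# plain directory moves. Not Chukwa's actual code.
set -e
root=$(mktemp -d)
day=$(date +%Y%m%d)

# Collectors drop completed sink files as logs/*.done
mkdir -p "$root/logs"
echo "chunk-data" > "$root/logs/sample.done"

# Step 1: DemuxManager moves raw *.done files into dataSinkArchives/[yyyyMMdd]/
mkdir -p "$root/dataSinkArchives/$day/hostA"
mv "$root/logs/sample.done" "$root/dataSinkArchives/$day/hostA/"

# Step 2: ChukwaArchiveManager aggregates dataSinkArchives into finalArchives/
# and removes the originals (a plain concatenate-then-delete stands in for M/R).
mkdir -p "$root/finalArchives"
cat "$root/dataSinkArchives/$day"/*/*.done > "$root/finalArchives/archive_$day.arc"
rm -r "$root/dataSinkArchives/$day"

ls "$root/finalArchives"
```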
No, these are separate pieces. Back in the day, I found that ChukwaArchiveManager
was too complicated for my needs, and I wanted a simple command that would just
archive whatever was in the sink. That's the simple archiver. It's found in
org.apache.hadoop.chukwa.extraction.archive.SinkArchiver.

> 3. Can I just run the DemuxManager & ChukwaArchiveManager? I found I
> just need these two components.

Yes, you should be fine with just those two if they meet your needs.

--
Ari Rabkin asrab...@gmail.com
UC Berkeley Computer Science Department
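[Editorial note: the deduplication the simple archiver performs can be illustrated in miniature. In reality SinkArchiver deduplicates Chukwa chunks inside an M/R job; here, duplicate text lines stand in for duplicate chunks, and `sort -u` stands in for the reduce-side collapse. Everything below is a hypothetical sketch, not Chukwa code.]

```shell
#!/bin/sh
# Illustrative only: collapse duplicate records across two sink files,
# mimicking the effect (not the mechanism) of the simple archiver.
set -e
work=$(mktemp -d)

# Two sink files sharing an overlapping record, as can happen when a
# collector write is retried.
printf 'rec1\nrec2\n' > "$work/sink1.done"
printf 'rec2\nrec3\n' > "$work/sink2.done"

# "Archive" them with duplicates removed.
sort -u "$work"/sink*.done > "$work/archive.out"
cat "$work/archive.out"
```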