Hi Bill,

I'd love to see this on our wiki. The description is very close to what is happening in Chukwa data processing. If you could update step 3 with:

- DemuxManager checks for *.done files every 20 seconds.
- If a *.done file exists, move it to demuxProcessing/mrInput for demux.
- If demux is successful within 3 attempts, move demuxProcessing/mrOutput to dataSinkArchive/[yyyyMMdd]/*/*.done; otherwise move the data to the InError directory.
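In rough Java terms, the loop would look something like the sketch below. This is illustrative only -- the directory constants, retry handling, and the runDemuxJob() placeholder are assumptions based on the description above, not the actual DemuxManager code:

    import java.text.SimpleDateFormat;
    import java.util.Date;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class DemuxManagerSketch {
      static final long CHECK_INTERVAL_MS = 20 * 1000L; // *.done scan interval
      static final int MAX_ATTEMPTS = 3;                // demux retries before InError

      public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path mrInput = new Path("demuxProcessing/mrInput");
        fs.mkdirs(mrInput);
        while (true) {
          FileStatus[] done = fs.globStatus(new Path("logs/*.done"));
          if (done != null && done.length > 0) {
            // Stage the closed data sink files for the demux M/R job.
            for (FileStatus f : done) {
              fs.rename(f.getPath(), new Path(mrInput, f.getPath().getName()));
            }
            boolean ok = false;
            for (int attempt = 1; attempt <= MAX_ATTEMPTS && !ok; attempt++) {
              ok = runDemuxJob(); // placeholder for submitting the demux M/R job
            }
            String day = new SimpleDateFormat("yyyyMMdd").format(new Date());
            // Success: hand the output to the archiving stage; failure: park it.
            Path dest = ok ? new Path("dataSinkArchive/" + day) : new Path("InError");
            fs.mkdirs(dest);
            fs.rename(new Path("demuxProcessing/mrOutput"), dest);
          }
          Thread.sleep(CHECK_INTERVAL_MS);
        }
      }

      static boolean runDemuxJob() { /* submit demux M/R job; omitted */ return true; }
    }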
Thanks.

Regards,
Eric

On 2/2/10 10:56 AM, "Bill Graham" <billgra...@gmail.com> wrote:

> I had a lot of questions regarding the data flow as well. I spent a while
> reverse engineering it and wrote something up on our internal wiki. I believe
> this is what's happening. If others with more knowledge could verify what I
> have below, I'll gladly move it to a wiki on the Chukwa site.
>
> Regarding your specific question, I believe the DemuxManager process is the
> first step in aggregating the data sink files. It moves the chunks to the
> dataSinkArchives directory once it's done with them. The ArchiveManager later
> archives those chunks.
>
> 1. Collectors write chunks to logs/*.chukwa files until a 64MB chunk size or
>    a given time interval is reached.
>    * to: logs/*.chukwa
> 2. Collectors close chunks and rename them to *.done.
>    * from: logs/*.chukwa
>    * to: logs/*.done
> 3. DemuxManager wakes up every 20 seconds, runs M/R to merge *.done files,
>    and moves them.
>    * from: logs/*.done
>    * to: demuxProcessing/mrInput
>    * to: demuxProcessing/mrOutput
>    * to: dataSinkArchives/[yyyyMMdd]/*/*.done
> 4. PostProcessManager wakes up every few minutes and aggregates, orders, and
>    de-dups record files.
>    * from: postProcess/demuxOutputDir_*/[clusterName]/[dataType]/[dataType]_[yyyyMMdd]_[HH].R.evt
>    * to: repos/[clusterName]/[dataType]/[yyyyMMdd]/[HH]/[mm]/[dataType]_[yyyyMMdd]_[HH]_[N].[N].evt
> 5. HourlyChukwaRecordRolling runs M/R jobs at 16 past the hour to group
>    5-minute logs into hourly ones.
>    * from: repos/[clusterName]/[dataType]/[yyyyMMdd]/[HH]/[mm]/[dataType]_[yyyyMMdd]_[mm].[N].evt
>    * to: temp/hourlyRolling/[clusterName]/[dataType]/[yyyyMMdd]
>    * to: repos/[clusterName]/[dataType]/[yyyyMMdd]/[HH]/[dataType]_HourlyDone_[yyyyMMdd]_[HH].[N].evt
>    * leaves: repos/[clusterName]/[dataType]/[yyyyMMdd]/[HH]/rotateDone/
> 6. DailyChukwaRecordRolling runs M/R jobs at 1:30 AM to group hourly logs
>    into daily ones.
>    * from: repos/[clusterName]/[dataType]/[yyyyMMdd]/[HH]/[dataType]_[yyyyMMdd]_[HH].[N].evt
>    * to: temp/dailyRolling/[clusterName]/[dataType]/[yyyyMMdd]
>    * to: repos/[clusterName]/[dataType]/[yyyyMMdd]/[dataType]_DailyDone_[yyyyMMdd].[N].evt
>    * leaves: repos/[clusterName]/[dataType]/[yyyyMMdd]/rotateDone/
> 7. Every half hour or so, ChukwaArchiveManager aggregates and removes
>    dataSinkArchives data using M/R.
>    * from: dataSinkArchives/[yyyyMMdd]/*/*.done
>    * to: archivesProcessing/mrInput
>    * to: archivesProcessing/mrOutput
>    * to: finalArchives/[yyyyMMdd]/*/chukwaArchive-part-*
>
> thanks,
> Bill
>
> On Tue, Feb 2, 2010 at 10:21 AM, Corbin Hoenes <cor...@tynt.com> wrote:
>> I am trying to understand the flow of data inside HDFS as it's processed by
>> the data processor script. I see that archive.sh and demux.sh are run, which
>> run ArchiveManager and DemuxManager. Just from reading the code, it appears
>> that they both look at the data sink (default /chukwa/logs).
>>
>> Can someone shed some light on how ArchiveManager and DemuxManager interact?
>> E.g. I was under the impression that the data flowed through the archiving
>> process first and then got fed into the demuxing after it had created .arc
>> files.
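P.S. For anyone tracing the repos/ path scheme in steps 4-6 above, the snippet below just prints the expected file names for one record. The cluster name and data type are made-up example values, and the patterns come straight from Bill's write-up, not from the Chukwa source:

    import java.text.SimpleDateFormat;
    import java.util.Date;

    public class ReposPathSketch {
      public static void main(String[] args) {
        String cluster = "myCluster";   // assumed example value
        String dataType = "SysLog";     // assumed example value
        Date now = new Date();
        String day  = new SimpleDateFormat("yyyyMMdd").format(now);
        String hour = new SimpleDateFormat("HH").format(now);
        String min  = new SimpleDateFormat("mm").format(now);

        // Step 4: PostProcessManager output (5-minute granularity; [N].[N] shown as 0.1)
        System.out.printf("repos/%s/%s/%s/%s/%s/%s_%s_%s_0.1.evt%n",
            cluster, dataType, day, hour, min, dataType, day, hour);
        // Step 5: HourlyChukwaRecordRolling output ([N] shown as 1)
        System.out.printf("repos/%s/%s/%s/%s/%s_HourlyDone_%s_%s.1.evt%n",
            cluster, dataType, day, hour, dataType, day, hour);
        // Step 6: DailyChukwaRecordRolling output ([N] shown as 1)
        System.out.printf("repos/%s/%s/%s/%s_DailyDone_%s.1.evt%n",
            cluster, dataType, day, dataType, day);
      }
    }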