Hi Bill,

I'd love to see this on our wiki. The description is very close to what is happening in Chukwa data processing. If you could update step 3 with:

- DemuxManager checks for *.done files every 20 seconds.
- If a *.done file exists, move it to demuxProcessing/mrInput for demux.
- If demux is successful within 3 attempts, move demuxProcessing/mrOutput to dataSinkArchive/[yyyyMMdd]/*/*.done; otherwise move the data to the InError directory.
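In rough Java terms, the loop would look something like the sketch below. This is illustrative only -- the directory constants, retry handling, and the runDemuxJob() placeholder are assumptions based on the description above, not the actual DemuxManager code:

    import java.text.SimpleDateFormat;
    import java.util.Date;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class DemuxManagerSketch {
      static final long CHECK_INTERVAL_MS = 20 * 1000L; // *.done scan interval
      static final int MAX_ATTEMPTS = 3;                // demux retries before InError

      public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path mrInput = new Path("demuxProcessing/mrInput");
        fs.mkdirs(mrInput);
        while (true) {
          FileStatus[] done = fs.globStatus(new Path("logs/*.done"));
          if (done != null && done.length > 0) {
            // Stage the closed data sink files for the demux M/R job.
            for (FileStatus f : done) {
              fs.rename(f.getPath(), new Path(mrInput, f.getPath().getName()));
            }
            boolean ok = false;
            for (int attempt = 1; attempt <= MAX_ATTEMPTS && !ok; attempt++) {
              ok = runDemuxJob(); // placeholder for submitting the demux M/R job
            }
            String day = new SimpleDateFormat("yyyyMMdd").format(new Date());
            // Success: hand the output to the archiving stage; failure: park it.
            Path dest = ok ? new Path("dataSinkArchive/" + day) : new Path("InError");
            fs.mkdirs(dest);
            fs.rename(new Path("demuxProcessing/mrOutput"), dest);
          }
          Thread.sleep(CHECK_INTERVAL_MS);
        }
      }

      static boolean runDemuxJob() { /* submit demux M/R job; omitted */ return true; }
    }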
Thanks.

Regards,
Eric

On 2/2/10 10:56 AM, "Bill Graham" <billgra...@gmail.com> wrote:

> I had a lot of questions regarding the data flow as well. I spent a while
> reverse engineering it and wrote something up on our internal wiki. I believe
> this is what's happening. If others with more knowledge could verify what I
> have below, I'll gladly move it to a wiki on the Chukwa site.
>
> Regarding your specific question, I believe the DemuxManager process is the
> first step in aggregating the data sink files. It moves the chunks to the
> dataSinkArchives directory once it's done with them. The ArchiveManager later
> archives those chunks.
>
> 1. Collectors write chunks to logs/*.chukwa files until a 64MB chunk size or
>    a given time interval is reached.
>    * to: logs/*.chukwa
> 2. Collectors close chunks and rename them to *.done.
>    * from: logs/*.chukwa
>    * to: logs/*.done
> 3. DemuxManager wakes up every 20 seconds, runs M/R to merge *.done files,
>    and moves them.
>    * from: logs/*.done
>    * to: demuxProcessing/mrInput
>    * to: demuxProcessing/mrOutput
>    * to: dataSinkArchives/[yyyyMMdd]/*/*.done
> 4. PostProcessManager wakes up every few minutes and aggregates, orders, and
>    de-dups record files.
>    * from: postProcess/demuxOutputDir_*/[clusterName]/[dataType]/[dataType]_[yyyyMMdd]_[HH].R.evt
>    * to: repos/[clusterName]/[dataType]/[yyyyMMdd]/[HH]/[mm]/[dataType]_[yyyyMMdd]_[HH]_[N].[N].evt
> 5. HourlyChukwaRecordRolling runs M/R jobs at 16 past the hour to group
>    5-minute logs into hourly ones.
>    * from: repos/[clusterName]/[dataType]/[yyyyMMdd]/[HH]/[mm]/[dataType]_[yyyyMMdd]_[mm].[N].evt
>    * to: temp/hourlyRolling/[clusterName]/[dataType]/[yyyyMMdd]
>    * to: repos/[clusterName]/[dataType]/[yyyyMMdd]/[HH]/[dataType]_HourlyDone_[yyyyMMdd]_[HH].[N].evt
>    * leaves: repos/[clusterName]/[dataType]/[yyyyMMdd]/[HH]/rotateDone/
> 6. DailyChukwaRecordRolling runs M/R jobs at 1:30 AM to group hourly logs
>    into daily ones.
>    * from: repos/[clusterName]/[dataType]/[yyyyMMdd]/[HH]/[dataType]_[yyyyMMdd]_[HH].[N].evt
>    * to: temp/dailyRolling/[clusterName]/[dataType]/[yyyyMMdd]
>    * to: repos/[clusterName]/[dataType]/[yyyyMMdd]/[dataType]_DailyDone_[yyyyMMdd].[N].evt
>    * leaves: repos/[clusterName]/[dataType]/[yyyyMMdd]/rotateDone/
> 7. Every half hour or so, ChukwaArchiveManager aggregates and removes
>    dataSinkArchives data using M/R.
>    * from: dataSinkArchives/[yyyyMMdd]/*/*.done
>    * to: archivesProcessing/mrInput
>    * to: archivesProcessing/mrOutput
>    * to: finalArchives/[yyyyMMdd]/*/chukwaArchive-part-*
>
> thanks,
> Bill
>
> On Tue, Feb 2, 2010 at 10:21 AM, Corbin Hoenes <cor...@tynt.com> wrote:
>> I am trying to understand the flow of data inside HDFS as it's processed by
>> the data processor script. I see that archive.sh and demux.sh are run, which
>> run ArchiveManager and DemuxManager. Just from reading the code, it appears
>> that they both look at the data sink (default /chukwa/logs).
>>
>> Can someone shed some light on how ArchiveManager and DemuxManager interact?
>> E.g. I was under the impression that the data flowed through the archiving
>> process first and then got fed into the demuxing after it had created .arc
>> files.
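P.S. For anyone tracing the repos/ path scheme in steps 4-6 above, the snippet below just prints the expected file names for one record. The cluster name and data type are made-up example values, and the patterns come straight from Bill's write-up, not from the Chukwa source:

    import java.text.SimpleDateFormat;
    import java.util.Date;

    public class ReposPathSketch {
      public static void main(String[] args) {
        String cluster = "myCluster";   // assumed example value
        String dataType = "SysLog";     // assumed example value
        Date now = new Date();
        String day  = new SimpleDateFormat("yyyyMMdd").format(now);
        String hour = new SimpleDateFormat("HH").format(now);
        String min  = new SimpleDateFormat("mm").format(now);

        // Step 4: PostProcessManager output (5-minute granularity; [N].[N] shown as 0.1)
        System.out.printf("repos/%s/%s/%s/%s/%s/%s_%s_%s_0.1.evt%n",
            cluster, dataType, day, hour, min, dataType, day, hour);
        // Step 5: HourlyChukwaRecordRolling output ([N] shown as 1)
        System.out.printf("repos/%s/%s/%s/%s/%s_HourlyDone_%s_%s.1.evt%n",
            cluster, dataType, day, hour, dataType, day, hour);
        // Step 6: DailyChukwaRecordRolling output ([N] shown as 1)
        System.out.printf("repos/%s/%s/%s/%s_DailyDone_%s.1.evt%n",
            cluster, dataType, day, dataType, day);
      }
    }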