Thanks Eric. I added a wiki, with a link from the main wiki page: http://wiki.apache.org/hadoop/Chukwa http://wiki.apache.org/hadoop/Chukwa_Processes_and_Data_Flow
I've added your comments to step 3. Please take a look and feel free to edit as you see fit. Can you verify the location of the InError directory in step 3? I listed it as dataSinkArchives/InError/[yyyyMMdd]/*/*.done, but I'm not positive about that. thanks, Bill On Tue, Feb 2, 2010 at 4:36 PM, Eric Yang <ey...@yahoo-inc.com> wrote: > Hi Bill, > > I love to see this on our wiki. The description is very close to what is > happening in Chukwa data processing. If you could update step 3 with: > > - DemuxManager checks for *.done file every 20 seconds. > - If *.done file exists, move to demuxProcessing/mrInput for demux. > - If demux is successful within 3 attempts, move demuxProcessing/mrOutput > to > dataSinkArchive/[yyyyMMdd]/*/*.done, otherwise move data to InError > directory. > > Thanks > > Regards, > Eric > > On 2/2/10 10:56 AM, "Bill Graham" <billgra...@gmail.com> wrote: > > > I had a lot of questions regarding the data flow as well. I spent a while > > reverse engineering it and wrote something up on our internal wiki. I > believe > > this is what's happening. If others with more knowledge could verify what > I > > have below, I'll gladly move it to a wiki on the Chukwa site. > > > > Regarding your specific question, I believe the DemuxManager process is > the > > first step in aggregating the data sink files. It moves the chunks to the > > dataSinkArchives directory once it's done with them. The ArchiveManager > later > > archives those chunks. > > > > 1. Collectors write chunks to logs/*.chukwa files until a 64MB chunk > size is > > reached or a given time interval is reached. > >> * to: logs/*.chukwa > > 2. Collectors close chunks and rename them to *.done > >> * from: logs/*.chukwa > >> * to: logs/*.done > > 3. DemuxManager wakes up every 20 seconds, runs M/R to merges *.done > files > > and moves them. > >> * from: logs/*.done > >> * to: demuxProcessing/mrInput > >> * to: demuxProcessing/mrOutput > >> * to: dataSinkArchives/[yyyyMMdd]/*/*.done > > 4. PostProcessManager wakes up every few minutes and aggregates, orders > and > > de-dups record files. > >> * from: > >> > postProcess/demuxOutputDir_*/[clusterName]/[dataType]/[dataType]_[yyyyMMdd]_[ > >> HH].R.evt > >> * to: > >> > repos/[clusterName]/[dataType]/[yyyyMMdd]/[HH]/[mm]/[dataType]_[yyyyMMdd]_[HH > >> ]_[N].[N].evt > > 5. HourlyChukwaRecordRolling runs M/R jobs at 16 past the hour to group > 5 > > minute logs to hourly. > >> * from: > >> > repos/[clusterName]/[dataType]/[yyyyMMdd]/[HH]/[mm]/[dataType]_[yyyyMMdd]_[mm > >> ].[N].evt > >> * to: temp/hourlyRolling/[clusterName]/[dataType]/[yyyyMMdd] > >> * to: > >> > repos/[clusterName]/[dataType]/[yyyyMMdd]/[HH]/[dataType]_HourlyDone_[yyyyMMd > >> d]_[HH].[N].evt > >> * leaves: repos/[clusterName]/[dataType]/[yyyyMMdd]/[HH]/rotateDone/ > > 6. DailyChukwaRecordRolling runs M/R jobs at 1:30AM to group hourly logs > to > > daily. > >> * from: > >> > repos/[clusterName]/[dataType]/[yyyyMMdd]/[HH]/[dataType]_[yyyyMMdd]_[HH].[N] > >> .evt > >> * to: temp/dailyRolling/[clusterName]/[dataType]/[yyyyMMdd] > >> * to: > >> > repos/[clusterName]/[dataType]/[yyyyMMdd]/[dataType]_DailyDone_[yyyyMMdd].[N] > >> .evt > >> * leaves: repos/[clusterName]/[dataType]/[yyyyMMdd]/rotateDone/ > > 7. ChukwaArchiveManager every half hour or so aggregates and removes > > dataSinkArchives data using M/R. > >> * from: dataSinkArchives/[yyyyMMdd]/*/*.done > >> * to: archivesProcessing/mrInput > >> * to: archivesProcessing/mrOutput > >> * to: finalArchives/[yyyyMMdd]/*/chukwaArchive-part-* > > > > thanks, > > Bill > > > > On Tue, Feb 2, 2010 at 10:21 AM, Corbin Hoenes <cor...@tynt.com> wrote: > >> I am trying to understand the flow of data inside hdfs as it's processed > by > >> the data processor script. > >> I see the archive.sh and demux.sh are run which runs ArchiveManager and > >> DemuxManager. It appears to that just reading the code that they both > are > >> looking at the data sink (default /chukwa/logs). > >> > >> Can someone shed some light on how ArchiveManager and DemuxManager > interact? > >> E.g. I was under the impression that the data flowed through the > archiving > >> process first then got fed into the demuxing after it had created .arc > files. > >> > > > > > >