On Jul 23, 2010, at 1:51 PM, Eric Yang wrote:

> On 7/23/10 1:17 PM, "William Bajzek" <williambaj...@gmail.com> wrote:
>
>> On Jul 23, 2010, at 9:58 AM, Eric Yang wrote:
>>
>>> MetricDataLoader can be modified to throw IOException to the executor
>>> class, MetricDataLoaderPool, which can in turn throw it to
>>> PostProcessorManager. PostProcessorManager moves the data to a temp
>>> directory. Retry logic can be added to PostProcessorManager by counting
>>> the number of retries for the errored-out sequence file before sending
>>> it to the InError directory. That would be a better way to manage error
>>> conditions.
>>
>> I'll look into this, thanks. Other than that, is the only recourse to
>> move the failed material back into the queue manually?
>
> Yes. This page contains some useful information on the data flow:
>
> http://wiki.apache.org/hadoop/Chukwa_Processes_and_Data_Flow
>
> In step 3.3, the data moves from the demuxProcessing directory to the
> postProcess directory. If the load fails, move the data back to the
> demuxProcessing directory and PostProcessorManager will pick it up and
> attempt the load again.
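For reference, here is a minimal sketch of the retry-counting pattern Eric describes. The class, method names, retry limit, and directory paths below are hypothetical illustrations, not Chukwa's actual API:

    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;

    // Hypothetical sketch: count failed load attempts per sequence file and
    // give up after a fixed number of retries. Names and paths are made up
    // for illustration; they are not taken from the Chukwa codebase.
    public class RetryCountingSketch {
        private static final int MAX_RETRIES = 3;  // assumed retry limit
        private final Map<String, Integer> retryCounts =
            new HashMap<String, Integer>();

        public void process(String sequenceFile) {
            try {
                loadIntoDatabase(sequenceFile);    // may throw IOException
                retryCounts.remove(sequenceFile);  // success: drop the counter
            } catch (IOException e) {
                Integer previous = retryCounts.get(sequenceFile);
                int attempts = (previous == null ? 0 : previous) + 1;
                if (attempts >= MAX_RETRIES) {
                    // too many failures: park the file for manual inspection
                    moveTo(sequenceFile, "/chukwa/InError");
                    retryCounts.remove(sequenceFile);
                } else {
                    // requeue the file so a later run can try again
                    retryCounts.put(sequenceFile, attempts);
                    moveTo(sequenceFile, "/chukwa/demuxProcessing");
                }
            }
        }

        private void loadIntoDatabase(String file) throws IOException {
            // placeholder for the loader call that can fail (e.g. MySQL down)
            throw new IOException("database unavailable");
        }

        private void moveTo(String file, String directory) {
            // placeholder for the HDFS move of the sequence file
            System.out.println("moving " + file + " to " + directory);
        }
    }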
Thanks. I've worked out a way to recover from this kind of failure, so I
thought I'd post it for posterity. To reproduce the failure, I shut down
MySQL before posting my XML and waited long enough for the post processor
to run and fail. In this example, my data type is called "sample_data_1_0".

To reprocess, first stop the data processors:

  $CHUKWA_HOME/bin/stop-data-processors.sh

Then look in postprocess.log and find a line like this for each failed job:

  /tmp/chukwa/logs/postprocess.log:2010-07-26 13:26:40,509 INFO main MoveToRepository - >>>>>>>>>>>> Before Renamehdfs://localhost:9000/chukwa/postProcess/demuxOutputDir_1280175984809/chukwa/sample_data_1_0/sample_data_1_0_20100726_13_25.R.evt -- /chukwa/repos/chukwa/sample_data_1_0/20100726/13/25/sample_data_1_0_20100726_13_25.1.evt

Then, for each failed event, move the file from the repos directory back
into the postProcess directory:

  bin/hadoop fs -mkdir /chukwa/postProcess/demuxOutputDir_1280175984809/chukwa/sample_data_1_0
  bin/hadoop fs -mv /chukwa/repos/chukwa/sample_data_1_0/20100726/13/25/sample_data_1_0_20100726_13_25.1.evt /chukwa/postProcess/demuxOutputDir_1280175984809/chukwa/sample_data_1_0
  bin/hadoop fs -rmr /chukwa/repos/chukwa/sample_data_1_0

Then run $CHUKWA_HOME/bin/start-data-processors.sh and you should be good
to go.

This process halts the whole system beginning with the demux, but once you
have started everything up again, everything queued up should still get
run, so assuming you fixed the source of the problem, you shouldn't lose
anything. (A sketch of scripting the per-file moves follows below the
signature.)

- William Bajzek
williambaj...@gmail.com
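If you have many failed events, the per-file moves above can be tedious.
Here is a rough sketch of the same recovery using Hadoop's FileSystem API;
the demuxOutputDir name and data type are copied from the example above and
would need to be adjusted for your own failed job:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Performs the same moves as the shell commands above: recreate the
    // postProcess directory for the data type, move every .evt file from
    // repos back into it, then remove the repos directory.
    public class RequeueFailedEvents {
        public static void main(String[] args) throws Exception {
            // Assumed values, taken from this example's log line.
            String dataType = "sample_data_1_0";
            Path repos = new Path("/chukwa/repos/chukwa/" + dataType);
            Path postProcess = new Path(
                "/chukwa/postProcess/demuxOutputDir_1280175984809/chukwa/"
                + dataType);

            FileSystem fs = FileSystem.get(new Configuration());
            fs.mkdirs(postProcess);               // bin/hadoop fs -mkdir ...
            moveEvtFiles(fs, repos, postProcess);
            fs.delete(repos, true);               // bin/hadoop fs -rmr ...
        }

        // Walk the date/hour/minute subdirectories under repos and move
        // each .evt file into the postProcess directory.
        private static void moveEvtFiles(FileSystem fs, Path dir, Path dest)
                throws Exception {
            for (FileStatus status : fs.listStatus(dir)) {
                if (status.isDir()) {
                    moveEvtFiles(fs, status.getPath(), dest);
                } else if (status.getPath().getName().endsWith(".evt")) {
                    // bin/hadoop fs -mv <file> <postProcess dir>
                    fs.rename(status.getPath(),
                              new Path(dest, status.getPath().getName()));
                }
            }
        }
    }

Run it (e.g. with hadoop jar) while the data processors are stopped, then
restart them as described above.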