On Thu, 2010-05-13 at 02:42 -0400, Jinxin Zheng wrote:
> Hi all,
> We are using the autotest framework to test the 'libguestfs' software. We
> created a single client-test control file to run our tests, similar to the
> approach used by kvm-autotest.
> We have a lot of test cases (>1000). For every case we call the
> job.run_test() method in that single control file. A complete test job can
> take more than 20 hours to run.

How nice! Are you guys planning to contribute the test cases created for
this purpose?

> Two days ago, we ran the complete job and found that some logically
> successful cases (Python code that did not raise any exception) had 'FAIL'
> lines recorded in the status log file, with strange failure information
> that obviously came from other, failed cases.
> 
> After some investigation, we have probably found the source of the problem.
> job.run_test() calls parallel.fork_start() to run the test in a forked
> child process, and parallel.fork_waitfor() to wait for the child process to
> complete.
> In client/bin/parallel.py we can see that if the child hits a problem, it
> serializes the caught exception to a file named 'error-pid', where pid is
> the process ID of the child process. When the child process completes, the
> parent process checks whether the error-pid file exists. If it finds the
> file, it assumes the test failed, loads the exception from the file, and
> raises it so that job.run_test() records a FAIL entry.
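
The mechanism described above can be sketched roughly as follows. This is a
simplified illustration, not the actual code in client/bin/parallel.py: the
function names are borrowed from the message, but the signatures, the
tmp_dir argument, and the bodies are assumptions made for the example.

```python
import os
import pickle
import tempfile


def fork_start(tmp_dir, func):
    """Run func() in a forked child (hypothetical, simplified signature).

    If func() raises, the child pickles the exception into
    tmp_dir/error-<child pid> before exiting.
    """
    pid = os.fork()
    if pid:
        return pid  # parent: return the child's pid to the caller

    # Child process from here on; it must never return into the caller.
    status = 1
    try:
        func()
        status = 0
    except Exception as exc:
        error_path = os.path.join(tmp_dir, "error-%d" % os.getpid())
        with open(error_path, "wb") as f:
            pickle.dump(exc, f)
    finally:
        os._exit(status)


def fork_waitfor(tmp_dir, pid):
    """Wait for the child, then re-raise any exception it left behind."""
    os.waitpid(pid, 0)
    error_path = os.path.join(tmp_dir, "error-%d" % pid)
    if os.path.exists(error_path):
        # Note: nothing here checks that the file was written by *this*
        # child, which is the weakness discussed in this thread.
        with open(error_path, "rb") as f:
            raise pickle.load(f)


if __name__ == "__main__":
    tmp = tempfile.mkdtemp()
    fork_waitfor(tmp, fork_start(tmp, lambda: None))  # passes silently
```

In this sketch the error file is keyed only by the child's PID and is never
cleaned up, so it outlives the test that produced it.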
> 
> This is problematic for our super-long test job. The pid_max value on our
> systems is generally 32768. Since every test case runs in a separate
> process, and the cases create child processes themselves, the PID space is
> very likely to be exhausted quickly, after which PIDs are recycled for newly
> forked processes.
> 
> On this basis, we believe that the wrong error logs were mistakenly taken
> from previously failed cases: our good cases got the same recycled PIDs as
> cases that had failed hours earlier, so the parent found the stale error-pid
> files and recorded the wrong status for them.
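
To make the failure mode concrete, here is a small simulation of the PID
collision. It does not fork anything; the PID 4242, the test name, and the
helpers record_failure() and check_result() are all hypothetical and exist
only to show how a stale error file is attributed to a later test.

```python
import os
import pickle
import tempfile


def record_failure(tmp_dir, pid, exc):
    """Simulate a failing child: leave error-<pid> behind."""
    with open(os.path.join(tmp_dir, "error-%d" % pid), "wb") as f:
        pickle.dump(exc, f)


def check_result(tmp_dir, pid):
    """Simulate the parent's check: any error-<pid> file means FAIL."""
    error_path = os.path.join(tmp_dir, "error-%d" % pid)
    if os.path.exists(error_path):
        with open(error_path, "rb") as f:
            return "FAIL: %s" % pickle.load(f)
    return "GOOD"


tmp_dir = tempfile.mkdtemp()

# An early test runs as pid 4242 and genuinely fails.
record_failure(tmp_dir, 4242, RuntimeError("case_17 failed"))
first = check_result(tmp_dir, 4242)

# Hours later the PID counter has wrapped around and a passing test
# is given pid 4242 again. It writes no error file, but the old one
# is still on disk, so the parent reports the old failure.
second = check_result(tmp_dir, 4242)

print(first)
print(second)
```

Both calls report the failure of the early test, although the second test
itself raised nothing.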
> 
> In practice, we have successfully reproduced the problem, which directly
> confirms our theory.
> 
> As a workaround, we could set pid_max to some very large number in the
> control file, but this does not solve the problem for super-long jobs or
> for systems where we can't change the pid_max value.
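
A minimal sketch of that workaround, assuming a Linux system where
/proc/sys/kernel/pid_max is writable (which normally requires root). The
helper names are hypothetical, and the largest value the kernel accepts
depends on the kernel and architecture, so the value passed in should be
treated as an assumption too.

```python
PID_MAX_PATH = "/proc/sys/kernel/pid_max"


def read_pid_max(path=PID_MAX_PATH):
    """Return the current pid_max value as an integer."""
    with open(path) as f:
        return int(f.read().strip())


def raise_pid_max(new_value, path=PID_MAX_PATH):
    """Raise pid_max to new_value if it is currently lower.

    Returns the previous value so the caller can restore it later.
    Writing to the real /proc file needs root privileges.
    """
    old_value = read_pid_max(path)
    if new_value > old_value:
        with open(path, "w") as f:
            f.write("%d\n" % new_value)
    return old_value
```

As noted in the message, this only makes PID reuse less likely; it does not
remove the stale files, so a long enough job can still hit the problem.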
> 
> A safer method is to move the error-pid file somewhere else, or rename it,
> after it has been used. This prevents the problem from occurring, but I
> don't know whether the error-pid file is used for anything other than
> logging.

I have checked the base job code, and it seems that once we are done
processing a particular test subprocess, we can safely rename the
error-[PID] file, so the approach you've used in your patch looks good
to me.
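
For readers following along, the rename approach could look something like
the sketch below. The patch itself is not reproduced in this message, so the
function name collect_child_error(), the tmp_dir argument, and the
".processed" suffix are assumptions for illustration only.

```python
import os
import pickle
import tempfile


def collect_child_error(tmp_dir, pid):
    """Return the exception left by child <pid>, or None if it passed.

    After reading error-<pid>, rename it so that a later child which is
    given the same recycled pid does not inherit the old failure.
    """
    error_path = os.path.join(tmp_dir, "error-%d" % pid)
    if not os.path.exists(error_path):
        return None
    with open(error_path, "rb") as f:
        exc = pickle.load(f)
    # Keep the file around for debugging, but under a name the
    # existence check above will never match again.
    os.rename(error_path, error_path + ".processed")
    return exc


if __name__ == "__main__":
    tmp = tempfile.mkdtemp()
    with open(os.path.join(tmp, "error-4242"), "wb") as f:
        pickle.dump(RuntimeError("case_17 failed"), f)
    print(repr(collect_child_error(tmp, 4242)))  # the stored exception
    print(repr(collect_child_error(tmp, 4242)))  # None: file was renamed
```

Renaming rather than deleting keeps the serialized exception available for
inspection after the job finishes.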

> Thanks for your patience. We would like to ask for your advice.

Thanks for your work on this issue!

> Regards.
> Jinxin Zheng
> _______________________________________________
> Autotest mailing list
> [email protected]
> http://test.kernel.org/cgi-bin/mailman/listinfo/autotest


