Hi all,
We are using the autotest framework to test the 'libguestfs' software. We 
created a single client-test control file that runs our tests with an approach 
similar to the one used by kvm-autotest.
We have a lot of test cases (>1000). For every case we call the job.run_test() 
method in that single control file. A complete test job can take more than 
20 hours to run.

Two days ago we ran the complete job and found that some cases that had 
logically succeeded (their Python code did not raise any exception) had 'FAIL' 
lines recorded in the status log file, with strange failure information that 
obviously came from other, failed cases.

After some investigation, we think we have found the source of the problem.
job.run_test() calls parallel.fork_start() to run the test in a forked child 
process, and parallel.fork_waitfor() to wait for that child process to 
complete.
In client/bin/parallel.py we can see that if the child hits any problem, it 
serializes the caught exception to a file named 'error-pid', where pid is the 
process id of the child process. When the child process completes, the parent 
process checks whether the error-pid file exists. If it finds the file, it 
assumes that the test failed, loads the exception from the file and raises it, 
so that job.run_test() records a FAIL entry.
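
To make the mechanism concrete, here is a minimal sketch of how I understand 
it. This is a simplified model, not the actual code in client/bin/parallel.py: 
the names run_in_child, wait_for_child and error_file_path are made up for 
illustration, and it assumes a POSIX system where os.fork() is available.

```python
import os
import pickle
import sys


def error_file_path(tmp_dir, pid):
    # The file name depends only on the child's PID (hypothetical helper).
    return os.path.join(tmp_dir, "error-%d" % pid)


def run_in_child(tmp_dir, func):
    """Fork, run func() in the child, and return the child's PID to the parent."""
    sys.stdout.flush()
    sys.stderr.flush()
    pid = os.fork()
    if pid:
        return pid  # parent: just hand back the child's PID

    # Child: must never return into the caller, so always leave via os._exit().
    exit_code = 1
    try:
        func()
        exit_code = 0
    except Exception as exc:
        # Serialize the caught exception so the parent can re-raise it.
        with open(error_file_path(tmp_dir, os.getpid()), "wb") as f:
            pickle.dump(exc, f)
    finally:
        os._exit(exit_code)


def wait_for_child(tmp_dir, pid):
    """Wait for the child, then re-raise any exception found in error-<pid>."""
    _, status = os.waitpid(pid, 0)
    path = error_file_path(tmp_dir, pid)
    if os.path.exists(path):
        with open(path, "rb") as f:
            raise pickle.load(f)
    if status:
        raise RuntimeError("child %d failed, status=%d" % (pid, status))
```

Note that in this sketch nothing removes error-<pid> after the parent has read 
it, so the file outlives the child that wrote it.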

This is problematic for our very long test job. The pid_max value on our 
systems is generally 32768. Since every test case runs in a separate process, 
and the cases create child processes of their own, the pid_max limit is very 
likely to be reached quickly, after which pids are recycled for newly forked 
processes.
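
A rough back-of-the-envelope estimate shows how quickly this can happen. The 
number of processes per test case below is an assumption chosen only for 
illustration (we have not measured it), and the estimate ignores every other 
process on the machine, which also consumes pids.

```python
def cases_before_pid_reuse(pid_max, pids_per_case):
    """Rough upper bound on the test cases that run before pids start to repeat."""
    return pid_max // pids_per_case


PID_MAX = 32768       # pid_max value mentioned above
PIDS_PER_CASE = 50    # ASSUMPTION: illustrative figure, not a measurement
TOTAL_CASES = 1000    # the job has more than 1000 cases

wrap_after = cases_before_pid_reuse(PID_MAX, PIDS_PER_CASE)
pids_needed = TOTAL_CASES * PIDS_PER_CASE

print("pids start to repeat after about %d cases" % wrap_after)
print("pids needed by the whole job: %d (pid_max is %d)" % (pids_needed, PID_MAX))
```

With these assumed numbers the pid space is exhausted after roughly 655 cases, 
well before the job finishes.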

On this basis, we believe that the wrong error logs were taken from cases that 
had failed earlier: a good case happened to get the same recycled pid as a 
case that had ended hours before, so the parent found the old error-pid file 
and recorded the wrong status for the good case.

We have since reproduced the problem successfully, which directly confirms 
this explanation.

As a workaround, we could set pid_max to a very large number from the control 
file, but this does not solve the problem for extremely long jobs, or on 
systems where we cannot change the pid_max value.
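
For reference, the workaround could look roughly like the sketch below. It 
assumes Linux, where the limit is exposed as /proc/sys/kernel/pid_max and can 
only be written by root. The helper names are hypothetical. As far as I know, 
64-bit kernels accept values up to 4194304 (2^22).

```python
PID_MAX_PATH = "/proc/sys/kernel/pid_max"  # Linux-specific location


def read_pid_max(path=PID_MAX_PATH):
    """Return the current pid_max value."""
    with open(path) as f:
        return int(f.read().strip())


def raise_pid_max(new_value, path=PID_MAX_PATH):
    """Raise pid_max to new_value if it is currently lower (needs root).

    Returns the value in effect afterwards. The limit is never lowered.
    """
    if new_value > read_pid_max(path):
        with open(path, "w") as f:
            f.write("%d\n" % new_value)
    return read_pid_max(path)


# Possible use at the top of a control file (hypothetical, must run as root):
#     raise_pid_max(4194304)
```

This only makes a collision less likely. It does not remove the stale 
error-pid files, so the underlying problem is still there.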

A safer method is to move the error-pid file somewhere else, or rename it, 
once it has been used. This prevents the problem from occurring, but I don't 
know whether the error-pid file is used for anything other than logging.
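
The sketch below illustrates this idea. It is not a patch against 
parallel.py: all function names are hypothetical, the pid is passed in by hand 
so that pid reuse can be simulated without forking, and renaming to a '.done' 
suffix is just one possible choice.

```python
import os
import pickle


def error_file_path(tmp_dir, pid):
    return os.path.join(tmp_dir, "error-%d" % pid)


def record_child_error(tmp_dir, pid, exc):
    """What a failing child does: pickle its exception to error-<pid>."""
    with open(error_file_path(tmp_dir, pid), "wb") as f:
        pickle.dump(exc, f)


def load_child_error(tmp_dir, pid):
    """Check WITHOUT cleanup: the file stays and can match a later child."""
    path = error_file_path(tmp_dir, pid)
    if not os.path.exists(path):
        return None
    with open(path, "rb") as f:
        return pickle.load(f)


def consume_child_error(tmp_dir, pid):
    """Check WITH cleanup: rename the file once it has been read."""
    path = error_file_path(tmp_dir, pid)
    if not os.path.exists(path):
        return None
    with open(path, "rb") as f:
        exc = pickle.load(f)
    # Keep the data for debugging, but free the pid-keyed name.
    # On POSIX, os.rename() replaces an existing '.done' file silently.
    os.rename(path, path + ".done")
    return exc
```

With load_child_error(), a failure recorded under some pid is returned again 
for every later child that gets the same pid. With consume_child_error(), the 
second lookup returns None, so a later good case is no longer marked as FAIL.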

Thanks for your patience. We would appreciate your advice.

Regards.
Jinxin Zheng
_______________________________________________
Autotest mailing list
[email protected]
http://test.kernel.org/cgi-bin/mailman/listinfo/autotest
