On Thu, 2010-05-13 at 02:42 -0400, Jinxin Zheng wrote:
> Hi all,
> we are using the autotest framework to test the 'libguestfs' software. we
> created a single client-test control file to run our tests similarly to the
> methods that come from the kvm-autotest.
> we have a lot of test cases (>1000). for every case we call the
> job.run_test() method in the single control file. it could take more than 20
> hours for us to run a complete test job.

How nice! Are you planning to contribute the test cases created for this
purpose?

> two days ago, we had completely run the job and came across that some
> logically succeeded cases (python code that did not generate any exception)
> had recorded 'FAIL' lines in the status log file, with strange failure
> information that obviously came from some other failed cases.
>
> after some investigation, we have probably found the source of the problem.
> the job.run_test() calls the parallel.fork_start() to run the test in a
> forked child process, and parallel.fork_waitfor() to wait for the child
> process to complete.
> in the client/bin/parallel.py we can see that if the child had any problem,
> it would serialize the catched exception in a file named 'error-pid', with
> pid=the process id of the child process. when the child process completes,
> the parent process then checks for the existence of the error-pid file, if it
> finds the file, then it believes that there was some error in the test, then
> it loads the exception from the file and throws it to allow the
> job.run_test() to record a FAIL info.
>
> this is problematic with our super-long test job. the pid_max value on our
> system is generally 32768. since every test case is run in a separate
> process, and the cases create child processes themselves, the pid_max are
> very likely to be quickly reached and recycled for newly forked processes.
>
> on this basis, we believe that the wrong error logs are mistakenly taken from
> previously failed cases, just because they had the same recycled pids as
> those previous cases that ended hours ago, and found the error-pid file to
> determine the wrong status of our good cases.
>
> practically, we had got successfully reproduced the problem, which directly
> proved our thought.
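
To make the failure mode described above concrete, here is a minimal sketch. It is not the actual client/bin/parallel.py code: the helper names (error_path, record_failure, check_child, simulate), the ".consumed" suffix, and the PID 12345 are all made up for illustration, and it is written for Python 3. It does not fork anything; it only imitates two tests that end up with the same PID, assuming the parent decides pass/fail purely from the presence of an error-<pid> file.

```python
import os
import pickle
import tempfile


def error_path(tmpdir, pid):
    # Hypothetical helper: location of the per-child error file.
    return os.path.join(tmpdir, "error-%d" % pid)


def record_failure(tmpdir, pid, exc):
    # What a failing child would do: serialize its exception to error-<pid>.
    with open(error_path(tmpdir, pid), "wb") as f:
        pickle.dump(exc, f)


def check_child(tmpdir, pid, rename_after_use=False):
    # What the parent would do once the child has exited.
    # Returns the stored exception, or None if there is no error file.
    path = error_path(tmpdir, pid)
    if not os.path.exists(path):
        return None
    with open(path, "rb") as f:
        exc = pickle.load(f)
    if rename_after_use:
        # Move the file aside so a later child that reuses this PID
        # cannot pick it up again.
        os.replace(path, path + ".consumed")
    return exc


def simulate(rename_after_use):
    with tempfile.TemporaryDirectory() as tmpdir:
        pid = 12345  # pretend both tests were given this PID
        # Test A fails and leaves an error file behind.
        record_failure(tmpdir, pid, RuntimeError("test A failed"))
        first = check_child(tmpdir, pid, rename_after_use)
        # Much later, test B passes under the recycled PID and writes
        # no error file of its own.
        second = check_child(tmpdir, pid, rename_after_use)
        return first, second


if __name__ == "__main__":
    for rename in (False, True):
        first, second = simulate(rename)
        print("rename_after_use=%s: test A -> %r, test B -> %r"
              % (rename, first, second))
```

Without the rename, the check for test B finds the leftover file from test A and reports its exception; with the rename, the second check finds nothing. Whether anything else reads the error file after that point is a separate question that the sketch does not cover.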
>
> as a work around, we could set the pid_max value to some very big number in
> the control file, but this does not solve problems for super long jobs or
> somewhere we can't change the pid_max value.
>
> a safer method is to move the error-pid file to some where else or some other
> name after using. this prevents the problem to occur, but I don't know if
> there was any other usage of this error-pid file other than logging purpose.

I have checked the base job code, and it seems that once we are done
processing a particular test subprocess, we can safely rename the
error-[PID] file, so the approach you used in your patch looks good to me.

> thanks for your patience. would like to seek for your advice.

Thanks for your work on this issue!

> Regards.
> Jinxin Zheng
> _______________________________________________
> Autotest mailing list
> [email protected]
> http://test.kernel.org/cgi-bin/mailman/listinfo/autotest
