Hi all,

We are using the autotest framework to test the 'libguestfs' software. We created a single client test control file to run our tests, similar to the approach used in kvm-autotest. We have a lot of test cases (>1000), and for every case we call the job.run_test() method in that single control file. A complete test job can take more than 20 hours to run.

Two days ago we ran the complete job and found that some cases that had logically succeeded (their Python code raised no exception) had 'FAIL' lines recorded in the status log file, with strange failure information that obviously came from other, genuinely failed cases.

After some investigation, we think we have found the source of the problem. job.run_test() calls parallel.fork_start() to run the test in a forked child process, and parallel.fork_waitfor() to wait for the child process to complete. In client/bin/parallel.py we can see that if the child has any problem, it serializes the caught exception to a file named 'error-pid', where pid is the process ID of the child process. When the child process completes, the parent process checks whether the error-pid file exists. If it finds the file, it assumes the test failed, loads the exception from the file, and raises it so that job.run_test() records a FAIL entry.

This is problematic for our very long test job. The pid_max value on our systems is generally 32768. Since every test case runs in a separate process, and the cases create child processes themselves, the PIDs up to pid_max are likely to be used up quickly and then recycled for newly forked processes. On this basis, we believe the wrong error logs were mistakenly taken from previously failed cases: a good case received the same recycled PID as a case that had failed hours earlier, so the parent found the old error-pid file and recorded the wrong status for the good case. We have since reproduced the problem, which confirms this explanation.

As a workaround, we could set pid_max to a very large number in the control file, but this does not solve the problem for extremely long jobs, or on systems where we cannot change pid_max. A safer method is to move the error-pid file somewhere else, or rename it, after it has been used.

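To make the failure mode concrete, here is a minimal sketch in Python. It is not the code from client/bin/parallel.py: record_child_error() and check_child_error() are hypothetical helpers that only mimic the error-pid bookkeeping described above, and the PID 4242 is hard-coded to stand in for a recycled PID. The consume flag illustrates the rename idea.

```python
import os
import pickle
import tempfile


def error_path(debug_dir, pid):
    # Hypothetical helper: the file name is keyed only by the child's PID.
    return os.path.join(debug_dir, "error-%d" % pid)


def record_child_error(debug_dir, pid, exc):
    # What a failing child would do: serialize its exception to error-<pid>.
    with open(error_path(debug_dir, pid), "wb") as f:
        pickle.dump(exc, f)


def check_child_error(debug_dir, pid, consume=False):
    # What the parent would do after waiting for the child with this PID.
    # Returns the stored exception, or None if no error file exists.
    path = error_path(debug_dir, pid)
    if not os.path.exists(path):
        return None
    with open(path, "rb") as f:
        exc = pickle.load(f)
    if consume:
        # Proposed fix (assumption): rename the file after reading it, so a
        # later child that gets the same recycled PID does not see it.
        n = 0
        while os.path.exists("%s-%d" % (path, n)):
            n += 1
        os.rename(path, "%s-%d" % (path, n))
    return exc


if __name__ == "__main__":
    debug_dir = tempfile.mkdtemp()
    # Case A fails in a child whose PID happens to be 4242.
    record_child_error(debug_dir, 4242, RuntimeError("case A failed"))
    print("case A:", check_child_error(debug_dir, 4242))
    # Hours later, passing case B gets the recycled PID 4242. The stale file
    # is still there, so B is wrongly reported with A's error.
    print("case B:", check_child_error(debug_dir, 4242, consume=True))
    # The file was renamed when B read it, so passing case C is clean.
    print("case C:", check_child_error(debug_dir, 4242, consume=True))
```

In this sketch the renamed file stays in the same directory, so the original failure information is still available for later inspection.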
Moving or renaming the file prevents the problem from occurring, but I don't know whether the error-pid file is used for anything other than logging.

Thanks for your patience. I would like to ask for your advice.

Regards,
Jinxin Zheng
_______________________________________________
Autotest mailing list
[email protected]
http://test.kernel.org/cgi-bin/mailman/listinfo/autotest
