On 02/11/15 16:57, [email protected] wrote:
This adds a policy which advises when user should reboot system to avoid
noisy test results due to system becoming unstable, for instance, and
therefore continues testing successfully.
To do this, a new Dmesg class is proposed which is not filtering dmesg and
monitors whether or not one of the following event occurs:
- gpu reset failed (not just gpu reset happened, that happens way too
often and many tests even provoke hangs intentionally)
- gpu crash,
- Oops:
- BUG
- lockdep splat that causes the locking validator to get disabled
If one of these issues happen, piglit test execution is stopped
-terminating test thread pool- and exit with code 3 to inform that reboot
is advised.
Then test execution resume, after rebooting system or not, is done like
usually with command line parameter "resume".

To call it, use command line parameter: --dmesg monitored

Hello Yann,

The rationale behind this patch is very sound and we need something like this. Here are however a list of nitpicks:

- Please send patches with git send-email, otherwise, it makes it impossible for us to comment inline which is the usual process for patch review. Please re-send :)

- varaiable -> variable; double space after "when a reboot may be required"

- I am not a big fan of changing the semantic of arguments that have been there forever. Can you think of a case where the user would not want the test to abort if we reach a state where we cannot trust the result? I am including Dylan on this. Also, if we are to keep these modes, can we rename the "simple" mode to "warning" and "monitored" to "abort"? This would make more sense and clearly state the goal of the modes.

- I see a potential issue with the dmesg-monitoring class, it can generate false positives. Indeed, it completely ignores the fact that there may be more than one GPU installed and that it may come from a different vendor. It would be good to know which gpu the test got run on and only track this gpu for bugs. This is however a complex task and would require env_dump to be done properly. I guess we could pass an argument to env_dump to track the gpu-vendor-dependent codepath and leave the general cases (Oops, BUG and lockdep) up to python. Anyway, better safe than sorry in the mean time!

- I am not a big fan of using dmesg to get information out of the kernel because the output format may change. How about listening for udev events coming from the gpu or reading i915_error_state/i915_wedged ?

That's it for the first round :)

Thanks for the proposal!
Martin
_______________________________________________
Piglit mailing list
[email protected]
http://lists.freedesktop.org/mailman/listinfo/piglit

Reply via email to