> > >-----Original Message----- >From: Daniel Vetter [mailto:[email protected]] On Behalf Of Daniel Vetter >Sent: Friday, December 11, 2015 5:06 PM >To: Morton, Derek J >Cc: Daniel Vetter; [email protected]; Wood, Thomas >Subject: Re: [Intel-gfx] [PATCH i-g-t] gem_flink_race/prime_self_import: >Improve test reliability > >On Fri, Dec 11, 2015 at 10:33:46AM +0000, Morton, Derek J wrote: >> > >> > >> >-----Original Message----- >> >From: Daniel Vetter [mailto:[email protected]] On Behalf Of >> >Daniel Vetter >> >Sent: Thursday, December 10, 2015 12:53 PM >> >To: Morton, Derek J >> >Cc: Daniel Vetter; [email protected]; Wood, Thomas >> >Subject: Re: [Intel-gfx] [PATCH i-g-t] >> >gem_flink_race/prime_self_import: Improve test reliability >> > >> >On Thu, Dec 10, 2015 at 11:51:29AM +0000, Morton, Derek J wrote: >> >> > >> >> > >> >> >-----Original Message----- >> >> >From: Daniel Vetter [mailto:[email protected]] On Behalf Of >> >> >Daniel Vetter >> >> >Sent: Thursday, December 10, 2015 10:13 AM >> >> >To: Morton, Derek J >> >> >Cc: [email protected]; Wood, Thomas >> >> >Subject: Re: [Intel-gfx] [PATCH i-g-t] >> >> >gem_flink_race/prime_self_import: Improve test reliability >> >> > >> >> >On Tue, Dec 08, 2015 at 12:44:44PM +0000, Derek Morton wrote: >> >> >> gem_flink_race and prime_self_import have subtests which read >> >> >> the number of open gem objects from debugfs to determine if >> >> >> objects have leaked during the test. However the test can fail >> >> >> sporadically if the number of gem objects changes due to other process >> >> >> activity. >> >> >> This patch introduces a change to check the number of gem >> >> >> objects several times to filter out any fluctuations. >> >> > >> >> >Why exactly does this happen? IGT tests should be run on bare >> >> >metal, with everything else killed/subdued/shutup. If there's >> >> >still things going on that create objects, we need to stop them from >> >> >doing that. >> >> > >> >> >If this only applies to Android, or some special Android deamon >> >> >them imo check for that at runtime and igt_skip("your setup is >> >> >invalid, deamon %s running\n"); is the correct fix. After all just >> >> >because you sampled for a bit doesn't mean that it wont still >> >> >change right when you start running the test for real, so this is still >> >> >fragile. >> >> >> >> Before running tests on android we do stop everything possible. I >> >> suspect the culprit is coreu getting automatically restarted after >> >> it is stopped. I had additional debug while developing this patch >> >> and what I saw was the system being mostly quiescent but with some >> >> very low level background activity. 1 extra object being created >> >> and then deleted occasionally. Depending on whether it occurred at >> >> the start or end of the test it was resulting in a reported leak of >> >> either 1 or -1 objects. >> >> The patch fixes that issue by taking several samples and requiring >> >> them to be the same, therefore filtering out the low level background >> >> noise. >> >> It would not help if something in the background allocated an >> >> object and kept it allocated, but I have not seen that happen. I >> >> only saw once the object count increasing for 2 consecutive reads >> >> hence the count to 4 to give a margin. The test was failing about >> >> 10%. With this patch I got 100% pass across 300 runs of each of the tests. >> > >> >Hm, piglit checks that there's no other drm clients running. Have you tried >> >re-running that check to zero in on the culprit? >> >> We don't use piglet to run IGT tests on Android. I have had a look at >> what piglet does and added the same check to our scripts. (It reads a list >> of clients from /sys/kernel/debug/dri/0/clients) For CHV it shows a process >> called 'y', though that seems to be some issue on CHV that all driver >> clients are called 'y'. I checked on BXT which properly shows the process >> names and it looks like it is the binder process (which is handling some >> inter process communication). I don't think this is something we can stop. > >Nah, you definitely can't stop binder, won't have an android left after that >;-) > >But it is strange that binder owns these buffers. Binder is just IPC, but like >unix domain sockets you can also throw around file descriptors. So something >on your system is moving open drm fd devices still around. I don't have an >idea what kind of audit/debug tooling binder offers, but there should be a way >to figure out who really owns that file descriptor. >If you're lucky lsof (if android has that, otherwise walk /proc/*/fd/* >symlinks manually) should help. > >Cheers, Daniel
I am still suspicious that it is coreu that is the real culprit. I ran the tests Friday and did some reading of i915_gem_objects in parallel to get the full content of what it reported. I once caught coreu being listed as owning an active gem object during the test. Coreu gets stopped explicitly before tests are run but I suspect the init demon is detecting that it is stopped and is restarting it automatically. Something is reporting in logcat when coreu is stopped anyway. I will address your comments on V2 and submit another patch. //Derek > >> >> If you are concerned about the behaviour when running the test with >> >> a load of background activity I could add code to limit to the >> >> reset of the count and fail the test in that instance. That would >> >> give a benefit of distinguishing a test fail due to excessive >> >> background activity from a detected leak. >> > >> >I'm also concerned for the overhead this causes everyone else. If this >> >really is some Android trouble then I think it'd be good to only compile >> >this on Android. But would still be much better if you can get to a >> >reliably clean test environment. >> >> I will make the loop part android specific. >> >> >> //Derek >> >> > >> >> I would not want to just have the test skip as that introduces a >> >> hole in our test coverage. >> >> >> >> >Also would be good to extract get_stable_obj_count to a proper igt >> >> >library function, if it indeed needs to be this tricky. And then >> >> >add the explanation for why we need this in the gtkdoc. >> >> >> >> I can move the code to an igt library. Which library would you suggest? >> >> Igt_debugfs ? >> > >> >Hm yeah, it's a bit the dumping ground for all things debugfs access >> >;-) -Daniel >> >-- >> >Daniel Vetter >> >Software Engineer, Intel Corporation >> >http://blog.ffwll.ch >> > > >-- >Daniel Vetter >Software Engineer, Intel Corporation >http://blog.ffwll.ch > _______________________________________________ Intel-gfx mailing list [email protected] http://lists.freedesktop.org/mailman/listinfo/intel-gfx
