Re: [Intel-gfx] [RFC 00/22] Gen7 batch buffer command parser
On Tue, Nov 26, 2013 at 08:51:17AM -0800, bradley.d.vol...@intel.com wrote: From: Brad Volkin bradley.d.vol...@intel.com Certain OpenGL features (e.g. transform feedback, performance monitoring) require userspace code to submit batches containing commands such as MI_LOAD_REGISTER_IMM to access various registers. Unfortunately, some generations of the hardware will noop these commands in unsecure batches (which includes all userspace batches submitted via i915) even though the commands may be safe and represent the intended programming model of the device. This series introduces a software command parser similar in operation to the command parsing done in hardware for unsecure batches. However, the software parser allows some operations that would be noop'd by hardware, if the parser determines the operation is safe, and submits the batch as secure to prevent hardware parsing. Currently the series implements this on IVB and HSW. Just one more question... Do you have a branch for people to test? -Chris -- Chris Wilson, Intel Open Source Technology Centre ___ Intel-gfx mailing list Intel-gfx@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/intel-gfx
Re: [Intel-gfx] [RFC 00/22] Gen7 batch buffer command parser
On Wed, Feb 05, 2014 at 10:18:44AM -0800, Volkin, Bradley D wrote: On Wed, Feb 05, 2014 at 02:28:29AM -0800, Chris Wilson wrote: On Tue, Nov 26, 2013 at 08:51:17AM -0800, bradley.d.vol...@intel.com wrote: From: Brad Volkin bradley.d.vol...@intel.com Certain OpenGL features (e.g. transform feedback, performance monitoring) require userspace code to submit batches containing commands such as MI_LOAD_REGISTER_IMM to access various registers. Unfortunately, some generations of the hardware will noop these commands in unsecure batches (which includes all userspace batches submitted via i915) even though the commands may be safe and represent the intended programming model of the device. This series introduces a software command parser similar in operation to the command parsing done in hardware for unsecure batches. However, the software parser allows some operations that would be noop'd by hardware, if the parser determines the operation is safe, and submits the batch as secure to prevent hardware parsing. Currently the series implements this on IVB and HSW. Just one more question... Do you have a branch for people to test? Not at the moment. And as mentioned in the v2 cover letter, it's actually not particularly testable (or mergeable for that matter) right now because of a regression in secure dispatch on nightly. At this moment, I just want to be sure that the fixed dispatch overhead has been minimised. -Chris -- Chris Wilson, Intel Open Source Technology Centre ___ Intel-gfx mailing list Intel-gfx@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/intel-gfx
Re: [Intel-gfx] [RFC 00/22] Gen7 batch buffer command parser
On Wed, Feb 05, 2014 at 02:28:29AM -0800, Chris Wilson wrote: On Tue, Nov 26, 2013 at 08:51:17AM -0800, bradley.d.vol...@intel.com wrote: From: Brad Volkin bradley.d.vol...@intel.com Certain OpenGL features (e.g. transform feedback, performance monitoring) require userspace code to submit batches containing commands such as MI_LOAD_REGISTER_IMM to access various registers. Unfortunately, some generations of the hardware will noop these commands in unsecure batches (which includes all userspace batches submitted via i915) even though the commands may be safe and represent the intended programming model of the device. This series introduces a software command parser similar in operation to the command parsing done in hardware for unsecure batches. However, the software parser allows some operations that would be noop'd by hardware, if the parser determines the operation is safe, and submits the batch as secure to prevent hardware parsing. Currently the series implements this on IVB and HSW. Just one more question... Do you have a branch for people to test? Not at the moment. And as mentioned in the v2 cover letter, it's actually not particularly testable (or mergeable for that matter) right now because of a regression in secure dispatch on nightly. -Chris -- Chris Wilson, Intel Open Source Technology Centre ___ Intel-gfx mailing list Intel-gfx@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/intel-gfx
Re: [Intel-gfx] [RFC 00/22] Gen7 batch buffer command parser
On Wed, Feb 5, 2014 at 7:18 PM, Volkin, Bradley D bradley.d.vol...@intel.com wrote: On Wed, Feb 05, 2014 at 02:28:29AM -0800, Chris Wilson wrote: On Tue, Nov 26, 2013 at 08:51:17AM -0800, bradley.d.vol...@intel.com wrote: From: Brad Volkin bradley.d.vol...@intel.com Certain OpenGL features (e.g. transform feedback, performance monitoring) require userspace code to submit batches containing commands such as MI_LOAD_REGISTER_IMM to access various registers. Unfortunately, some generations of the hardware will noop these commands in unsecure batches (which includes all userspace batches submitted via i915) even though the commands may be safe and represent the intended programming model of the device. This series introduces a software command parser similar in operation to the command parsing done in hardware for unsecure batches. However, the software parser allows some operations that would be noop'd by hardware, if the parser determines the operation is safe, and submits the batch as secure to prevent hardware parsing. Currently the series implements this on IVB and HSW. Just one more question... Do you have a branch for people to test? Not at the moment. And as mentioned in the v2 cover letter, it's actually not particularly testable (or mergeable for that matter) right now because of a regression in secure dispatch on nightly. The command parser itself should still work, even with the regression in -nightly. The copying and secure dispatch are obviously fail atm. That still leaves regression testing of current userspace and micro-optimizing the checker itself as possible things to do. Otoh not sure what exactly Chris wanted to test. -Daniel -- Daniel Vetter Software Engineer, Intel Corporation +41 (0) 79 365 57 48 - http://blog.ffwll.ch ___ Intel-gfx mailing list Intel-gfx@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/intel-gfx
Re: [Intel-gfx] [RFC 00/22] Gen7 batch buffer command parser
On Wed, Feb 05, 2014 at 10:30:00AM -0800, Daniel Vetter wrote: On Wed, Feb 5, 2014 at 7:18 PM, Volkin, Bradley D bradley.d.vol...@intel.com wrote: On Wed, Feb 05, 2014 at 02:28:29AM -0800, Chris Wilson wrote: On Tue, Nov 26, 2013 at 08:51:17AM -0800, bradley.d.vol...@intel.com wrote: From: Brad Volkin bradley.d.vol...@intel.com Certain OpenGL features (e.g. transform feedback, performance monitoring) require userspace code to submit batches containing commands such as MI_LOAD_REGISTER_IMM to access various registers. Unfortunately, some generations of the hardware will noop these commands in unsecure batches (which includes all userspace batches submitted via i915) even though the commands may be safe and represent the intended programming model of the device. This series introduces a software command parser similar in operation to the command parsing done in hardware for unsecure batches. However, the software parser allows some operations that would be noop'd by hardware, if the parser determines the operation is safe, and submits the batch as secure to prevent hardware parsing. Currently the series implements this on IVB and HSW. Just one more question... Do you have a branch for people to test? Not at the moment. And as mentioned in the v2 cover letter, it's actually not particularly testable (or mergeable for that matter) right now because of a regression in secure dispatch on nightly. The command parser itself should still work, even with the regression in -nightly. The copying and secure dispatch are obviously fail atm. That still leaves regression testing of current userspace and micro-optimizing the checker itself as possible things to do. Otoh not sure what exactly Chris wanted to test. To test/merge, we'd have to change the series to take out the part where patch 02/13 sets I915_DISPATCH_SECURE to avoid a BUG_ON() when i915.enable_cmd_parser=1. But yes, otherwise the parsing works and I think should be sufficient for what Chris indicated he wants to test. - Brad -Daniel -- Daniel Vetter Software Engineer, Intel Corporation +41 (0) 79 365 57 48 - http://blog.ffwll.ch ___ Intel-gfx mailing list Intel-gfx@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/intel-gfx
Re: [Intel-gfx] [RFC 00/22] Gen7 batch buffer command parser
On Wed, Feb 5, 2014 at 8:00 PM, Volkin, Bradley D bradley.d.vol...@intel.com wrote: To test/merge, we'd have to change the series to take out the part where patch 02/13 sets I915_DISPATCH_SECURE to avoid a BUG_ON() when i915.enable_cmd_parser=1. But yes, otherwise the parsing works and I think should be sufficient for what Chris indicated he wants to test. Oh, I didn't spot this but this needs to be moved way back in the series - we can only set the bit once we have the batchbuffer copy logic in place. Otherwise there's a security hole open since userspace is free to frob the batch residing in the ppgtt, which we just can't prevent. -Daniel -- Daniel Vetter Software Engineer, Intel Corporation +41 (0) 79 365 57 48 - http://blog.ffwll.ch ___ Intel-gfx mailing list Intel-gfx@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/intel-gfx
Re: [Intel-gfx] [RFC 00/22] Gen7 batch buffer command parser
On Wed, Feb 05, 2014 at 11:17:25AM -0800, Daniel Vetter wrote: On Wed, Feb 5, 2014 at 8:00 PM, Volkin, Bradley D bradley.d.vol...@intel.com wrote: To test/merge, we'd have to change the series to take out the part where patch 02/13 sets I915_DISPATCH_SECURE to avoid a BUG_ON() when i915.enable_cmd_parser=1. But yes, otherwise the parsing works and I think should be sufficient for what Chris indicated he wants to test. Oh, I didn't spot this but this needs to be moved way back in the series - we can only set the bit once we have the batchbuffer copy logic in place. Otherwise there's a security hole open since userspace is free to frob the batch residing in the ppgtt, which we just can't prevent. Good point. I'll take it out and we can add it as part of the batch copy work. -Daniel -- Daniel Vetter Software Engineer, Intel Corporation +41 (0) 79 365 57 48 - http://blog.ffwll.ch ___ Intel-gfx mailing list Intel-gfx@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/intel-gfx
Re: [Intel-gfx] [RFC 00/22] Gen7 batch buffer command parser
On Tue, Dec 10, 2013 at 04:58:18PM -0800, Volkin, Bradley D wrote: So, I have a functioning kmap_atomic based parser using an sg_mapping_iter, and in the tests I'm running, it's worse than the vmap approach. This is still without the batch copy, but I think it's relevant anyhow. I haven't done much in the way of analysis as to why we're getting these results. At a high level, the vmap approach leads to simple code with a few function calls and control structures. The per-page approach has somewhat more complex logic around mapping the next page at the right time and checking for commands that cross a page boundary. I'd guess that difference factors into it. Due to the risk of regressions, I think it would be better not to delay getting broader functional and performance testing by waiting to sort this out. I'd rather stick with vmap for now and revisit that overhead once we have more concrete performance data to work from. Yeah, makes sense. Just to check: Was that on hsw with llc coherency or on vlv? Might be that the clflushing shifts the picture around a bit. I'll propose that I send an updated series later this week or next that addresses feedback from the review, including better handling for secure and chained batches, the sync fixes for gem_cpu_reloc, some of the additional tests you suggested, and possibly an attempt at batch copy. If that goes well, could we see about getting the patches into the hands of your QA team for further testing? We unfortunately don't really have tons of spare cycles from our QA team for testing branches (pretty much none actually), so the usual approach is to review and merge patches without first going through QA. If we pull in your new i-g-ts first we should have decent assurance that nothing blows up. And since kernel patch series should always be fully bisectable we can stop at any point in time if something goes wrong. -Daniel -- Daniel Vetter Software Engineer, Intel Corporation +41 (0) 79 365 57 48 - http://blog.ffwll.ch ___ Intel-gfx mailing list Intel-gfx@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/intel-gfx
Re: [Intel-gfx] [RFC 00/22] Gen7 batch buffer command parser
On Wed, Dec 11, 2013 at 01:54:40AM -0800, Daniel Vetter wrote: On Tue, Dec 10, 2013 at 04:58:18PM -0800, Volkin, Bradley D wrote: So, I have a functioning kmap_atomic based parser using an sg_mapping_iter, and in the tests I'm running, it's worse than the vmap approach. This is still without the batch copy, but I think it's relevant anyhow. I haven't done much in the way of analysis as to why we're getting these results. At a high level, the vmap approach leads to simple code with a few function calls and control structures. The per-page approach has somewhat more complex logic around mapping the next page at the right time and checking for commands that cross a page boundary. I'd guess that difference factors into it. Due to the risk of regressions, I think it would be better not to delay getting broader functional and performance testing by waiting to sort this out. I'd rather stick with vmap for now and revisit that overhead once we have more concrete performance data to work from. Yeah, makes sense. Just to check: Was that on hsw with llc coherency or on vlv? Might be that the clflushing shifts the picture around a bit. IVB and HSW. There's now a wait_rendering() call that should cover the gem_cpu_reloc case. I haven't had a chance to go back and test on VLV to see if the clflushing helps with the other coherency issue. I'll propose that I send an updated series later this week or next that addresses feedback from the review, including better handling for secure and chained batches, the sync fixes for gem_cpu_reloc, some of the additional tests you suggested, and possibly an attempt at batch copy. If that goes well, could we see about getting the patches into the hands of your QA team for further testing? We unfortunately don't really have tons of spare cycles from our QA team for testing branches (pretty much none actually), so the usual approach is to review and merge patches without first going through QA. If we pull in your new i-g-ts first we should have decent assurance that nothing blows up. And since kernel patch series should always be fully bisectable we can stop at any point in time if something goes wrong. Ok, sounds good. I'm fine with whatever approach gets us the test coverage soonest. Brad -Daniel -- Daniel Vetter Software Engineer, Intel Corporation +41 (0) 79 365 57 48 - http://blog.ffwll.ch ___ Intel-gfx mailing list Intel-gfx@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/intel-gfx
Re: [Intel-gfx] [RFC 00/22] Gen7 batch buffer command parser
On Wed, Dec 11, 2013 at 7:04 PM, Volkin, Bradley D bradley.d.vol...@intel.com wrote: We unfortunately don't really have tons of spare cycles from our QA team for testing branches (pretty much none actually), so the usual approach is to review and merge patches without first going through QA. If we pull in your new i-g-ts first we should have decent assurance that nothing blows up. And since kernel patch series should always be fully bisectable we can stop at any point in time if something goes wrong. Ok, sounds good. I'm fine with whatever approach gets us the test coverage soonest. One thing that's always important is to get tangential prep work to the beginning of your patch series as much as possible. That way we can mergetest those glue patches (and their impact on the rest of the driver) even when review is blocked on some contentious topic that affects the core of a new feature. -Daniel -- Daniel Vetter Software Engineer, Intel Corporation +41 (0) 79 365 57 48 - http://blog.ffwll.ch ___ Intel-gfx mailing list Intel-gfx@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/intel-gfx
Re: [Intel-gfx] [RFC 00/22] Gen7 batch buffer command parser
[snip] On Tue, Nov 26, 2013 at 11:35:38AM -0800, Daniel Vetter wrote: 2) Coherency. I've found two types of coherency issues when reading the batch buffer from the CPU during execbuffer2. Looking for help with both issues. i. First, the i-g-t test gem_cpu_reloc blits to a batch buffer and the parser isn't properly waiting for the operation to complete before parsing. I tried adding i915_gem_object_sync(batch_obj, [ring|NULL]) but that actually caused more failures. This synchronization should happen when processing the relocations. The batch itself isn't modified by the gpu, we simply upload it using the blitter. So this going wrong indicates there's some issue somewhere ... I looked at this again. The blitter copy uploads a batch full of noops and there are no relocations associated with the failing batch, as far as I can tell. I suspect that would explain why relocation handling wouldn't catch this. In any case, I copied the relevant setup code from shmem pread, and it resolved the issue. ii. Second, on VLV, I've seen cache coherency issues when userspace writes the batch via pwrite fast path before calling execbuffer2. The parser reads stale data. This works fine on IVB and HSW, so I believe it's an LLC vs. non-LLC issue. I'm just unclear on what the correct flushing or synchronization is for this scenario. Imo we take a good look at the optimized buffer read/write code from i915_gem_shmem_pread (for reading the userspace batch) and i915_gem_shmem_pwrite (for writing to the checked buffer). If we do the checking in-line with the reading this should also bring down the overhead massively compared to the current solution (those shmem read/write functions are fairly optimized). So, I have a functioning kmap_atomic based parser using an sg_mapping_iter, and in the tests I'm running, it's worse than the vmap approach. This is still without the batch copy, but I think it's relevant anyhow. I haven't done much in the way of analysis as to why we're getting these results. At a high level, the vmap approach leads to simple code with a few function calls and control structures. The per-page approach has somewhat more complex logic around mapping the next page at the right time and checking for commands that cross a page boundary. I'd guess that difference factors into it. Due to the risk of regressions, I think it would be better not to delay getting broader functional and performance testing by waiting to sort this out. I'd rather stick with vmap for now and revisit that overhead once we have more concrete performance data to work from. I'll propose that I send an updated series later this week or next that addresses feedback from the review, including better handling for secure and chained batches, the sync fixes for gem_cpu_reloc, some of the additional tests you suggested, and possibly an attempt at batch copy. If that goes well, could we see about getting the patches into the hands of your QA team for further testing? Brad 3) 2nd-level batches. The parser currently allows MI_BATCH_BUFFER_START commands in userspace batches without parsing them. The media driver uses 2nd-level batches, so a solution is required. I have some ideas but don't want to delay the review process for what I have so far. It may be that the 2nd-level parsing is only needed for VCS and the current code (plus rejecting BBS) would be sufficient for RCS. Afaik only libva uses second-level batches, and only on the vcs. So I hope we can just submit those as unpriviledged batches if possible. If that's not possible it'll get fairly ugly I fear :( 4) Command buffer copy. To avoid CPU modifications to buffers after parsing, and to avoid GPU modifications to buffers via EUs or commands in the batch, we should copy the userspace batch buffer to memory that userspace does not have access to, map it into GGTT, and execute that batch buffer. I have a sense of how to do this for 1st-level batches, but it would need changes to tie in with the 2nd-level batch parsing I think, so I've again held off. Yeah, we need the copying for otherwise the parsing is fairly pointless. I've stumbled over some of your internally patches and had a quick look at them. Two high-level comments: - Using the existing active buffer lru instead of manual pinning would better integrate with the eviction code. For an example of in-kernel objects (and not userspace objects) using this look at the hw context code. - Imo we should tag all buffers as purgeable while they're in the cache. That way the shrinker will automatically drop the backing storage if memory runs low (and thanks to the active list lru only when the gpu has stopped processing the batch). That way we can just keep on allocating buffers if they're all busy without any concern
Re: [Intel-gfx] [RFC 00/22] Gen7 batch buffer command parser
On Tue, Nov 26, 2013 at 12:24:14PM -0800, Volkin, Bradley D wrote: On Tue, Nov 26, 2013 at 11:35:38AM -0800, Daniel Vetter wrote: I think long-term we should even scan secure batches. We'd need to allow some registers which only the drm master (i.e. owner of the display hardware) is allowed to do, e.g. for scanline waits. But once we have that we should be able to port all current users of secure batches over to scanned batches and so enforce this everywhere by default. The other issue is that igt tests assume to be able to run some evil tests, so maybe we don't actually want this. Agreed. I thought we could handle this as a follow-up task once the basic stuff is in place, particularly given that we'd want to modify at least some users to test. I also wasn't sure if we would want the check to be root master, as in the current secure flag, or just master. So my plan to initially not parse secure batches might be shot. During further testing, I found that it looks like Ubuntu 13.10 ships with fdo bug 71328 out of the box (sna doesn't set the EXEC_SECURE flag when doing scanline waits). Sooo... If we parse all batches and allow extra commands/registers from the drm master, should that list just be the commands/registers used for scanline waits? Are there others you can think of? Brad ___ Intel-gfx mailing list Intel-gfx@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/intel-gfx
Re: [Intel-gfx] [RFC 00/22] Gen7 batch buffer command parser
On Thu, Dec 5, 2013 at 9:47 PM, Volkin, Bradley D bradley.d.vol...@intel.com wrote: On Tue, Nov 26, 2013 at 12:24:14PM -0800, Volkin, Bradley D wrote: On Tue, Nov 26, 2013 at 11:35:38AM -0800, Daniel Vetter wrote: I think long-term we should even scan secure batches. We'd need to allow some registers which only the drm master (i.e. owner of the display hardware) is allowed to do, e.g. for scanline waits. But once we have that we should be able to port all current users of secure batches over to scanned batches and so enforce this everywhere by default. The other issue is that igt tests assume to be able to run some evil tests, so maybe we don't actually want this. Agreed. I thought we could handle this as a follow-up task once the basic stuff is in place, particularly given that we'd want to modify at least some users to test. I also wasn't sure if we would want the check to be root master, as in the current secure flag, or just master. So my plan to initially not parse secure batches might be shot. During further testing, I found that it looks like Ubuntu 13.10 ships with fdo bug 71328 out of the box (sna doesn't set the EXEC_SECURE flag when doing scanline waits). Sooo... If we parse all batches and allow extra commands/registers from the drm master, should that list just be the commands/registers used for scanline waits? Are there others you can think of? No, I think the additional stuff the drm master is allowed to do is just the scanline waits. Everything else that the ddx might or might not be doing should also be possible as a normal process (e.g. the BTILE register to switch the blitter into y-tiled mode). But Chris should know if there's anything else really. -Daniel -- Daniel Vetter Software Engineer, Intel Corporation +41 (0) 79 365 57 48 - http://blog.ffwll.ch ___ Intel-gfx mailing list Intel-gfx@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/intel-gfx
Re: [Intel-gfx] [RFC 00/22] Gen7 batch buffer command parser
On Wed, Dec 4, 2013 at 9:13 AM, Daniel Vetter dan...@ffwll.ch wrote: Ok, I'll look at the hw context code for buffer mgmt. For purgeable, just via the madv field in the i915 gem object? Yeah, though I'd extract two tiny helpers (maybe shared with the madvise ioctl) to set an object to purgeable and then resurrect it. The later can obviously fail. The helpers are just so we have a place to throw debug asserts into, maybe there are other needs for in-kernel caches. Since we've had tons of fun in the past few months with low memory handling I think an evil testcase to exercise this cache purging logic a bit would be good. Quick idea would be to submit dummy batches that double in size every time (to ensure we have a miss in the cache), with the last batch careful sized so that the userspace batch plus the kernel copy will still fit, but not together with all the older cached batches. I expect that our other memory thrashing tests will also exercise the cache purging, but having a specific test which doesn't exercise anything else really helps to separate issues. -Daniel -- Daniel Vetter Software Engineer, Intel Corporation +41 (0) 79 365 57 48 - http://blog.ffwll.ch ___ Intel-gfx mailing list Intel-gfx@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/intel-gfx
Re: [Intel-gfx] [RFC 00/22] Gen7 batch buffer command parser
On Wed, Dec 04, 2013 at 12:13:39AM -0800, Daniel Vetter wrote: On Tue, Nov 26, 2013 at 9:24 PM, Volkin, Bradley D bradley.d.vol...@intel.com wrote: [snip] Which state setup stuff are you referring to? Something specific in i-g-t or something more general? The state setup 3D commands as opposed to doing actual rendering commands (with 3D_PRIM). Just to have a bit more realistic cs opcode lengths for micro-benchmarking. [snip] Ok, I'll look at the hw context code for buffer mgmt. For purgeable, just via the madv field in the i915 gem object? Yeah, though I'd extract two tiny helpers (maybe shared with the madvise ioctl) to set an object to purgeable and then resurrect it. The later can obviously fail. The helpers are just so we have a place to throw debug asserts into, maybe there are other needs for in-kernel caches. Hmm. Not sure how much use the helpers would be. The ioctl has one case where it will truncate() the object, but beyond that it just sets madv appropriately. In general, purging and resurrecting appear to happen via a put_pages() from the shrinker and a get_pages() the next time the object is needed. Also, I haven't looked too closely at the shrinker code. Could there be potential races between the shrinker purging a buffer and an execbuf call choosing it? I would expect that synchronization is already in place. Just curious if you know off the top of your head. Brad Also, there are a couple iterations of the work-in-progress patches. Do you prefer a cache per ring or a single cache shared by all rings? I've pondered a bunch of reasons for/against the two approaches and I think it won't really matter. Maybe slightly leaning towards per-ring caches since then objects retire in order. Well, until we do preemption ;-) -Daniel -- Daniel Vetter Software Engineer, Intel Corporation +41 (0) 79 365 57 48 - http://blog.ffwll.ch ___ Intel-gfx mailing list Intel-gfx@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/intel-gfx
Re: [Intel-gfx] [RFC 00/22] Gen7 batch buffer command parser
On Thu, Dec 5, 2013 at 2:40 AM, Volkin, Bradley D bradley.d.vol...@intel.com wrote: Ok, I'll look at the hw context code for buffer mgmt. For purgeable, just via the madv field in the i915 gem object? Yeah, though I'd extract two tiny helpers (maybe shared with the madvise ioctl) to set an object to purgeable and then resurrect it. The later can obviously fail. The helpers are just so we have a place to throw debug asserts into, maybe there are other needs for in-kernel caches. Hmm. Not sure how much use the helpers would be. The ioctl has one case where it will truncate() the object, but beyond that it just sets madv appropriately. In general, purging and resurrecting appear to happen via a put_pages() from the shrinker and a get_pages() the next time the object is needed. Yeah, the helper would be indeed minimal. But they'd allow us to shovel checks like obj-pin_count or similar stuff in there to sanity-check users. Also, I haven't looked too closely at the shrinker code. Could there be potential races between the shrinker purging a buffer and an execbuf call choosing it? I would expect that synchronization is already in place. Just curious if you know off the top of your head. As long as you try to set the object back to WILLNEED before you start using it and free your reference if that fails (i.e. it's state is PURGED) then the magic just happens. I just reviewed the code a bit and it seems like we're missing a few sanity checks and don't properly reject purgeable objects in e.g. bind_to_vm. For marking the object purgeable you only have to take care to do that after its on the active list. The shrinker will then make sure to wait for the gpu to complete processing before it drops the backing storage. Misordering here would be caught by a obj-pin_count check, hence why I've suggested that. -Daniel -- Daniel Vetter Software Engineer, Intel Corporation +41 (0) 79 365 57 48 - http://blog.ffwll.ch ___ Intel-gfx mailing list Intel-gfx@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/intel-gfx
Re: [Intel-gfx] [RFC 00/22] Gen7 batch buffer command parser
On Wed, Nov 27, 2013 at 09:32:32AM +0800, ykzhao wrote: On Tue, 2013-11-26 at 13:24 -0700, Volkin, Bradley D wrote: On Tue, Nov 26, 2013 at 11:35:38AM -0800, Daniel Vetter wrote: Hi Brad, On Tue, Nov 26, 2013 at 08:51:17AM -0800, bradley.d.vol...@intel.com wrote: From: Brad Volkin bradley.d.vol...@intel.com Certain OpenGL features (e.g. transform feedback, performance monitoring) require userspace code to submit batches containing commands such as MI_LOAD_REGISTER_IMM to access various registers. Unfortunately, some generations of the hardware will noop these commands in unsecure batches (which includes all userspace batches submitted via i915) even though the commands may be safe and represent the intended programming model of the device. This series introduces a software command parser similar in operation to the command parsing done in hardware for unsecure batches. However, the software parser allows some operations that would be noop'd by hardware, if the parser determines the operation is safe, and submits the batch as secure to prevent hardware parsing. Currently the series implements this on IVB and HSW. The series is divided into several phases: patches 01-09: These implement infrastructure and the command parsing algorithm, all behind a module parameter. I expect some discussion and rework, but hopefully there's nothing too controversial. patches 10-17: These define the checks performed by the parser. I expect much discussion :) patches 18-20: In a final pass over the command checks, I found some issues with the definitions. They looked painful to rebase in, so I've added them here. patches 21-22: These enable the parser by default. It runs on all batches except those that set the I915_EXEC_SECURE flag in the execbuffer2 call. I think long-term we should even scan secure batches. We'd need to allow some registers which only the drm master (i.e. owner of the display hardware) is allowed to do, e.g. for scanline waits. But once we have that we should be able to port all current users of secure batches over to scanned batches and so enforce this everywhere by default. The other issue is that igt tests assume to be able to run some evil tests, so maybe we don't actually want this. Agreed. I thought we could handle this as a follow-up task once the basic stuff is in place, particularly given that we'd want to modify at least some users to test. I also wasn't sure if we would want the check to be root master, as in the current secure flag, or just master. W.r.t. the tests, I suppose we can just turn checking on for secure batches and see what happens. There are follow-up patches to libdrm and to i-g-t. The i-g-t tests are very basic and do not test all of the commands used by the parser on the assumption that I'm likely to make the same mistakes in both the parser and the test. Yeah, I agree that just checking whether commands all go through (or not) as expected adds very little value on top of the few tests you have done. I think we should take a look at some corner cases which might trip up your checker a bit though: - I think we should check batchbuffer chaining and make sure it works on the vcs ring and not anywhere else (we can't ever break shipping libva which uses this). - Some tests to trip up your parser should be done, like 3D commands that fall off the end of the batch bo. Or commands that span page boundaries. The later isn't an issue atm since you use vmap, but we should switch to per-page kmap since the vmap overhead is fairly horrible. Good suggestions. I'll look into these. Hi, Brad More inputs from libva about the batchbuffer chaining. Now the batchbuffer chaining is widely used in libva driver. This is related with how the libva driver processes the image. For the encoding purpose, it needs to be handled based on macroblock(16x16).And every macroblock needs a group of GPU commands. So the GPU commands for all the macroblocks will be constructed in the second-level batchbuffer. The mode of batchbuffer chaining will bring the following benefits: a. The size of second-level batch buffer can be allocated based on the size of handled image. For example: 1080p/720p/480p can use the different size. b. The gpu commands in second-level batchbuffer can be constructed by using GPU instead of CPU, which is helpful to improve the performance. At the same time both VCS and Render Ring are used in libva driver. For example: The encoding will use VCS and RCS ring. Firstly the RCS ring is used to execute GPU command for
Re: [Intel-gfx] [RFC 00/22] Gen7 batch buffer command parser
On Wed, 2013-11-27 at 09:10 +0100, Daniel Vetter wrote: On Wed, Nov 27, 2013 at 09:32:32AM +0800, ykzhao wrote: On Tue, 2013-11-26 at 13:24 -0700, Volkin, Bradley D wrote: On Tue, Nov 26, 2013 at 11:35:38AM -0800, Daniel Vetter wrote: Hi Brad, On Tue, Nov 26, 2013 at 08:51:17AM -0800, bradley.d.vol...@intel.com wrote: From: Brad Volkin bradley.d.vol...@intel.com Certain OpenGL features (e.g. transform feedback, performance monitoring) require userspace code to submit batches containing commands such as MI_LOAD_REGISTER_IMM to access various registers. Unfortunately, some generations of the hardware will noop these commands in unsecure batches (which includes all userspace batches submitted via i915) even though the commands may be safe and represent the intended programming model of the device. This series introduces a software command parser similar in operation to the command parsing done in hardware for unsecure batches. However, the software parser allows some operations that would be noop'd by hardware, if the parser determines the operation is safe, and submits the batch as secure to prevent hardware parsing. Currently the series implements this on IVB and HSW. The series is divided into several phases: patches 01-09: These implement infrastructure and the command parsing algorithm, all behind a module parameter. I expect some discussion and rework, but hopefully there's nothing too controversial. patches 10-17: These define the checks performed by the parser. I expect much discussion :) patches 18-20: In a final pass over the command checks, I found some issues with the definitions. They looked painful to rebase in, so I've added them here. patches 21-22: These enable the parser by default. It runs on all batches except those that set the I915_EXEC_SECURE flag in the execbuffer2 call. I think long-term we should even scan secure batches. We'd need to allow some registers which only the drm master (i.e. owner of the display hardware) is allowed to do, e.g. for scanline waits. But once we have that we should be able to port all current users of secure batches over to scanned batches and so enforce this everywhere by default. The other issue is that igt tests assume to be able to run some evil tests, so maybe we don't actually want this. Agreed. I thought we could handle this as a follow-up task once the basic stuff is in place, particularly given that we'd want to modify at least some users to test. I also wasn't sure if we would want the check to be root master, as in the current secure flag, or just master. W.r.t. the tests, I suppose we can just turn checking on for secure batches and see what happens. There are follow-up patches to libdrm and to i-g-t. The i-g-t tests are very basic and do not test all of the commands used by the parser on the assumption that I'm likely to make the same mistakes in both the parser and the test. Yeah, I agree that just checking whether commands all go through (or not) as expected adds very little value on top of the few tests you have done. I think we should take a look at some corner cases which might trip up your checker a bit though: - I think we should check batchbuffer chaining and make sure it works on the vcs ring and not anywhere else (we can't ever break shipping libva which uses this). - Some tests to trip up your parser should be done, like 3D commands that fall off the end of the batch bo. Or commands that span page boundaries. The later isn't an issue atm since you use vmap, but we should switch to per-page kmap since the vmap overhead is fairly horrible. Good suggestions. I'll look into these. Hi, Brad More inputs from libva about the batchbuffer chaining. Now the batchbuffer chaining is widely used in libva driver. This is related with how the libva driver processes the image. For the encoding purpose, it needs to be handled based on macroblock(16x16).And every macroblock needs a group of GPU commands. So the GPU commands for all the macroblocks will be constructed in the second-level batchbuffer. The mode of batchbuffer chaining will bring the following benefits: a. The size of second-level batch buffer can be allocated based on the size of handled image. For example: 1080p/720p/480p can use the different size. b. The gpu commands in second-level batchbuffer can be constructed by using GPU instead of CPU, which is helpful to improve the performance.
Re: [Intel-gfx] [RFC 00/22] Gen7 batch buffer command parser
On Wed, Nov 27, 2013 at 9:23 AM, Xiang, Haihao haihao.xi...@intel.com wrote: So are these 2nd level batches constructed by the gpu in some cases? That would be fairly horribly to take into account with the batch checker ... It is *not* the 2nd level batch buffer (bit 22 isn't set). Only batch buffer chain is used. That's not really the hard part for the command checker, the important question is whether the gpu generates these batches or whether they're constructed by the cpu. -Daniel -- Daniel Vetter Software Engineer, Intel Corporation +41 (0) 79 365 57 48 - http://blog.ffwll.ch ___ Intel-gfx mailing list Intel-gfx@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/intel-gfx
Re: [Intel-gfx] [RFC 00/22] Gen7 batch buffer command parser
On Wed, 2013-11-27 at 09:31 +0100, Daniel Vetter wrote: On Wed, Nov 27, 2013 at 9:23 AM, Xiang, Haihao haihao.xi...@intel.com wrote: So are these 2nd level batches constructed by the gpu in some cases? That would be fairly horribly to take into account with the batch checker ... It is *not* the 2nd level batch buffer (bit 22 isn't set). Only batch buffer chain is used. That's not really the hard part for the command checker, the important question is whether the gpu generates these batches or whether they're constructed by the cpu. Yes, some batches are generated by GPU, either by EU shaders or by BSD unit (batch buffer for MC on ILK). -Daniel ___ Intel-gfx mailing list Intel-gfx@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/intel-gfx
Re: [Intel-gfx] [RFC 00/22] Gen7 batch buffer command parser
On Wed, Nov 27, 2013 at 04:42:11PM +0800, Xiang, Haihao wrote: On Wed, 2013-11-27 at 09:31 +0100, Daniel Vetter wrote: On Wed, Nov 27, 2013 at 9:23 AM, Xiang, Haihao haihao.xi...@intel.com wrote: So are these 2nd level batches constructed by the gpu in some cases? That would be fairly horribly to take into account with the batch checker ... It is *not* the 2nd level batch buffer (bit 22 isn't set). Only batch buffer chain is used. That's not really the hard part for the command checker, the important question is whether the gpu generates these batches or whether they're constructed by the cpu. Yes, some batches are generated by GPU, either by EU shaders or by BSD unit (batch buffer for MC on ILK). So is ilk the only platform which does that? The command checker is only for gen7+ (and maybe gen6). -Daniel -- Daniel Vetter Software Engineer, Intel Corporation +41 (0) 79 365 57 48 - http://blog.ffwll.ch ___ Intel-gfx mailing list Intel-gfx@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/intel-gfx
Re: [Intel-gfx] [RFC 00/22] Gen7 batch buffer command parser
On Wed, 2013-11-27 at 09:47 +0100, Daniel Vetter wrote: On Wed, Nov 27, 2013 at 04:42:11PM +0800, Xiang, Haihao wrote: On Wed, 2013-11-27 at 09:31 +0100, Daniel Vetter wrote: On Wed, Nov 27, 2013 at 9:23 AM, Xiang, Haihao haihao.xi...@intel.com wrote: So are these 2nd level batches constructed by the gpu in some cases? That would be fairly horribly to take into account with the batch checker ... It is *not* the 2nd level batch buffer (bit 22 isn't set). Only batch buffer chain is used. That's not really the hard part for the command checker, the important question is whether the gpu generates these batches or whether they're constructed by the cpu. Yes, some batches are generated by GPU, either by EU shaders or by BSD unit (batch buffer for MC on ILK). So is ilk the only platform which does that? The command checker is only for gen7+ (and maybe gen6). No. The Ivy/Haswell also uses the GPU to construct the batchbuffer. This is mainly for the optimization of performance. Thanks. Yakui -Daniel ___ Intel-gfx mailing list Intel-gfx@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/intel-gfx
Re: [Intel-gfx] [RFC 00/22] Gen7 batch buffer command parser
On Wed, 2013-11-27 at 09:47 +0100, Daniel Vetter wrote: On Wed, Nov 27, 2013 at 04:42:11PM +0800, Xiang, Haihao wrote: On Wed, 2013-11-27 at 09:31 +0100, Daniel Vetter wrote: On Wed, Nov 27, 2013 at 9:23 AM, Xiang, Haihao haihao.xi...@intel.com wrote: So are these 2nd level batches constructed by the gpu in some cases? That would be fairly horribly to take into account with the batch checker ... It is *not* the 2nd level batch buffer (bit 22 isn't set). Only batch buffer chain is used. That's not really the hard part for the command checker, the important question is whether the gpu generates these batches or whether they're constructed by the cpu. Yes, some batches are generated by GPU, either by EU shaders or by BSD unit (batch buffer for MC on ILK). So is ilk the only platform which does that? The command checker is only for gen7+ (and maybe gen6). No. In libva some batches are generated by BSD unit on ILK. on Gen6+, some batches are constructed by GPU shader. -Daniel ___ Intel-gfx mailing list Intel-gfx@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/intel-gfx
[Intel-gfx] [RFC 00/22] Gen7 batch buffer command parser
From: Brad Volkin bradley.d.vol...@intel.com Certain OpenGL features (e.g. transform feedback, performance monitoring) require userspace code to submit batches containing commands such as MI_LOAD_REGISTER_IMM to access various registers. Unfortunately, some generations of the hardware will noop these commands in unsecure batches (which includes all userspace batches submitted via i915) even though the commands may be safe and represent the intended programming model of the device. This series introduces a software command parser similar in operation to the command parsing done in hardware for unsecure batches. However, the software parser allows some operations that would be noop'd by hardware, if the parser determines the operation is safe, and submits the batch as secure to prevent hardware parsing. Currently the series implements this on IVB and HSW. The series is divided into several phases: patches 01-09: These implement infrastructure and the command parsing algorithm, all behind a module parameter. I expect some discussion and rework, but hopefully there's nothing too controversial. patches 10-17: These define the checks performed by the parser. I expect much discussion :) patches 18-20: In a final pass over the command checks, I found some issues with the definitions. They looked painful to rebase in, so I've added them here. patches 21-22: These enable the parser by default. It runs on all batches except those that set the I915_EXEC_SECURE flag in the execbuffer2 call. There are follow-up patches to libdrm and to i-g-t. The i-g-t tests are very basic and do not test all of the commands used by the parser on the assumption that I'm likely to make the same mistakes in both the parser and the test. I've run the i-g-t gem_* tests, the piglit quick tests (w/Mesa git from a few days ago), and generally used an Ubuntu 13.10 IVB system with the parser running. Aside from a failure described below, I don't think there are any regressions. That is, piglit claims some regressions, but from manually running the tests I think these are false positives. However, I could use help in getting broader testing, particularly around performance. In general, I see less than 3% performance impact on HSW, with more like 10% impact for pathological batch sizes. But we'll certainly want to run relevant benchmarks beyond what I've done. At this point there are a couple of known issues and potential improvements. 1) VLV. The parser is currently disabled for VLV. One type of check performed by the parser is that commands which access memory do so via PPGTT. VLV does not have PPGTT enabled at this time. I chose to implement the PPGTT checks via generic bit checking infrastructure in the parser, so they are not easily disabled for VLV. For now, I'm disabling parsing altogether in the hope that PPGTT can be enabled for VLV in the near future. 2) Coherency. I've found two types of coherency issues when reading the batch buffer from the CPU during execbuffer2. Looking for help with both issues. i. First, the i-g-t test gem_cpu_reloc blits to a batch buffer and the parser isn't properly waiting for the operation to complete before parsing. I tried adding i915_gem_object_sync(batch_obj, [ring|NULL]) but that actually caused more failures. ii. Second, on VLV, I've seen cache coherency issues when userspace writes the batch via pwrite fast path before calling execbuffer2. The parser reads stale data. This works fine on IVB and HSW, so I believe it's an LLC vs. non-LLC issue. I'm just unclear on what the correct flushing or synchronization is for this scenario. 3) 2nd-level batches. The parser currently allows MI_BATCH_BUFFER_START commands in userspace batches without parsing them. The media driver uses 2nd-level batches, so a solution is required. I have some ideas but don't want to delay the review process for what I have so far. It may be that the 2nd-level parsing is only needed for VCS and the current code (plus rejecting BBS) would be sufficient for RCS. 4) Command buffer copy. To avoid CPU modifications to buffers after parsing, and to avoid GPU modifications to buffers via EUs or commands in the batch, we should copy the userspace batch buffer to memory that userspace does not have access to, map it into GGTT, and execute that batch buffer. I have a sense of how to do this for 1st-level batches, but it would need changes to tie in with the 2nd-level batch parsing I think, so I've again held off. Brad Volkin (22): drm/i915: Add data structures for command parser drm/i915: Initial command parser table definitions drm/i915: Hook command parser tables up to rings drm/i915: Add per-ring command length decode functions drm/i915: Implement command parsing drm/i915: Add a HAS_CMD_PARSER getparam drm/i915: Add support for
Re: [Intel-gfx] [RFC 00/22] Gen7 batch buffer command parser
Hi Brad, On Tue, Nov 26, 2013 at 08:51:17AM -0800, bradley.d.vol...@intel.com wrote: From: Brad Volkin bradley.d.vol...@intel.com Certain OpenGL features (e.g. transform feedback, performance monitoring) require userspace code to submit batches containing commands such as MI_LOAD_REGISTER_IMM to access various registers. Unfortunately, some generations of the hardware will noop these commands in unsecure batches (which includes all userspace batches submitted via i915) even though the commands may be safe and represent the intended programming model of the device. This series introduces a software command parser similar in operation to the command parsing done in hardware for unsecure batches. However, the software parser allows some operations that would be noop'd by hardware, if the parser determines the operation is safe, and submits the batch as secure to prevent hardware parsing. Currently the series implements this on IVB and HSW. The series is divided into several phases: patches 01-09: These implement infrastructure and the command parsing algorithm, all behind a module parameter. I expect some discussion and rework, but hopefully there's nothing too controversial. patches 10-17: These define the checks performed by the parser. I expect much discussion :) patches 18-20: In a final pass over the command checks, I found some issues with the definitions. They looked painful to rebase in, so I've added them here. patches 21-22: These enable the parser by default. It runs on all batches except those that set the I915_EXEC_SECURE flag in the execbuffer2 call. I think long-term we should even scan secure batches. We'd need to allow some registers which only the drm master (i.e. owner of the display hardware) is allowed to do, e.g. for scanline waits. But once we have that we should be able to port all current users of secure batches over to scanned batches and so enforce this everywhere by default. The other issue is that igt tests assume to be able to run some evil tests, so maybe we don't actually want this. There are follow-up patches to libdrm and to i-g-t. The i-g-t tests are very basic and do not test all of the commands used by the parser on the assumption that I'm likely to make the same mistakes in both the parser and the test. Yeah, I agree that just checking whether commands all go through (or not) as expected adds very little value on top of the few tests you have done. I think we should take a look at some corner cases which might trip up your checker a bit though: - I think we should check batchbuffer chaining and make sure it works on the vcs ring and not anywhere else (we can't ever break shipping libva which uses this). - Some tests to trip up your parser should be done, like 3D commands that fall off the end of the batch bo. Or commands that span page boundaries. The later isn't an issue atm since you use vmap, but we should switch to per-page kmap since the vmap overhead is fairly horrible. I've run the i-g-t gem_* tests, the piglit quick tests (w/Mesa git from a few days ago), and generally used an Ubuntu 13.10 IVB system with the parser running. Aside from a failure described below, I don't think there are any regressions. That is, piglit claims some regressions, but from manually running the tests I think these are false positives. However, I could use help in getting broader testing, particularly around performance. In general, I see less than 3% performance impact on HSW, with more like 10% impact for pathological batch sizes. But we'll certainly want to run relevant benchmarks beyond what I've done. Yeah, a microbenchmark that just shovels MI_NOP batches of various sizes through the checker and bypassing it (with EXEC_SECURE) would be really good. Maybe even some variable-sized commands (all the state setup stuff should be useful for that) to keep things interesting. Some variation is also important to have some good cache thrasing going on (since your check tables are fairly large I think). At this point there are a couple of known issues and potential improvements. 1) VLV. The parser is currently disabled for VLV. One type of check performed by the parser is that commands which access memory do so via PPGTT. VLV does not have PPGTT enabled at this time. I chose to implement the PPGTT checks via generic bit checking infrastructure in the parser, so they are not easily disabled for VLV. For now, I'm disabling parsing altogether in the hope that PPGTT can be enabled for VLV in the near future. We need ppgtt for the parser anyway, since otherwise userspace can submit a self-modifying batch. Checking for that is impossible, so allowing sw checked batches without the ppgtt/ggtt split would be a decent security hole. 2) Coherency. I've found two types of coherency issues when reading the
Re: [Intel-gfx] [RFC 00/22] Gen7 batch buffer command parser
On Tue, Nov 26, 2013 at 11:35:38AM -0800, Daniel Vetter wrote: Hi Brad, On Tue, Nov 26, 2013 at 08:51:17AM -0800, bradley.d.vol...@intel.com wrote: From: Brad Volkin bradley.d.vol...@intel.com Certain OpenGL features (e.g. transform feedback, performance monitoring) require userspace code to submit batches containing commands such as MI_LOAD_REGISTER_IMM to access various registers. Unfortunately, some generations of the hardware will noop these commands in unsecure batches (which includes all userspace batches submitted via i915) even though the commands may be safe and represent the intended programming model of the device. This series introduces a software command parser similar in operation to the command parsing done in hardware for unsecure batches. However, the software parser allows some operations that would be noop'd by hardware, if the parser determines the operation is safe, and submits the batch as secure to prevent hardware parsing. Currently the series implements this on IVB and HSW. The series is divided into several phases: patches 01-09: These implement infrastructure and the command parsing algorithm, all behind a module parameter. I expect some discussion and rework, but hopefully there's nothing too controversial. patches 10-17: These define the checks performed by the parser. I expect much discussion :) patches 18-20: In a final pass over the command checks, I found some issues with the definitions. They looked painful to rebase in, so I've added them here. patches 21-22: These enable the parser by default. It runs on all batches except those that set the I915_EXEC_SECURE flag in the execbuffer2 call. I think long-term we should even scan secure batches. We'd need to allow some registers which only the drm master (i.e. owner of the display hardware) is allowed to do, e.g. for scanline waits. But once we have that we should be able to port all current users of secure batches over to scanned batches and so enforce this everywhere by default. The other issue is that igt tests assume to be able to run some evil tests, so maybe we don't actually want this. Agreed. I thought we could handle this as a follow-up task once the basic stuff is in place, particularly given that we'd want to modify at least some users to test. I also wasn't sure if we would want the check to be root master, as in the current secure flag, or just master. W.r.t. the tests, I suppose we can just turn checking on for secure batches and see what happens. There are follow-up patches to libdrm and to i-g-t. The i-g-t tests are very basic and do not test all of the commands used by the parser on the assumption that I'm likely to make the same mistakes in both the parser and the test. Yeah, I agree that just checking whether commands all go through (or not) as expected adds very little value on top of the few tests you have done. I think we should take a look at some corner cases which might trip up your checker a bit though: - I think we should check batchbuffer chaining and make sure it works on the vcs ring and not anywhere else (we can't ever break shipping libva which uses this). - Some tests to trip up your parser should be done, like 3D commands that fall off the end of the batch bo. Or commands that span page boundaries. The later isn't an issue atm since you use vmap, but we should switch to per-page kmap since the vmap overhead is fairly horrible. Good suggestions. I'll look into these. I've run the i-g-t gem_* tests, the piglit quick tests (w/Mesa git from a few days ago), and generally used an Ubuntu 13.10 IVB system with the parser running. Aside from a failure described below, I don't think there are any regressions. That is, piglit claims some regressions, but from manually running the tests I think these are false positives. However, I could use help in getting broader testing, particularly around performance. In general, I see less than 3% performance impact on HSW, with more like 10% impact for pathological batch sizes. But we'll certainly want to run relevant benchmarks beyond what I've done. Yeah, a microbenchmark that just shovels MI_NOP batches of various sizes through the checker and bypassing it (with EXEC_SECURE) would be really good. Maybe even some variable-sized commands (all the state setup stuff should be useful for that) to keep things interesting. Some variation is also important to have some good cache thrasing going on (since your check tables are fairly large I think). Ok. I'd be interested in some comment from the mesa and libva guys here for real world workloads, but a microbenchmark would be a good start. Which state setup stuff are you referring to? Something specific in i-g-t or something more general? At this point
Re: [Intel-gfx] [RFC 00/22] Gen7 batch buffer command parser
On Tue, 2013-11-26 at 20:35 +0100, Daniel Vetter wrote: Hi Brad, On Tue, Nov 26, 2013 at 08:51:17AM -0800, bradley.d.vol...@intel.com wrote: From: Brad Volkin bradley.d.vol...@intel.com Certain OpenGL features (e.g. transform feedback, performance monitoring) require userspace code to submit batches containing commands such as MI_LOAD_REGISTER_IMM to access various registers. Unfortunately, some generations of the hardware will noop these commands in unsecure batches (which includes all userspace batches submitted via i915) even though the commands may be safe and represent the intended programming model of the device. This series introduces a software command parser similar in operation to the command parsing done in hardware for unsecure batches. However, the software parser allows some operations that would be noop'd by hardware, if the parser determines the operation is safe, and submits the batch as secure to prevent hardware parsing. Currently the series implements this on IVB and HSW. The series is divided into several phases: patches 01-09: These implement infrastructure and the command parsing algorithm, all behind a module parameter. I expect some discussion and rework, but hopefully there's nothing too controversial. patches 10-17: These define the checks performed by the parser. I expect much discussion :) patches 18-20: In a final pass over the command checks, I found some issues with the definitions. They looked painful to rebase in, so I've added them here. patches 21-22: These enable the parser by default. It runs on all batches except those that set the I915_EXEC_SECURE flag in the execbuffer2 call. I think long-term we should even scan secure batches. We'd need to allow some registers which only the drm master (i.e. owner of the display hardware) is allowed to do, e.g. for scanline waits. But once we have that we should be able to port all current users of secure batches over to scanned batches and so enforce this everywhere by default. The other issue is that igt tests assume to be able to run some evil tests, so maybe we don't actually want this. There are follow-up patches to libdrm and to i-g-t. The i-g-t tests are very basic and do not test all of the commands used by the parser on the assumption that I'm likely to make the same mistakes in both the parser and the test. Yeah, I agree that just checking whether commands all go through (or not) as expected adds very little value on top of the few tests you have done. I think we should take a look at some corner cases which might trip up your checker a bit though: - I think we should check batchbuffer chaining and make sure it works on the vcs ring and not anywhere else (we can't ever break shipping libva which uses this). Besides the vcs ring, we also use batchbuffer chaining on the render ring for video post processing, video motion estimation and motion compensation(on ILK), A fixed length batch buffer isn't suitable for those operations as those operations are based on a macroblock instead of a frame. It would be better to make sure batchbuffer chaining works on the render ring too. - Some tests to trip up your parser should be done, like 3D commands that fall off the end of the batch bo. Or commands that span page boundaries. The later isn't an issue atm since you use vmap, but we should switch to per-page kmap since the vmap overhead is fairly horrible. I've run the i-g-t gem_* tests, the piglit quick tests (w/Mesa git from a few days ago), and generally used an Ubuntu 13.10 IVB system with the parser running. Aside from a failure described below, I don't think there are any regressions. That is, piglit claims some regressions, but from manually running the tests I think these are false positives. However, I could use help in getting broader testing, particularly around performance. In general, I see less than 3% performance impact on HSW, with more like 10% impact for pathological batch sizes. But we'll certainly want to run relevant benchmarks beyond what I've done. Yeah, a microbenchmark that just shovels MI_NOP batches of various sizes through the checker and bypassing it (with EXEC_SECURE) would be really good. Maybe even some variable-sized commands (all the state setup stuff should be useful for that) to keep things interesting. Some variation is also important to have some good cache thrasing going on (since your check tables are fairly large I think). At this point there are a couple of known issues and potential improvements. 1) VLV. The parser is currently disabled for VLV. One type of check performed by the parser is that commands which access memory do so via PPGTT. VLV does not have PPGTT enabled at this time. I chose to implement the
Re: [Intel-gfx] [RFC 00/22] Gen7 batch buffer command parser
On Tue, 2013-11-26 at 13:24 -0700, Volkin, Bradley D wrote: On Tue, Nov 26, 2013 at 11:35:38AM -0800, Daniel Vetter wrote: Hi Brad, On Tue, Nov 26, 2013 at 08:51:17AM -0800, bradley.d.vol...@intel.com wrote: From: Brad Volkin bradley.d.vol...@intel.com Certain OpenGL features (e.g. transform feedback, performance monitoring) require userspace code to submit batches containing commands such as MI_LOAD_REGISTER_IMM to access various registers. Unfortunately, some generations of the hardware will noop these commands in unsecure batches (which includes all userspace batches submitted via i915) even though the commands may be safe and represent the intended programming model of the device. This series introduces a software command parser similar in operation to the command parsing done in hardware for unsecure batches. However, the software parser allows some operations that would be noop'd by hardware, if the parser determines the operation is safe, and submits the batch as secure to prevent hardware parsing. Currently the series implements this on IVB and HSW. The series is divided into several phases: patches 01-09: These implement infrastructure and the command parsing algorithm, all behind a module parameter. I expect some discussion and rework, but hopefully there's nothing too controversial. patches 10-17: These define the checks performed by the parser. I expect much discussion :) patches 18-20: In a final pass over the command checks, I found some issues with the definitions. They looked painful to rebase in, so I've added them here. patches 21-22: These enable the parser by default. It runs on all batches except those that set the I915_EXEC_SECURE flag in the execbuffer2 call. I think long-term we should even scan secure batches. We'd need to allow some registers which only the drm master (i.e. owner of the display hardware) is allowed to do, e.g. for scanline waits. But once we have that we should be able to port all current users of secure batches over to scanned batches and so enforce this everywhere by default. The other issue is that igt tests assume to be able to run some evil tests, so maybe we don't actually want this. Agreed. I thought we could handle this as a follow-up task once the basic stuff is in place, particularly given that we'd want to modify at least some users to test. I also wasn't sure if we would want the check to be root master, as in the current secure flag, or just master. W.r.t. the tests, I suppose we can just turn checking on for secure batches and see what happens. There are follow-up patches to libdrm and to i-g-t. The i-g-t tests are very basic and do not test all of the commands used by the parser on the assumption that I'm likely to make the same mistakes in both the parser and the test. Yeah, I agree that just checking whether commands all go through (or not) as expected adds very little value on top of the few tests you have done. I think we should take a look at some corner cases which might trip up your checker a bit though: - I think we should check batchbuffer chaining and make sure it works on the vcs ring and not anywhere else (we can't ever break shipping libva which uses this). - Some tests to trip up your parser should be done, like 3D commands that fall off the end of the batch bo. Or commands that span page boundaries. The later isn't an issue atm since you use vmap, but we should switch to per-page kmap since the vmap overhead is fairly horrible. Good suggestions. I'll look into these. Hi, Brad More inputs from libva about the batchbuffer chaining. Now the batchbuffer chaining is widely used in libva driver. This is related with how the libva driver processes the image. For the encoding purpose, it needs to be handled based on macroblock(16x16).And every macroblock needs a group of GPU commands. So the GPU commands for all the macroblocks will be constructed in the second-level batchbuffer. The mode of batchbuffer chaining will bring the following benefits: a. The size of second-level batch buffer can be allocated based on the size of handled image. For example: 1080p/720p/480p can use the different size. b. The gpu commands in second-level batchbuffer can be constructed by using GPU instead of CPU, which is helpful to improve the performance. At the same time both VCS and Render Ring are used in libva driver. For example: The encoding will use VCS and RCS ring. Firstly the RCS ring is used to execute GPU command for the motion vector/mode prediction. And then the VCS Ring is used to execute the GPU command for generating the bit-stream. So not only VCS ring uses the mode of batchbuffer chaining, but also the Render