Re: [Intel-gfx] [RFC 00/22] Gen7 batch buffer command parser

2014-02-05 Thread Chris Wilson
On Tue, Nov 26, 2013 at 08:51:17AM -0800, bradley.d.vol...@intel.com wrote:
 From: Brad Volkin bradley.d.vol...@intel.com
 
 Certain OpenGL features (e.g. transform feedback, performance monitoring)
 require userspace code to submit batches containing commands such as
 MI_LOAD_REGISTER_IMM to access various registers. Unfortunately, some
 generations of the hardware will noop these commands in unsecure batches
 (which includes all userspace batches submitted via i915) even though the
 commands may be safe and represent the intended programming model of the 
 device.
 
 This series introduces a software command parser similar in operation to the
 command parsing done in hardware for unsecure batches. However, the software
 parser allows some operations that would be noop'd by hardware, if the parser
 determines the operation is safe, and submits the batch as secure to prevent
 hardware parsing. Currently the series implements this on IVB and HSW.

Just one more question... Do you have a branch for people to test?
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre
___
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx


Re: [Intel-gfx] [RFC 00/22] Gen7 batch buffer command parser

2014-02-05 Thread Chris Wilson
On Wed, Feb 05, 2014 at 10:18:44AM -0800, Volkin, Bradley D wrote:
 On Wed, Feb 05, 2014 at 02:28:29AM -0800, Chris Wilson wrote:
  On Tue, Nov 26, 2013 at 08:51:17AM -0800, bradley.d.vol...@intel.com wrote:
   From: Brad Volkin bradley.d.vol...@intel.com
   
   Certain OpenGL features (e.g. transform feedback, performance monitoring)
   require userspace code to submit batches containing commands such as
   MI_LOAD_REGISTER_IMM to access various registers. Unfortunately, some
   generations of the hardware will noop these commands in unsecure batches
   (which includes all userspace batches submitted via i915) even though the
   commands may be safe and represent the intended programming model of the 
   device.
   
   This series introduces a software command parser similar in operation to 
   the
   command parsing done in hardware for unsecure batches. However, the 
   software
   parser allows some operations that would be noop'd by hardware, if the 
   parser
   determines the operation is safe, and submits the batch as secure to 
   prevent
   hardware parsing. Currently the series implements this on IVB and HSW.
  
  Just one more question... Do you have a branch for people to test?
 
 Not at the moment. And as mentioned in the v2 cover letter, it's actually not
 particularly testable (or mergeable for that matter) right now because of a
 regression in secure dispatch on nightly.

At this moment, I just want to be sure that the fixed dispatch overhead has
been minimised.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre
___
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx


Re: [Intel-gfx] [RFC 00/22] Gen7 batch buffer command parser

2014-02-05 Thread Volkin, Bradley D
On Wed, Feb 05, 2014 at 02:28:29AM -0800, Chris Wilson wrote:
 On Tue, Nov 26, 2013 at 08:51:17AM -0800, bradley.d.vol...@intel.com wrote:
  From: Brad Volkin bradley.d.vol...@intel.com
  
  Certain OpenGL features (e.g. transform feedback, performance monitoring)
  require userspace code to submit batches containing commands such as
  MI_LOAD_REGISTER_IMM to access various registers. Unfortunately, some
  generations of the hardware will noop these commands in unsecure batches
  (which includes all userspace batches submitted via i915) even though the
  commands may be safe and represent the intended programming model of the 
  device.
  
  This series introduces a software command parser similar in operation to the
  command parsing done in hardware for unsecure batches. However, the software
  parser allows some operations that would be noop'd by hardware, if the 
  parser
  determines the operation is safe, and submits the batch as secure to 
  prevent
  hardware parsing. Currently the series implements this on IVB and HSW.
 
 Just one more question... Do you have a branch for people to test?

Not at the moment. And as mentioned in the v2 cover letter, it's actually not
particularly testable (or mergeable for that matter) right now because of a
regression in secure dispatch on nightly.

 -Chris
 
 -- 
 Chris Wilson, Intel Open Source Technology Centre
___
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx


Re: [Intel-gfx] [RFC 00/22] Gen7 batch buffer command parser

2014-02-05 Thread Daniel Vetter
On Wed, Feb 5, 2014 at 7:18 PM, Volkin, Bradley D
bradley.d.vol...@intel.com wrote:
 On Wed, Feb 05, 2014 at 02:28:29AM -0800, Chris Wilson wrote:
 On Tue, Nov 26, 2013 at 08:51:17AM -0800, bradley.d.vol...@intel.com wrote:
  From: Brad Volkin bradley.d.vol...@intel.com
 
  Certain OpenGL features (e.g. transform feedback, performance monitoring)
  require userspace code to submit batches containing commands such as
  MI_LOAD_REGISTER_IMM to access various registers. Unfortunately, some
  generations of the hardware will noop these commands in unsecure batches
  (which includes all userspace batches submitted via i915) even though the
  commands may be safe and represent the intended programming model of the 
  device.
 
  This series introduces a software command parser similar in operation to 
  the
  command parsing done in hardware for unsecure batches. However, the 
  software
  parser allows some operations that would be noop'd by hardware, if the 
  parser
  determines the operation is safe, and submits the batch as secure to 
  prevent
  hardware parsing. Currently the series implements this on IVB and HSW.

 Just one more question... Do you have a branch for people to test?

 Not at the moment. And as mentioned in the v2 cover letter, it's actually not
 particularly testable (or mergeable for that matter) right now because of a
 regression in secure dispatch on nightly.

The command parser itself should still work, even with the regression
in -nightly. The copying and secure dispatch are obviously fail atm.
That still leaves regression testing of current userspace and
micro-optimizing the checker itself as possible things to do. Otoh not
sure what exactly Chris wanted to test.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch
___
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx


Re: [Intel-gfx] [RFC 00/22] Gen7 batch buffer command parser

2014-02-05 Thread Volkin, Bradley D
On Wed, Feb 05, 2014 at 10:30:00AM -0800, Daniel Vetter wrote:
 On Wed, Feb 5, 2014 at 7:18 PM, Volkin, Bradley D
 bradley.d.vol...@intel.com wrote:
  On Wed, Feb 05, 2014 at 02:28:29AM -0800, Chris Wilson wrote:
  On Tue, Nov 26, 2013 at 08:51:17AM -0800, bradley.d.vol...@intel.com wrote:
   From: Brad Volkin bradley.d.vol...@intel.com
  
   Certain OpenGL features (e.g. transform feedback, performance monitoring)
   require userspace code to submit batches containing commands such as
   MI_LOAD_REGISTER_IMM to access various registers. Unfortunately, some
   generations of the hardware will noop these commands in unsecure 
   batches
   (which includes all userspace batches submitted via i915) even though the
   commands may be safe and represent the intended programming model of the 
   device.
  
   This series introduces a software command parser similar in operation to 
   the
   command parsing done in hardware for unsecure batches. However, the 
   software
   parser allows some operations that would be noop'd by hardware, if the 
   parser
   determines the operation is safe, and submits the batch as secure to 
   prevent
   hardware parsing. Currently the series implements this on IVB and HSW.
 
  Just one more question... Do you have a branch for people to test?
 
  Not at the moment. And as mentioned in the v2 cover letter, it's actually 
  not
  particularly testable (or mergeable for that matter) right now because of a
  regression in secure dispatch on nightly.
 
 The command parser itself should still work, even with the regression
 in -nightly. The copying and secure dispatch are obviously fail atm.
 That still leaves regression testing of current userspace and
 micro-optimizing the checker itself as possible things to do. Otoh not
 sure what exactly Chris wanted to test.

To test/merge, we'd have to change the series to take out the part where
patch 02/13 sets I915_DISPATCH_SECURE to avoid a BUG_ON() when 
i915.enable_cmd_parser=1.
But yes, otherwise the parsing works and I think should be sufficient for
what Chris indicated he wants to test.

- Brad

 -Daniel
 -- 
 Daniel Vetter
 Software Engineer, Intel Corporation
 +41 (0) 79 365 57 48 - http://blog.ffwll.ch
___
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx


Re: [Intel-gfx] [RFC 00/22] Gen7 batch buffer command parser

2014-02-05 Thread Daniel Vetter
On Wed, Feb 5, 2014 at 8:00 PM, Volkin, Bradley D
bradley.d.vol...@intel.com wrote:
 To test/merge, we'd have to change the series to take out the part where
 patch 02/13 sets I915_DISPATCH_SECURE to avoid a BUG_ON() when 
 i915.enable_cmd_parser=1.
 But yes, otherwise the parsing works and I think should be sufficient for
 what Chris indicated he wants to test.

Oh, I didn't spot this but this needs to be moved way back in the
series - we can only set the bit once we have the batchbuffer copy
logic in place. Otherwise there's a security hole open since userspace
is free to frob the batch residing in the ppgtt, which we just can't
prevent.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch
___
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx


Re: [Intel-gfx] [RFC 00/22] Gen7 batch buffer command parser

2014-02-05 Thread Volkin, Bradley D
On Wed, Feb 05, 2014 at 11:17:25AM -0800, Daniel Vetter wrote:
 On Wed, Feb 5, 2014 at 8:00 PM, Volkin, Bradley D
 bradley.d.vol...@intel.com wrote:
  To test/merge, we'd have to change the series to take out the part where
  patch 02/13 sets I915_DISPATCH_SECURE to avoid a BUG_ON() when 
  i915.enable_cmd_parser=1.
  But yes, otherwise the parsing works and I think should be sufficient for
  what Chris indicated he wants to test.
 
 Oh, I didn't spot this but this needs to be moved way back in the
 series - we can only set the bit once we have the batchbuffer copy
 logic in place. Otherwise there's a security hole open since userspace
 is free to frob the batch residing in the ppgtt, which we just can't
 prevent.

Good point. I'll take it out and we can add it as part of the batch copy work.

 -Daniel
 -- 
 Daniel Vetter
 Software Engineer, Intel Corporation
 +41 (0) 79 365 57 48 - http://blog.ffwll.ch
___
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx


Re: [Intel-gfx] [RFC 00/22] Gen7 batch buffer command parser

2013-12-11 Thread Daniel Vetter
On Tue, Dec 10, 2013 at 04:58:18PM -0800, Volkin, Bradley D wrote:
 So, I have a functioning kmap_atomic based parser using an sg_mapping_iter, 
 and in the
 tests I'm running, it's worse than the vmap approach. This is still without 
 the batch
 copy, but I think it's relevant anyhow. I haven't done much in the way of 
 analysis as to
 why we're getting these results. At a high level, the vmap approach leads to 
 simple
 code with a few function calls and control structures. The per-page approach 
 has
 somewhat more complex logic around mapping the next page at the right time 
 and checking
 for commands that cross a page boundary. I'd guess that difference factors 
 into it.

 Due to the risk of regressions, I think it would be better not to delay 
 getting
 broader functional and performance testing by waiting to sort this out. I'd 
 rather
 stick with vmap for now and revisit that overhead once we have more concrete
 performance data to work from.

Yeah, makes sense. Just to check: Was that on hsw with llc coherency or on
vlv? Might be that the clflushing shifts the picture around a bit.

 I'll propose that I send an updated series later this week or next that 
 addresses
 feedback from the review, including better handling for secure and chained 
 batches,
 the sync fixes for gem_cpu_reloc, some of the additional tests you suggested,
 and possibly an attempt at batch copy. If that goes well, could we see about
 getting the patches into the hands of your QA team for further testing?

We unfortunately don't really have tons of spare cycles from our QA team
for testing branches (pretty much none actually), so the usual approach is
to review and merge patches without first going through QA. If we pull in
your new i-g-ts first we should have decent assurance that nothing blows
up. And since kernel patch series should always be fully bisectable we can
stop at any point in time if something goes wrong.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch
___
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx


Re: [Intel-gfx] [RFC 00/22] Gen7 batch buffer command parser

2013-12-11 Thread Volkin, Bradley D
On Wed, Dec 11, 2013 at 01:54:40AM -0800, Daniel Vetter wrote:
 On Tue, Dec 10, 2013 at 04:58:18PM -0800, Volkin, Bradley D wrote:
  So, I have a functioning kmap_atomic based parser using an sg_mapping_iter, 
  and in the
  tests I'm running, it's worse than the vmap approach. This is still without 
  the batch
  copy, but I think it's relevant anyhow. I haven't done much in the way of 
  analysis as to
  why we're getting these results. At a high level, the vmap approach leads 
  to simple
  code with a few function calls and control structures. The per-page 
  approach has
  somewhat more complex logic around mapping the next page at the right time 
  and checking
  for commands that cross a page boundary. I'd guess that difference factors 
  into it.
 
  Due to the risk of regressions, I think it would be better not to delay 
  getting
  broader functional and performance testing by waiting to sort this out. I'd 
  rather
  stick with vmap for now and revisit that overhead once we have more concrete
  performance data to work from.
 
 Yeah, makes sense. Just to check: Was that on hsw with llc coherency or on
 vlv? Might be that the clflushing shifts the picture around a bit.

IVB and HSW. There's now a wait_rendering() call that should cover the 
gem_cpu_reloc
case. I haven't had a chance to go back and test on VLV to see if the 
clflushing helps
with the other coherency issue.

 
  I'll propose that I send an updated series later this week or next that 
  addresses
  feedback from the review, including better handling for secure and chained 
  batches,
  the sync fixes for gem_cpu_reloc, some of the additional tests you 
  suggested,
  and possibly an attempt at batch copy. If that goes well, could we see about
  getting the patches into the hands of your QA team for further testing?
 
 We unfortunately don't really have tons of spare cycles from our QA team
 for testing branches (pretty much none actually), so the usual approach is
 to review and merge patches without first going through QA. If we pull in
 your new i-g-ts first we should have decent assurance that nothing blows
 up. And since kernel patch series should always be fully bisectable we can
 stop at any point in time if something goes wrong.

Ok, sounds good. I'm fine with whatever approach gets us the test coverage 
soonest.

Brad

 -Daniel
 -- 
 Daniel Vetter
 Software Engineer, Intel Corporation
 +41 (0) 79 365 57 48 - http://blog.ffwll.ch
___
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx


Re: [Intel-gfx] [RFC 00/22] Gen7 batch buffer command parser

2013-12-11 Thread Daniel Vetter
On Wed, Dec 11, 2013 at 7:04 PM, Volkin, Bradley D
bradley.d.vol...@intel.com wrote:
 We unfortunately don't really have tons of spare cycles from our QA team
 for testing branches (pretty much none actually), so the usual approach is
 to review and merge patches without first going through QA. If we pull in
 your new i-g-ts first we should have decent assurance that nothing blows
 up. And since kernel patch series should always be fully bisectable we can
 stop at any point in time if something goes wrong.

 Ok, sounds good. I'm fine with whatever approach gets us the test coverage 
 soonest.

One thing that's always important is to get tangential prep work to
the beginning of your patch series as much as possible. That way we
can mergetest those glue patches (and their impact on the rest of the
driver) even when review is blocked on some contentious topic that
affects the core of a new feature.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch
___
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx


Re: [Intel-gfx] [RFC 00/22] Gen7 batch buffer command parser

2013-12-10 Thread Volkin, Bradley D
[snip]

On Tue, Nov 26, 2013 at 11:35:38AM -0800, Daniel Vetter wrote:
  2) Coherency. I've found two types of coherency issues when reading the 
  batch
 buffer from the CPU during execbuffer2. Looking for help with both 
  issues.
  i. First, the i-g-t test gem_cpu_reloc blits to a batch buffer and the
 parser isn't properly waiting for the operation to complete before
 parsing. I tried adding i915_gem_object_sync(batch_obj, [ring|NULL])
 but that actually caused more failures.
 
 This synchronization should happen when processing the relocations. The
 batch itself isn't modified by the gpu, we simply upload it using the
 blitter. So this going wrong indicates there's some issue somewhere ...

I looked at this again. The blitter copy uploads a batch full of noops and there
are no relocations associated with the failing batch, as far as I can tell. I
suspect that would explain why relocation handling wouldn't catch this. In any 
case,
I copied the relevant setup code from shmem pread, and it resolved the issue.

 
 
 ii. Second, on VLV, I've seen cache coherency issues when userspace 
  writes
 the batch via pwrite fast path before calling execbuffer2. The parser
 reads stale data. This works fine on IVB and HSW, so I believe it's 
  an
 LLC vs. non-LLC issue. I'm just unclear on what the correct flushing 
  or
 synchronization is for this scenario.
 
 Imo we take a good look at the optimized buffer read/write code from
 i915_gem_shmem_pread (for reading the userspace batch) and
 i915_gem_shmem_pwrite (for writing to the checked buffer). If we do the
 checking in-line with the reading this should also bring down the overhead
 massively compared to the current solution (those shmem read/write
 functions are fairly optimized).

So, I have a functioning kmap_atomic based parser using an sg_mapping_iter, and 
in the
tests I'm running, it's worse than the vmap approach. This is still without the 
batch
copy, but I think it's relevant anyhow. I haven't done much in the way of 
analysis as to
why we're getting these results. At a high level, the vmap approach leads to 
simple
code with a few function calls and control structures. The per-page approach has
somewhat more complex logic around mapping the next page at the right time and 
checking
for commands that cross a page boundary. I'd guess that difference factors into 
it.

Due to the risk of regressions, I think it would be better not to delay getting
broader functional and performance testing by waiting to sort this out. I'd 
rather
stick with vmap for now and revisit that overhead once we have more concrete
performance data to work from.

I'll propose that I send an updated series later this week or next that 
addresses
feedback from the review, including better handling for secure and chained 
batches,
the sync fixes for gem_cpu_reloc, some of the additional tests you suggested,
and possibly an attempt at batch copy. If that goes well, could we see about
getting the patches into the hands of your QA team for further testing?

Brad

 
  3) 2nd-level batches. The parser currently allows MI_BATCH_BUFFER_START 
  commands
 in userspace batches without parsing them. The media driver uses 
  2nd-level
 batches, so a solution is required. I have some ideas but don't want to 
  delay
 the review process for what I have so far. It may be that the 2nd-level
 parsing is only needed for VCS and the current code (plus rejecting BBS)
 would be sufficient for RCS.
 
 Afaik only libva uses second-level batches, and only on the vcs. So I hope
 we can just submit those as unpriviledged batches if possible. If that's
 not possible it'll get fairly ugly I fear :(
 
  4) Command buffer copy. To avoid CPU modifications to buffers after 
  parsing, and
 to avoid GPU modifications to buffers via EUs or commands in the batch, 
  we
 should copy the userspace batch buffer to memory that userspace does not
 have access to, map it into GGTT, and execute that batch buffer. I have a
 sense of how to do this for 1st-level batches, but it would need changes 
  to
 tie in with the 2nd-level batch parsing I think, so I've again held off.
 
 Yeah, we need the copying for otherwise the parsing is fairly pointless.
 I've stumbled over some of your internally patches and had a quick look at
 them. Two high-level comments:
 
 - Using the existing active buffer lru instead of manual pinning would
   better integrate with the eviction code. For an example of in-kernel
   objects (and not userspace objects) using this look at the hw context
   code.
 - Imo we should tag all buffers as purgeable while they're in the cache.
   That way the shrinker will automatically drop the backing storage if
   memory runs low (and thanks to the active list lru only when the gpu has
   stopped processing the batch). That way we can just keep on allocating
   buffers if they're all busy without any concern 

Re: [Intel-gfx] [RFC 00/22] Gen7 batch buffer command parser

2013-12-05 Thread Volkin, Bradley D
On Tue, Nov 26, 2013 at 12:24:14PM -0800, Volkin, Bradley D wrote:
 On Tue, Nov 26, 2013 at 11:35:38AM -0800, Daniel Vetter wrote:
  I think long-term we should even scan secure batches. We'd need to allow
  some registers which only the drm master (i.e. owner of the display
  hardware) is allowed to do, e.g. for scanline waits. But once we have that
  we should be able to port all current users of secure batches over to
  scanned batches and so enforce this everywhere by default.
  
  The other issue is that igt tests assume to be able to run some evil
  tests, so maybe we don't actually want this.
 
 Agreed. I thought we could handle this as a follow-up task once the basic 
 stuff is
 in place, particularly given that we'd want to modify at least some users to 
 test.
 I also wasn't sure if we would want the check to be root  master, as in the 
 current
 secure flag, or just master.

So my plan to initially not parse secure batches might be shot. During further 
testing,
I found that it looks like Ubuntu 13.10 ships with fdo bug 71328 out of the box 
(sna
doesn't set the EXEC_SECURE flag when doing scanline waits). Sooo...

If we parse all batches and allow extra commands/registers from the drm master, 
should
that list just be the commands/registers used for scanline waits? Are there 
others you
can think of?

Brad
___
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx


Re: [Intel-gfx] [RFC 00/22] Gen7 batch buffer command parser

2013-12-05 Thread Daniel Vetter
On Thu, Dec 5, 2013 at 9:47 PM, Volkin, Bradley D
bradley.d.vol...@intel.com wrote:
 On Tue, Nov 26, 2013 at 12:24:14PM -0800, Volkin, Bradley D wrote:
 On Tue, Nov 26, 2013 at 11:35:38AM -0800, Daniel Vetter wrote:
  I think long-term we should even scan secure batches. We'd need to allow
  some registers which only the drm master (i.e. owner of the display
  hardware) is allowed to do, e.g. for scanline waits. But once we have that
  we should be able to port all current users of secure batches over to
  scanned batches and so enforce this everywhere by default.
 
  The other issue is that igt tests assume to be able to run some evil
  tests, so maybe we don't actually want this.

 Agreed. I thought we could handle this as a follow-up task once the basic 
 stuff is
 in place, particularly given that we'd want to modify at least some users to 
 test.
 I also wasn't sure if we would want the check to be root  master, as in 
 the current
 secure flag, or just master.

 So my plan to initially not parse secure batches might be shot. During 
 further testing,
 I found that it looks like Ubuntu 13.10 ships with fdo bug 71328 out of the 
 box (sna
 doesn't set the EXEC_SECURE flag when doing scanline waits). Sooo...

 If we parse all batches and allow extra commands/registers from the drm 
 master, should
 that list just be the commands/registers used for scanline waits? Are there 
 others you
 can think of?

No, I think the additional stuff the drm master is allowed to do is
just the scanline waits. Everything else that the ddx might or might
not be doing should also be possible as a normal process (e.g. the
BTILE register to switch the blitter into y-tiled mode). But Chris
should know if there's anything else really.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch
___
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx


Re: [Intel-gfx] [RFC 00/22] Gen7 batch buffer command parser

2013-12-04 Thread Daniel Vetter
On Wed, Dec 4, 2013 at 9:13 AM, Daniel Vetter dan...@ffwll.ch wrote:
 Ok, I'll look at the hw context code for buffer mgmt. For purgeable, just 
 via the
 madv field in the i915 gem object?

 Yeah, though I'd extract two tiny helpers (maybe shared with the madvise
 ioctl) to set an object to purgeable and then resurrect it. The later can
 obviously fail. The helpers are just so we have a place to throw debug
 asserts into, maybe there are other needs for in-kernel caches.

Since we've had tons of fun in the past few months with low memory
handling I think an evil testcase to exercise this cache purging logic
a bit would be good. Quick idea would be to submit dummy batches that
double in size every time (to ensure we have a miss in the cache),
with the last batch careful sized so that the userspace batch plus the
kernel copy will still fit, but not together with all the older cached
batches.

I expect that our other memory thrashing tests will also exercise the
cache purging, but having a specific test which doesn't exercise
anything else really helps to separate issues.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch
___
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx


Re: [Intel-gfx] [RFC 00/22] Gen7 batch buffer command parser

2013-12-04 Thread Volkin, Bradley D
On Wed, Dec 04, 2013 at 12:13:39AM -0800, Daniel Vetter wrote:
 On Tue, Nov 26, 2013 at 9:24 PM, Volkin, Bradley D 
 bradley.d.vol...@intel.com wrote:
 
 [snip]
 
  Which state setup stuff are you referring to? Something specific in i-g-t 
  or something
  more general?
 
 The state setup 3D commands as opposed to doing actual rendering commands
 (with 3D_PRIM). Just to have a bit more realistic cs opcode lengths for
 micro-benchmarking.
 
 [snip]
 
  Ok, I'll look at the hw context code for buffer mgmt. For purgeable, just 
  via the
  madv field in the i915 gem object?
 
 Yeah, though I'd extract two tiny helpers (maybe shared with the madvise
 ioctl) to set an object to purgeable and then resurrect it. The later can
 obviously fail. The helpers are just so we have a place to throw debug
 asserts into, maybe there are other needs for in-kernel caches.

Hmm. Not sure how much use the helpers would be. The ioctl has one case where 
it will
truncate() the object, but beyond that it just sets madv appropriately. In 
general,
purging and resurrecting appear to happen via a put_pages() from the shrinker 
and a
get_pages() the next time the object is needed.

Also, I haven't looked too closely at the shrinker code. Could there be 
potential races
between the shrinker purging a buffer and an execbuf call choosing it? I would 
expect
that synchronization is already in place. Just curious if you know off the top 
of your head.

Brad

 
  Also, there are a couple iterations of the work-in-progress patches. Do you 
  prefer a
  cache per ring or a single cache shared by all rings?
 
 I've pondered a bunch of reasons for/against the two approaches and I
 think it won't really matter. Maybe slightly leaning towards per-ring
 caches since then objects retire in order. Well, until we do preemption
 ;-)
 -Daniel
 -- 
 Daniel Vetter
 Software Engineer, Intel Corporation
 +41 (0) 79 365 57 48 - http://blog.ffwll.ch
___
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx


Re: [Intel-gfx] [RFC 00/22] Gen7 batch buffer command parser

2013-12-04 Thread Daniel Vetter
On Thu, Dec 5, 2013 at 2:40 AM, Volkin, Bradley D
bradley.d.vol...@intel.com wrote:
  Ok, I'll look at the hw context code for buffer mgmt. For purgeable, 
  just via the
  madv field in the i915 gem object?

 Yeah, though I'd extract two tiny helpers (maybe shared with the madvise
 ioctl) to set an object to purgeable and then resurrect it. The later can
 obviously fail. The helpers are just so we have a place to throw debug
 asserts into, maybe there are other needs for in-kernel caches.

 Hmm. Not sure how much use the helpers would be. The ioctl has one case where 
 it will
 truncate() the object, but beyond that it just sets madv appropriately. In 
 general,
 purging and resurrecting appear to happen via a put_pages() from the shrinker 
 and a
 get_pages() the next time the object is needed.

Yeah, the helper would be indeed minimal. But they'd allow us to
shovel checks like obj-pin_count or similar stuff in there to
sanity-check users.

 Also, I haven't looked too closely at the shrinker code. Could there be 
 potential races
 between the shrinker purging a buffer and an execbuf call choosing it? I 
 would expect
 that synchronization is already in place. Just curious if you know off the 
 top of your head.

As long as you try to set the object back to WILLNEED before you start
using it and free your reference if that fails (i.e. it's state is
PURGED) then the magic just happens.

I just reviewed the code a bit and it seems like we're missing a few
sanity checks and don't properly reject purgeable objects in e.g.
bind_to_vm.

For marking the object purgeable you only have to take care to do that
after its on the active list. The shrinker will then make sure to wait
for the gpu to complete processing before it drops the backing
storage. Misordering here would be caught by a obj-pin_count check,
hence why I've suggested that.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch
___
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx


Re: [Intel-gfx] [RFC 00/22] Gen7 batch buffer command parser

2013-11-27 Thread Daniel Vetter
On Wed, Nov 27, 2013 at 09:32:32AM +0800, ykzhao wrote:
 On Tue, 2013-11-26 at 13:24 -0700, Volkin, Bradley D wrote:
  On Tue, Nov 26, 2013 at 11:35:38AM -0800, Daniel Vetter wrote:
   Hi Brad,
   
   On Tue, Nov 26, 2013 at 08:51:17AM -0800, bradley.d.vol...@intel.com 
   wrote:
From: Brad Volkin bradley.d.vol...@intel.com

Certain OpenGL features (e.g. transform feedback, performance 
monitoring)
require userspace code to submit batches containing commands such as
MI_LOAD_REGISTER_IMM to access various registers. Unfortunately, some
generations of the hardware will noop these commands in unsecure 
batches
(which includes all userspace batches submitted via i915) even though 
the
commands may be safe and represent the intended programming model of 
the device.

This series introduces a software command parser similar in operation 
to the
command parsing done in hardware for unsecure batches. However, the 
software
parser allows some operations that would be noop'd by hardware, if the 
parser
determines the operation is safe, and submits the batch as secure to 
prevent
hardware parsing. Currently the series implements this on IVB and HSW.

The series is divided into several phases:

patches 01-09: These implement infrastructure and the command parsing 
algorithm,
   all behind a module parameter. I expect some discussion 
and
   rework, but hopefully there's nothing too controversial.
patches 10-17: These define the checks performed by the parser.
   I expect much discussion :)
patches 18-20: In a final pass over the command checks, I found some 
issues with
   the definitions. They looked painful to rebase in, so 
I've added
   them here.
patches 21-22: These enable the parser by default. It runs on all 
batches except
   those that set the I915_EXEC_SECURE flag in the 
execbuffer2 call.
   
   I think long-term we should even scan secure batches. We'd need to allow
   some registers which only the drm master (i.e. owner of the display
   hardware) is allowed to do, e.g. for scanline waits. But once we have that
   we should be able to port all current users of secure batches over to
   scanned batches and so enforce this everywhere by default.
   
   The other issue is that igt tests assume to be able to run some evil
   tests, so maybe we don't actually want this.
  
  Agreed. I thought we could handle this as a follow-up task once the basic 
  stuff is
  in place, particularly given that we'd want to modify at least some users 
  to test.
  I also wasn't sure if we would want the check to be root  master, as in 
  the current
  secure flag, or just master.
  
  W.r.t. the tests, I suppose we can just turn checking on for secure batches 
  and see
  what happens.
  
   
There are follow-up patches to libdrm and to i-g-t. The i-g-t tests are 
very
basic and do not test all of the commands used by the parser on the 
assumption
that I'm likely to make the same mistakes in both the parser and the 
test.
   
   Yeah, I agree that just checking whether commands all go through (or not)
   as expected adds very little value on top of the few tests you have done.
   I think we should take a look at some corner cases which might trip up
   your checker a bit though:
   - I think we should check batchbuffer chaining and make sure it works on
 the vcs ring and not anywhere else (we can't ever break shipping libva
 which uses this).
   - Some tests to trip up your parser should be done, like 3D commands that
 fall off the end of the batch bo. Or commands that span page boundaries.
 The later isn't an issue atm since you use vmap, but we should switch to
 per-page kmap since the vmap overhead is fairly horrible.
  
  Good suggestions. I'll look into these.
 Hi, Brad
   More inputs from libva about the batchbuffer chaining.
 
   Now the batchbuffer chaining is widely used in libva driver. This
 is related with how the libva driver processes the image. For the
 encoding purpose, it needs to be handled based on macroblock(16x16).And
 every macroblock needs a group of GPU commands. So the GPU commands for
 all the macroblocks will be constructed in the second-level batchbuffer.
 The mode of batchbuffer chaining will bring the following benefits:
   a. The size of second-level batch buffer can be allocated based on
 the size of handled image. For example: 1080p/720p/480p can use the
 different size.
   b. The gpu commands in second-level batchbuffer can be constructed
 by using GPU instead of CPU, which is helpful to improve the
 performance. 
 
   At the same time both VCS and Render Ring are used in libva
 driver. For example: The encoding will use VCS and RCS ring. Firstly the
 RCS ring is used to execute GPU command for 

Re: [Intel-gfx] [RFC 00/22] Gen7 batch buffer command parser

2013-11-27 Thread Xiang, Haihao
On Wed, 2013-11-27 at 09:10 +0100, Daniel Vetter wrote: 
 On Wed, Nov 27, 2013 at 09:32:32AM +0800, ykzhao wrote:
  On Tue, 2013-11-26 at 13:24 -0700, Volkin, Bradley D wrote:
   On Tue, Nov 26, 2013 at 11:35:38AM -0800, Daniel Vetter wrote:
Hi Brad,

On Tue, Nov 26, 2013 at 08:51:17AM -0800, bradley.d.vol...@intel.com 
wrote:
 From: Brad Volkin bradley.d.vol...@intel.com
 
 Certain OpenGL features (e.g. transform feedback, performance 
 monitoring)
 require userspace code to submit batches containing commands such as
 MI_LOAD_REGISTER_IMM to access various registers. Unfortunately, some
 generations of the hardware will noop these commands in unsecure 
 batches
 (which includes all userspace batches submitted via i915) even though 
 the
 commands may be safe and represent the intended programming model of 
 the device.
 
 This series introduces a software command parser similar in operation 
 to the
 command parsing done in hardware for unsecure batches. However, the 
 software
 parser allows some operations that would be noop'd by hardware, if 
 the parser
 determines the operation is safe, and submits the batch as secure 
 to prevent
 hardware parsing. Currently the series implements this on IVB and HSW.
 
 The series is divided into several phases:
 
 patches 01-09: These implement infrastructure and the command parsing 
 algorithm,
all behind a module parameter. I expect some 
 discussion and
  rework, but hopefully there's nothing too controversial.
 patches 10-17: These define the checks performed by the parser.
I expect much discussion :)
 patches 18-20: In a final pass over the command checks, I found some 
 issues with
the definitions. They looked painful to rebase in, so 
 I've added
  them here.
 patches 21-22: These enable the parser by default. It runs on all 
 batches except
those that set the I915_EXEC_SECURE flag in the 
 execbuffer2 call.

I think long-term we should even scan secure batches. We'd need to allow
some registers which only the drm master (i.e. owner of the display
hardware) is allowed to do, e.g. for scanline waits. But once we have 
that
we should be able to port all current users of secure batches over to
scanned batches and so enforce this everywhere by default.

The other issue is that igt tests assume to be able to run some evil
tests, so maybe we don't actually want this.
   
   Agreed. I thought we could handle this as a follow-up task once the basic 
   stuff is
   in place, particularly given that we'd want to modify at least some users 
   to test.
   I also wasn't sure if we would want the check to be root  master, as in 
   the current
   secure flag, or just master.
   
   W.r.t. the tests, I suppose we can just turn checking on for secure 
   batches and see
   what happens.
   

 There are follow-up patches to libdrm and to i-g-t. The i-g-t tests 
 are very
 basic and do not test all of the commands used by the parser on the 
 assumption
 that I'm likely to make the same mistakes in both the parser and the 
 test.

Yeah, I agree that just checking whether commands all go through (or 
not)
as expected adds very little value on top of the few tests you have 
done.
I think we should take a look at some corner cases which might trip up
your checker a bit though:
- I think we should check batchbuffer chaining and make sure it works on
  the vcs ring and not anywhere else (we can't ever break shipping libva
  which uses this).
- Some tests to trip up your parser should be done, like 3D commands 
that
  fall off the end of the batch bo. Or commands that span page 
boundaries.
  The later isn't an issue atm since you use vmap, but we should switch 
to
  per-page kmap since the vmap overhead is fairly horrible.
   
   Good suggestions. I'll look into these.
  Hi, Brad
More inputs from libva about the batchbuffer chaining.
  
Now the batchbuffer chaining is widely used in libva driver. This
  is related with how the libva driver processes the image. For the
  encoding purpose, it needs to be handled based on macroblock(16x16).And
  every macroblock needs a group of GPU commands. So the GPU commands for
  all the macroblocks will be constructed in the second-level batchbuffer.
  The mode of batchbuffer chaining will bring the following benefits:
a. The size of second-level batch buffer can be allocated based on
  the size of handled image. For example: 1080p/720p/480p can use the
  different size.
b. The gpu commands in second-level batchbuffer can be constructed
  by using GPU instead of CPU, which is helpful to improve the
  performance. 
  

Re: [Intel-gfx] [RFC 00/22] Gen7 batch buffer command parser

2013-11-27 Thread Daniel Vetter
On Wed, Nov 27, 2013 at 9:23 AM, Xiang, Haihao haihao.xi...@intel.com wrote:
 So are these 2nd level batches constructed by the gpu in some cases? That
 would be fairly horribly to take into account with the batch checker ...

 It is *not* the 2nd level batch buffer (bit 22 isn't set). Only batch
 buffer chain is used.

That's not really the hard part for the command checker, the important
question is whether the gpu generates these batches or whether they're
constructed by the cpu.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch
___
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx


Re: [Intel-gfx] [RFC 00/22] Gen7 batch buffer command parser

2013-11-27 Thread Xiang, Haihao
On Wed, 2013-11-27 at 09:31 +0100, Daniel Vetter wrote: 
 On Wed, Nov 27, 2013 at 9:23 AM, Xiang, Haihao haihao.xi...@intel.com wrote:
  So are these 2nd level batches constructed by the gpu in some cases? That
  would be fairly horribly to take into account with the batch checker ...
 
  It is *not* the 2nd level batch buffer (bit 22 isn't set). Only batch
  buffer chain is used.
 
 That's not really the hard part for the command checker, the important
 question is whether the gpu generates these batches or whether they're
 constructed by the cpu.

Yes, some batches are generated by GPU, either by EU shaders or by BSD
unit (batch buffer for MC on ILK).


 -Daniel


___
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx


Re: [Intel-gfx] [RFC 00/22] Gen7 batch buffer command parser

2013-11-27 Thread Daniel Vetter
On Wed, Nov 27, 2013 at 04:42:11PM +0800, Xiang, Haihao wrote:
 On Wed, 2013-11-27 at 09:31 +0100, Daniel Vetter wrote: 
  On Wed, Nov 27, 2013 at 9:23 AM, Xiang, Haihao haihao.xi...@intel.com 
  wrote:
   So are these 2nd level batches constructed by the gpu in some cases? That
   would be fairly horribly to take into account with the batch checker ...
  
   It is *not* the 2nd level batch buffer (bit 22 isn't set). Only batch
   buffer chain is used.
  
  That's not really the hard part for the command checker, the important
  question is whether the gpu generates these batches or whether they're
  constructed by the cpu.
 
 Yes, some batches are generated by GPU, either by EU shaders or by BSD
 unit (batch buffer for MC on ILK).

So is ilk the only platform which does that? The command checker is only
for gen7+ (and maybe gen6).
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch
___
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx


Re: [Intel-gfx] [RFC 00/22] Gen7 batch buffer command parser

2013-11-27 Thread ykzhao
On Wed, 2013-11-27 at 09:47 +0100, Daniel Vetter wrote:
 On Wed, Nov 27, 2013 at 04:42:11PM +0800, Xiang, Haihao wrote:
  On Wed, 2013-11-27 at 09:31 +0100, Daniel Vetter wrote: 
   On Wed, Nov 27, 2013 at 9:23 AM, Xiang, Haihao haihao.xi...@intel.com 
   wrote:
So are these 2nd level batches constructed by the gpu in some cases? 
That
would be fairly horribly to take into account with the batch checker 
...
   
It is *not* the 2nd level batch buffer (bit 22 isn't set). Only batch
buffer chain is used.
   
   That's not really the hard part for the command checker, the important
   question is whether the gpu generates these batches or whether they're
   constructed by the cpu.
  
  Yes, some batches are generated by GPU, either by EU shaders or by BSD
  unit (batch buffer for MC on ILK).
 
 So is ilk the only platform which does that? The command checker is only
 for gen7+ (and maybe gen6).

No. The Ivy/Haswell also uses the GPU to construct the batchbuffer. This
is mainly for the optimization of performance. 

Thanks.
 Yakui

 -Daniel


___
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx


Re: [Intel-gfx] [RFC 00/22] Gen7 batch buffer command parser

2013-11-27 Thread Xiang, Haihao
On Wed, 2013-11-27 at 09:47 +0100, Daniel Vetter wrote: 
 On Wed, Nov 27, 2013 at 04:42:11PM +0800, Xiang, Haihao wrote:
  On Wed, 2013-11-27 at 09:31 +0100, Daniel Vetter wrote: 
   On Wed, Nov 27, 2013 at 9:23 AM, Xiang, Haihao haihao.xi...@intel.com 
   wrote:
So are these 2nd level batches constructed by the gpu in some cases? 
That
would be fairly horribly to take into account with the batch checker 
...
   
It is *not* the 2nd level batch buffer (bit 22 isn't set). Only batch
buffer chain is used.
   
   That's not really the hard part for the command checker, the important
   question is whether the gpu generates these batches or whether they're
   constructed by the cpu.
  
  Yes, some batches are generated by GPU, either by EU shaders or by BSD
  unit (batch buffer for MC on ILK).
 
 So is ilk the only platform which does that? The command checker is only
 for gen7+ (and maybe gen6).

No. In libva some batches are generated by BSD unit on ILK. on Gen6+,
some batches are constructed by GPU shader.


 -Daniel


___
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx


[Intel-gfx] [RFC 00/22] Gen7 batch buffer command parser

2013-11-26 Thread bradley . d . volkin
From: Brad Volkin bradley.d.vol...@intel.com

Certain OpenGL features (e.g. transform feedback, performance monitoring)
require userspace code to submit batches containing commands such as
MI_LOAD_REGISTER_IMM to access various registers. Unfortunately, some
generations of the hardware will noop these commands in unsecure batches
(which includes all userspace batches submitted via i915) even though the
commands may be safe and represent the intended programming model of the device.

This series introduces a software command parser similar in operation to the
command parsing done in hardware for unsecure batches. However, the software
parser allows some operations that would be noop'd by hardware, if the parser
determines the operation is safe, and submits the batch as secure to prevent
hardware parsing. Currently the series implements this on IVB and HSW.

The series is divided into several phases:

patches 01-09: These implement infrastructure and the command parsing algorithm,
   all behind a module parameter. I expect some discussion and
   rework, but hopefully there's nothing too controversial.
patches 10-17: These define the checks performed by the parser.
   I expect much discussion :)
patches 18-20: In a final pass over the command checks, I found some issues with
   the definitions. They looked painful to rebase in, so I've added
   them here.
patches 21-22: These enable the parser by default. It runs on all batches except
   those that set the I915_EXEC_SECURE flag in the execbuffer2 call.

There are follow-up patches to libdrm and to i-g-t. The i-g-t tests are very
basic and do not test all of the commands used by the parser on the assumption
that I'm likely to make the same mistakes in both the parser and the test.

I've run the i-g-t gem_* tests, the piglit quick tests (w/Mesa git from a few
days ago), and generally used an Ubuntu 13.10 IVB system with the parser
running. Aside from a failure described below, I don't think there are any
regressions. That is, piglit claims some regressions, but from manually running
the tests I think these are false positives. However, I could use help in
getting broader testing, particularly around performance. In general, I see less
than 3% performance impact on HSW, with more like 10% impact for pathological
batch sizes. But we'll certainly want to run relevant benchmarks beyond what
I've done.

At this point there are a couple of known issues and potential improvements.

1) VLV. The parser is currently disabled for VLV. One type of check performed by
   the parser is that commands which access memory do so via PPGTT. VLV does not
   have PPGTT enabled at this time. I chose to implement the PPGTT checks via
   generic bit checking infrastructure in the parser, so they are not easily
   disabled for VLV. For now, I'm disabling parsing altogether in the hope that
   PPGTT can be enabled for VLV in the near future.
2) Coherency. I've found two types of coherency issues when reading the batch
   buffer from the CPU during execbuffer2. Looking for help with both issues.
i. First, the i-g-t test gem_cpu_reloc blits to a batch buffer and the
   parser isn't properly waiting for the operation to complete before
   parsing. I tried adding i915_gem_object_sync(batch_obj, [ring|NULL])
   but that actually caused more failures.
   ii. Second, on VLV, I've seen cache coherency issues when userspace writes
   the batch via pwrite fast path before calling execbuffer2. The parser
   reads stale data. This works fine on IVB and HSW, so I believe it's an
   LLC vs. non-LLC issue. I'm just unclear on what the correct flushing or
   synchronization is for this scenario.
3) 2nd-level batches. The parser currently allows MI_BATCH_BUFFER_START commands
   in userspace batches without parsing them. The media driver uses 2nd-level
   batches, so a solution is required. I have some ideas but don't want to delay
   the review process for what I have so far. It may be that the 2nd-level
   parsing is only needed for VCS and the current code (plus rejecting BBS)
   would be sufficient for RCS.
4) Command buffer copy. To avoid CPU modifications to buffers after parsing, and
   to avoid GPU modifications to buffers via EUs or commands in the batch, we
   should copy the userspace batch buffer to memory that userspace does not
   have access to, map it into GGTT, and execute that batch buffer. I have a
   sense of how to do this for 1st-level batches, but it would need changes to
   tie in with the 2nd-level batch parsing I think, so I've again held off.

Brad Volkin (22):
  drm/i915: Add data structures for command parser
  drm/i915: Initial command parser table definitions
  drm/i915: Hook command parser tables up to rings
  drm/i915: Add per-ring command length decode functions
  drm/i915: Implement command parsing
  drm/i915: Add a HAS_CMD_PARSER getparam
  drm/i915: Add support for 

Re: [Intel-gfx] [RFC 00/22] Gen7 batch buffer command parser

2013-11-26 Thread Daniel Vetter
Hi Brad,

On Tue, Nov 26, 2013 at 08:51:17AM -0800, bradley.d.vol...@intel.com wrote:
 From: Brad Volkin bradley.d.vol...@intel.com
 
 Certain OpenGL features (e.g. transform feedback, performance monitoring)
 require userspace code to submit batches containing commands such as
 MI_LOAD_REGISTER_IMM to access various registers. Unfortunately, some
 generations of the hardware will noop these commands in unsecure batches
 (which includes all userspace batches submitted via i915) even though the
 commands may be safe and represent the intended programming model of the 
 device.
 
 This series introduces a software command parser similar in operation to the
 command parsing done in hardware for unsecure batches. However, the software
 parser allows some operations that would be noop'd by hardware, if the parser
 determines the operation is safe, and submits the batch as secure to prevent
 hardware parsing. Currently the series implements this on IVB and HSW.
 
 The series is divided into several phases:
 
 patches 01-09: These implement infrastructure and the command parsing 
 algorithm,
all behind a module parameter. I expect some discussion and
  rework, but hopefully there's nothing too controversial.
 patches 10-17: These define the checks performed by the parser.
I expect much discussion :)
 patches 18-20: In a final pass over the command checks, I found some issues 
 with
the definitions. They looked painful to rebase in, so I've 
 added
  them here.
 patches 21-22: These enable the parser by default. It runs on all batches 
 except
those that set the I915_EXEC_SECURE flag in the execbuffer2 
 call.

I think long-term we should even scan secure batches. We'd need to allow
some registers which only the drm master (i.e. owner of the display
hardware) is allowed to do, e.g. for scanline waits. But once we have that
we should be able to port all current users of secure batches over to
scanned batches and so enforce this everywhere by default.

The other issue is that igt tests assume to be able to run some evil
tests, so maybe we don't actually want this.

 There are follow-up patches to libdrm and to i-g-t. The i-g-t tests are very
 basic and do not test all of the commands used by the parser on the assumption
 that I'm likely to make the same mistakes in both the parser and the test.

Yeah, I agree that just checking whether commands all go through (or not)
as expected adds very little value on top of the few tests you have done.
I think we should take a look at some corner cases which might trip up
your checker a bit though:
- I think we should check batchbuffer chaining and make sure it works on
  the vcs ring and not anywhere else (we can't ever break shipping libva
  which uses this).
- Some tests to trip up your parser should be done, like 3D commands that
  fall off the end of the batch bo. Or commands that span page boundaries.
  The later isn't an issue atm since you use vmap, but we should switch to
  per-page kmap since the vmap overhead is fairly horrible.

 I've run the i-g-t gem_* tests, the piglit quick tests (w/Mesa git from a few
 days ago), and generally used an Ubuntu 13.10 IVB system with the parser
 running. Aside from a failure described below, I don't think there are any
 regressions. That is, piglit claims some regressions, but from manually 
 running
 the tests I think these are false positives. However, I could use help in
 getting broader testing, particularly around performance. In general, I see 
 less
 than 3% performance impact on HSW, with more like 10% impact for pathological
 batch sizes. But we'll certainly want to run relevant benchmarks beyond what
 I've done.

Yeah, a microbenchmark that just shovels MI_NOP batches of various sizes
through the checker and bypassing it (with EXEC_SECURE) would be really
good. Maybe even some variable-sized commands (all the state setup stuff
should be useful for that) to keep things interesting. Some variation is
also important to have some good cache thrasing going on (since your check
tables are fairly large I think).

 At this point there are a couple of known issues and potential improvements.
 
 1) VLV. The parser is currently disabled for VLV. One type of check performed 
 by
the parser is that commands which access memory do so via PPGTT. VLV does 
 not
have PPGTT enabled at this time. I chose to implement the PPGTT checks via
generic bit checking infrastructure in the parser, so they are not easily
disabled for VLV. For now, I'm disabling parsing altogether in the hope 
 that
PPGTT can be enabled for VLV in the near future.

We need ppgtt for the parser anyway, since otherwise userspace can submit
a self-modifying batch. Checking for that is impossible, so allowing sw
checked batches without the ppgtt/ggtt split would be a decent security
hole.

 2) Coherency. I've found two types of coherency issues when reading the 

Re: [Intel-gfx] [RFC 00/22] Gen7 batch buffer command parser

2013-11-26 Thread Volkin, Bradley D
On Tue, Nov 26, 2013 at 11:35:38AM -0800, Daniel Vetter wrote:
 Hi Brad,
 
 On Tue, Nov 26, 2013 at 08:51:17AM -0800, bradley.d.vol...@intel.com wrote:
  From: Brad Volkin bradley.d.vol...@intel.com
  
  Certain OpenGL features (e.g. transform feedback, performance monitoring)
  require userspace code to submit batches containing commands such as
  MI_LOAD_REGISTER_IMM to access various registers. Unfortunately, some
  generations of the hardware will noop these commands in unsecure batches
  (which includes all userspace batches submitted via i915) even though the
  commands may be safe and represent the intended programming model of the 
  device.
  
  This series introduces a software command parser similar in operation to the
  command parsing done in hardware for unsecure batches. However, the software
  parser allows some operations that would be noop'd by hardware, if the 
  parser
  determines the operation is safe, and submits the batch as secure to 
  prevent
  hardware parsing. Currently the series implements this on IVB and HSW.
  
  The series is divided into several phases:
  
  patches 01-09: These implement infrastructure and the command parsing 
  algorithm,
 all behind a module parameter. I expect some discussion and
 rework, but hopefully there's nothing too controversial.
  patches 10-17: These define the checks performed by the parser.
 I expect much discussion :)
  patches 18-20: In a final pass over the command checks, I found some issues 
  with
 the definitions. They looked painful to rebase in, so I've 
  added
 them here.
  patches 21-22: These enable the parser by default. It runs on all batches 
  except
 those that set the I915_EXEC_SECURE flag in the execbuffer2 
  call.
 
 I think long-term we should even scan secure batches. We'd need to allow
 some registers which only the drm master (i.e. owner of the display
 hardware) is allowed to do, e.g. for scanline waits. But once we have that
 we should be able to port all current users of secure batches over to
 scanned batches and so enforce this everywhere by default.
 
 The other issue is that igt tests assume to be able to run some evil
 tests, so maybe we don't actually want this.

Agreed. I thought we could handle this as a follow-up task once the basic stuff 
is
in place, particularly given that we'd want to modify at least some users to 
test.
I also wasn't sure if we would want the check to be root  master, as in the 
current
secure flag, or just master.

W.r.t. the tests, I suppose we can just turn checking on for secure batches and 
see
what happens.

 
  There are follow-up patches to libdrm and to i-g-t. The i-g-t tests are very
  basic and do not test all of the commands used by the parser on the 
  assumption
  that I'm likely to make the same mistakes in both the parser and the test.
 
 Yeah, I agree that just checking whether commands all go through (or not)
 as expected adds very little value on top of the few tests you have done.
 I think we should take a look at some corner cases which might trip up
 your checker a bit though:
 - I think we should check batchbuffer chaining and make sure it works on
   the vcs ring and not anywhere else (we can't ever break shipping libva
   which uses this).
 - Some tests to trip up your parser should be done, like 3D commands that
   fall off the end of the batch bo. Or commands that span page boundaries.
   The later isn't an issue atm since you use vmap, but we should switch to
   per-page kmap since the vmap overhead is fairly horrible.

Good suggestions. I'll look into these.

 
  I've run the i-g-t gem_* tests, the piglit quick tests (w/Mesa git from a 
  few
  days ago), and generally used an Ubuntu 13.10 IVB system with the parser
  running. Aside from a failure described below, I don't think there are any
  regressions. That is, piglit claims some regressions, but from manually 
  running
  the tests I think these are false positives. However, I could use help in
  getting broader testing, particularly around performance. In general, I see 
  less
  than 3% performance impact on HSW, with more like 10% impact for 
  pathological
  batch sizes. But we'll certainly want to run relevant benchmarks beyond what
  I've done.
 
 Yeah, a microbenchmark that just shovels MI_NOP batches of various sizes
 through the checker and bypassing it (with EXEC_SECURE) would be really
 good. Maybe even some variable-sized commands (all the state setup stuff
 should be useful for that) to keep things interesting. Some variation is
 also important to have some good cache thrasing going on (since your check
 tables are fairly large I think).

Ok. I'd be interested in some comment from the mesa and libva guys here for 
real world
workloads, but a microbenchmark would be a good start.

Which state setup stuff are you referring to? Something specific in i-g-t or 
something
more general?

 
  At this point 

Re: [Intel-gfx] [RFC 00/22] Gen7 batch buffer command parser

2013-11-26 Thread Xiang, Haihao
On Tue, 2013-11-26 at 20:35 +0100, Daniel Vetter wrote: 
 Hi Brad,
 
 On Tue, Nov 26, 2013 at 08:51:17AM -0800, bradley.d.vol...@intel.com wrote:
  From: Brad Volkin bradley.d.vol...@intel.com
  
  Certain OpenGL features (e.g. transform feedback, performance monitoring)
  require userspace code to submit batches containing commands such as
  MI_LOAD_REGISTER_IMM to access various registers. Unfortunately, some
  generations of the hardware will noop these commands in unsecure batches
  (which includes all userspace batches submitted via i915) even though the
  commands may be safe and represent the intended programming model of the 
  device.
  
  This series introduces a software command parser similar in operation to the
  command parsing done in hardware for unsecure batches. However, the software
  parser allows some operations that would be noop'd by hardware, if the 
  parser
  determines the operation is safe, and submits the batch as secure to 
  prevent
  hardware parsing. Currently the series implements this on IVB and HSW.
  
  The series is divided into several phases:
  
  patches 01-09: These implement infrastructure and the command parsing 
  algorithm,
 all behind a module parameter. I expect some discussion and
 rework, but hopefully there's nothing too controversial.
  patches 10-17: These define the checks performed by the parser.
 I expect much discussion :)
  patches 18-20: In a final pass over the command checks, I found some issues 
  with
 the definitions. They looked painful to rebase in, so I've 
  added
 them here.
  patches 21-22: These enable the parser by default. It runs on all batches 
  except
 those that set the I915_EXEC_SECURE flag in the execbuffer2 
  call.
 
 I think long-term we should even scan secure batches. We'd need to allow
 some registers which only the drm master (i.e. owner of the display
 hardware) is allowed to do, e.g. for scanline waits. But once we have that
 we should be able to port all current users of secure batches over to
 scanned batches and so enforce this everywhere by default.
 
 The other issue is that igt tests assume to be able to run some evil
 tests, so maybe we don't actually want this.
 
  There are follow-up patches to libdrm and to i-g-t. The i-g-t tests are very
  basic and do not test all of the commands used by the parser on the 
  assumption
  that I'm likely to make the same mistakes in both the parser and the test.
 
 Yeah, I agree that just checking whether commands all go through (or not)
 as expected adds very little value on top of the few tests you have done.
 I think we should take a look at some corner cases which might trip up
 your checker a bit though:
 - I think we should check batchbuffer chaining and make sure it works on
   the vcs ring and not anywhere else (we can't ever break shipping libva
   which uses this).

Besides the vcs ring, we also use batchbuffer chaining on the render
ring for video post processing, video motion estimation and motion
compensation(on ILK),  A fixed length batch buffer isn't suitable for
those operations as those operations are based on a macroblock instead
of a frame. It would be better to make sure batchbuffer chaining works
on the render ring too.


 - Some tests to trip up your parser should be done, like 3D commands that
   fall off the end of the batch bo. Or commands that span page boundaries.
   The later isn't an issue atm since you use vmap, but we should switch to
   per-page kmap since the vmap overhead is fairly horrible.
 
  I've run the i-g-t gem_* tests, the piglit quick tests (w/Mesa git from a 
  few
  days ago), and generally used an Ubuntu 13.10 IVB system with the parser
  running. Aside from a failure described below, I don't think there are any
  regressions. That is, piglit claims some regressions, but from manually 
  running
  the tests I think these are false positives. However, I could use help in
  getting broader testing, particularly around performance. In general, I see 
  less
  than 3% performance impact on HSW, with more like 10% impact for 
  pathological
  batch sizes. But we'll certainly want to run relevant benchmarks beyond what
  I've done.
 
 Yeah, a microbenchmark that just shovels MI_NOP batches of various sizes
 through the checker and bypassing it (with EXEC_SECURE) would be really
 good. Maybe even some variable-sized commands (all the state setup stuff
 should be useful for that) to keep things interesting. Some variation is
 also important to have some good cache thrasing going on (since your check
 tables are fairly large I think).
 
  At this point there are a couple of known issues and potential improvements.
  
  1) VLV. The parser is currently disabled for VLV. One type of check 
  performed by
 the parser is that commands which access memory do so via PPGTT. VLV 
  does not
 have PPGTT enabled at this time. I chose to implement the 

Re: [Intel-gfx] [RFC 00/22] Gen7 batch buffer command parser

2013-11-26 Thread ykzhao
On Tue, 2013-11-26 at 13:24 -0700, Volkin, Bradley D wrote:
 On Tue, Nov 26, 2013 at 11:35:38AM -0800, Daniel Vetter wrote:
  Hi Brad,
  
  On Tue, Nov 26, 2013 at 08:51:17AM -0800, bradley.d.vol...@intel.com wrote:
   From: Brad Volkin bradley.d.vol...@intel.com
   
   Certain OpenGL features (e.g. transform feedback, performance monitoring)
   require userspace code to submit batches containing commands such as
   MI_LOAD_REGISTER_IMM to access various registers. Unfortunately, some
   generations of the hardware will noop these commands in unsecure batches
   (which includes all userspace batches submitted via i915) even though the
   commands may be safe and represent the intended programming model of the 
   device.
   
   This series introduces a software command parser similar in operation to 
   the
   command parsing done in hardware for unsecure batches. However, the 
   software
   parser allows some operations that would be noop'd by hardware, if the 
   parser
   determines the operation is safe, and submits the batch as secure to 
   prevent
   hardware parsing. Currently the series implements this on IVB and HSW.
   
   The series is divided into several phases:
   
   patches 01-09: These implement infrastructure and the command parsing 
   algorithm,
  all behind a module parameter. I expect some discussion and
rework, but hopefully there's nothing too controversial.
   patches 10-17: These define the checks performed by the parser.
  I expect much discussion :)
   patches 18-20: In a final pass over the command checks, I found some 
   issues with
  the definitions. They looked painful to rebase in, so I've 
   added
them here.
   patches 21-22: These enable the parser by default. It runs on all batches 
   except
  those that set the I915_EXEC_SECURE flag in the 
   execbuffer2 call.
  
  I think long-term we should even scan secure batches. We'd need to allow
  some registers which only the drm master (i.e. owner of the display
  hardware) is allowed to do, e.g. for scanline waits. But once we have that
  we should be able to port all current users of secure batches over to
  scanned batches and so enforce this everywhere by default.
  
  The other issue is that igt tests assume to be able to run some evil
  tests, so maybe we don't actually want this.
 
 Agreed. I thought we could handle this as a follow-up task once the basic 
 stuff is
 in place, particularly given that we'd want to modify at least some users to 
 test.
 I also wasn't sure if we would want the check to be root  master, as in the 
 current
 secure flag, or just master.
 
 W.r.t. the tests, I suppose we can just turn checking on for secure batches 
 and see
 what happens.
 
  
   There are follow-up patches to libdrm and to i-g-t. The i-g-t tests are 
   very
   basic and do not test all of the commands used by the parser on the 
   assumption
   that I'm likely to make the same mistakes in both the parser and the test.
  
  Yeah, I agree that just checking whether commands all go through (or not)
  as expected adds very little value on top of the few tests you have done.
  I think we should take a look at some corner cases which might trip up
  your checker a bit though:
  - I think we should check batchbuffer chaining and make sure it works on
the vcs ring and not anywhere else (we can't ever break shipping libva
which uses this).
  - Some tests to trip up your parser should be done, like 3D commands that
fall off the end of the batch bo. Or commands that span page boundaries.
The later isn't an issue atm since you use vmap, but we should switch to
per-page kmap since the vmap overhead is fairly horrible.
 
 Good suggestions. I'll look into these.
Hi, Brad
  More inputs from libva about the batchbuffer chaining.

  Now the batchbuffer chaining is widely used in libva driver. This
is related with how the libva driver processes the image. For the
encoding purpose, it needs to be handled based on macroblock(16x16).And
every macroblock needs a group of GPU commands. So the GPU commands for
all the macroblocks will be constructed in the second-level batchbuffer.
The mode of batchbuffer chaining will bring the following benefits:
  a. The size of second-level batch buffer can be allocated based on
the size of handled image. For example: 1080p/720p/480p can use the
different size.
  b. The gpu commands in second-level batchbuffer can be constructed
by using GPU instead of CPU, which is helpful to improve the
performance. 

  At the same time both VCS and Render Ring are used in libva
driver. For example: The encoding will use VCS and RCS ring. Firstly the
RCS ring is used to execute GPU command for the motion vector/mode
prediction. And then the VCS Ring is used to execute the GPU command for
generating the bit-stream. So not only VCS ring uses the mode of
batchbuffer chaining, but also the Render