Re: [Intel-gfx] [PATCH] drm/i915: Use hash tables for the command parser

2014-05-12 Thread Damien Lespiau
On Sat, May 10, 2014 at 02:10:43PM -0700, bradley.d.vol...@intel.com wrote:
 From: Brad Volkin bradley.d.vol...@intel.com
 
 For clients that submit large batch buffers the command parser has
 a substantial impact on performance. On my HSW ULT system performance
 drops as much as ~20% on some tests. Most of the time is spent in the
 command lookup code. Converting that from the current naive search to
 a hash table lookup reduces the performance drop to ~10%.
 
 The choice of value for I915_CMD_HASH_ORDER allows all commands
 currently used in the parser tables to hash to their own bucket (except
 for one collision on the render ring). The tradeoff is that it wastes
 memory. Because the opcodes for the commands in the tables are not
 particularly well distributed, reducing the order still leaves many
 buckets empty. The increased collisions don't seem to have a huge
 impact on the performance gain, but for now anyhow, the parser trades
 memory for performance.
 
 NB: Ville noticed that the error paths through the ring init code
 will leak memory. I've not addressed that here. We can do a follow
 up pass to handle all of the leaks.
 
 v2: improved comment describing selection of hash key mask (Damien)
 replace a BUG_ON() with an error return (Tvrtko, Ville)
 commit message improvements
 
 Signed-off-by: Brad Volkin bradley.d.vol...@intel.com

Reviewed-by: Damien Lespiau damien.lesp...@intel.com

-- 
Damien

 ---
  drivers/gpu/drm/i915/i915_cmd_parser.c  | 158 
 +---
  drivers/gpu/drm/i915/i915_drv.h |   3 +-
  drivers/gpu/drm/i915/intel_ringbuffer.c |   6 +-
  drivers/gpu/drm/i915/intel_ringbuffer.h |  11 ++-
  4 files changed, 140 insertions(+), 38 deletions(-)
 
 diff --git a/drivers/gpu/drm/i915/i915_cmd_parser.c 
 b/drivers/gpu/drm/i915/i915_cmd_parser.c
 index 69d34e4..d3a5b74 100644
 --- a/drivers/gpu/drm/i915/i915_cmd_parser.c
 +++ b/drivers/gpu/drm/i915/i915_cmd_parser.c
 @@ -498,16 +498,18 @@ static u32 gen7_blt_get_cmd_length_mask(u32 cmd_header)
   return 0;
  }
  
 -static bool validate_cmds_sorted(struct intel_ring_buffer *ring)
 +static bool validate_cmds_sorted(struct intel_ring_buffer *ring,
 +  const struct drm_i915_cmd_table *cmd_tables,
 +  int cmd_table_count)
  {
   int i;
   bool ret = true;
  
 - if (!ring-cmd_tables || ring-cmd_table_count == 0)
 + if (!cmd_tables || cmd_table_count == 0)
   return true;
  
 - for (i = 0; i  ring-cmd_table_count; i++) {
 - const struct drm_i915_cmd_table *table = ring-cmd_tables[i];
 + for (i = 0; i  cmd_table_count; i++) {
 + const struct drm_i915_cmd_table *table = cmd_tables[i];
   u32 previous = 0;
   int j;
  
 @@ -557,6 +559,68 @@ static bool validate_regs_sorted(struct 
 intel_ring_buffer *ring)
ring-master_reg_count);
  }
  
 +struct cmd_node {
 + const struct drm_i915_cmd_descriptor *desc;
 + struct hlist_node node;
 +};
 +
 +/*
 + * Different command ranges have different numbers of bits for the opcode. 
 For
 + * example, MI commands use bits 31:23 while 3D commands use bits 31:16. The
 + * problem is that, for example, MI commands use bits 22:16 for other fields
 + * such as GGTT vs PPGTT bits. If we include those bits in the mask then when
 + * we mask a command from a batch it could hash to the wrong bucket due to
 + * non-opcode bits being set. But if we don't include those bits, some 3D
 + * commands may hash to the same bucket due to not including opcode bits that
 + * make the command unique. For now, we will risk hashing to the same bucket.
 + *
 + * If we attempt to generate a perfect hash, we should be able to look at 
 bits
 + * 31:29 of a command from a batch buffer and use the full mask for that
 + * client. The existing INSTR_CLIENT_MASK/SHIFT defines can be used for this.
 + */
 +#define CMD_HASH_MASK STD_MI_OPCODE_MASK
 +
 +static int init_hash_table(struct intel_ring_buffer *ring,
 +const struct drm_i915_cmd_table *cmd_tables,
 +int cmd_table_count)
 +{
 + int i, j;
 +
 + hash_init(ring-cmd_hash);
 +
 + for (i = 0; i  cmd_table_count; i++) {
 + const struct drm_i915_cmd_table *table = cmd_tables[i];
 +
 + for (j = 0; j  table-count; j++) {
 + const struct drm_i915_cmd_descriptor *desc =
 + table-table[j];
 + struct cmd_node *desc_node =
 + kmalloc(sizeof(*desc_node), GFP_KERNEL);
 +
 + if (!desc_node)
 + return -ENOMEM;
 +
 + desc_node-desc = desc;
 + hash_add(ring-cmd_hash, desc_node-node,
 +  desc-cmd.value  CMD_HASH_MASK);
 + }
 + }
 +
 + return 0;
 +}
 +
 +static void fini_hash_table(struct 

Re: [Intel-gfx] [PATCH] drm/i915: Use hash tables for the command parser

2014-05-12 Thread Tvrtko Ursulin


On 05/10/2014 10:10 PM, bradley.d.vol...@intel.com wrote:

From: Brad Volkin bradley.d.vol...@intel.com

For clients that submit large batch buffers the command parser has
a substantial impact on performance. On my HSW ULT system performance
drops as much as ~20% on some tests. Most of the time is spent in the
command lookup code. Converting that from the current naive search to
a hash table lookup reduces the performance drop to ~10%.

The choice of value for I915_CMD_HASH_ORDER allows all commands
currently used in the parser tables to hash to their own bucket (except
for one collision on the render ring). The tradeoff is that it wastes
memory. Because the opcodes for the commands in the tables are not
particularly well distributed, reducing the order still leaves many
buckets empty. The increased collisions don't seem to have a huge
impact on the performance gain, but for now anyhow, the parser trades
memory for performance.

NB: Ville noticed that the error paths through the ring init code
will leak memory. I've not addressed that here. We can do a follow
up pass to handle all of the leaks.

v2: improved comment describing selection of hash key mask (Damien)
replace a BUG_ON() with an error return (Tvrtko, Ville)
commit message improvements

Signed-off-by: Brad Volkin bradley.d.vol...@intel.com


Reviewed-by: Tvrtko Ursulin tvrtko.ursu...@intel.com

Regards,

Tvrtko
___
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx


Re: [Intel-gfx] [PATCH] drm/i915: Use hash tables for the command parser

2014-05-12 Thread Daniel Vetter
On Thu, May 08, 2014 at 09:02:18AM -0700, Volkin, Bradley D wrote:
 On Thu, May 08, 2014 at 08:45:07AM -0700, Ville Syrjälä wrote:
  On Thu, May 08, 2014 at 08:27:16AM -0700, Volkin, Bradley D wrote:
   On Thu, May 08, 2014 at 02:56:05AM -0700, Tvrtko Ursulin wrote:

Hi Brad,

On 04/28/2014 04:22 PM, bradley.d.vol...@intel.com wrote:
[snip]
 - BUG_ON(!validate_cmds_sorted(ring));
 + BUG_ON(!validate_cmds_sorted(ring, cmd_tables, 
 cmd_table_count));
   BUG_ON(!validate_regs_sorted(ring));
 +
 + BUG_ON(init_hash_table(ring, cmd_tables, cmd_table_count));

Is a BUG_ON a bit harsh since the above fails only on ENOMEM condition?

If the concern is not allowing any command execution if parser setup 
has 
failed, it would be nicer to the system as whole to just keep rejecting 
everything, but let the rest of the kernel live to enable debug or 
whatever?

I know it won't happen almost ever so it's a minor point really. I just 
dislike actively hosing the whole system if it is avoidable.
   
   Hi Tvrtko,
   
   I agree that a BUG_ON might be harsh here. I suppose we could log an
   error and disable the command parser. Most command buffers would
   still go through fine but HW parsing would reject some that the SW
   parser might otherwise allow. That could be a bit odd if we ever did
   get a failure - apps/functionality that worked the last time I booted
   suddenly don't this time. The issue would be in the log though.
   
   I don't have a strong preference on this. Whatever people prefer.
  
  If the memory allocation fails there's probably not much point in
  trying to limp along and continue the driver init. So just pass error
  up and let the caller deal with it.
  
  Looking at the error paths up from ring init, we probably leak a ton
  of junk but at least the kernel should remain otherwise operational.
 
 Passing the error up is probably a good option. There was a recent change
 to mark the GPU as wedged if ring init fails, which seems like a reasonable
 enough response to parser init failure. I'll have to look at the cleanup
 paths. The per-ring parser init can probably move around to avoid excessive
 additional cleanup code on failure paths.

With the latest ring init rework from Chris we should just fail the gem
side of things and declare the gpu terminally wedged. But the modeset side
should keep on working, which means users will be able to take a look at
dmesg instead of a black screen (since that's what you get from BUG_ON in
driver load often).

Working modeset with broken gpu is infinitely better than a black screen,
so no BUG_ON here please. At least if the error code isn't much more than
a simply return, we always have to balance bugs in the error code with
better abilities to limp on long enough for a useful bug report.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch
___
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx


Re: [Intel-gfx] [PATCH] drm/i915: Use hash tables for the command parser

2014-05-12 Thread Volkin, Bradley D
On Mon, May 12, 2014 at 09:24:06AM -0700, Daniel Vetter wrote:
 On Thu, May 08, 2014 at 09:02:18AM -0700, Volkin, Bradley D wrote:
  On Thu, May 08, 2014 at 08:45:07AM -0700, Ville Syrjälä wrote:
   On Thu, May 08, 2014 at 08:27:16AM -0700, Volkin, Bradley D wrote:
On Thu, May 08, 2014 at 02:56:05AM -0700, Tvrtko Ursulin wrote:
 
 Hi Brad,
 
 On 04/28/2014 04:22 PM, bradley.d.vol...@intel.com wrote:
 [snip]
  -   BUG_ON(!validate_cmds_sorted(ring));
  +   BUG_ON(!validate_cmds_sorted(ring, cmd_tables, 
  cmd_table_count));
  BUG_ON(!validate_regs_sorted(ring));
  +
  +   BUG_ON(init_hash_table(ring, cmd_tables, cmd_table_count));
 
 Is a BUG_ON a bit harsh since the above fails only on ENOMEM 
 condition?
 
 If the concern is not allowing any command execution if parser setup 
 has 
 failed, it would be nicer to the system as whole to just keep 
 rejecting 
 everything, but let the rest of the kernel live to enable debug or 
 whatever?
 
 I know it won't happen almost ever so it's a minor point really. I 
 just 
 dislike actively hosing the whole system if it is avoidable.

Hi Tvrtko,

I agree that a BUG_ON might be harsh here. I suppose we could log an
error and disable the command parser. Most command buffers would
still go through fine but HW parsing would reject some that the SW
parser might otherwise allow. That could be a bit odd if we ever did
get a failure - apps/functionality that worked the last time I booted
suddenly don't this time. The issue would be in the log though.

I don't have a strong preference on this. Whatever people prefer.
   
   If the memory allocation fails there's probably not much point in
   trying to limp along and continue the driver init. So just pass error
   up and let the caller deal with it.
   
   Looking at the error paths up from ring init, we probably leak a ton
   of junk but at least the kernel should remain otherwise operational.
  
  Passing the error up is probably a good option. There was a recent change
  to mark the GPU as wedged if ring init fails, which seems like a reasonable
  enough response to parser init failure. I'll have to look at the cleanup
  paths. The per-ring parser init can probably move around to avoid excessive
  additional cleanup code on failure paths.
 
 With the latest ring init rework from Chris we should just fail the gem
 side of things and declare the gpu terminally wedged. But the modeset side
 should keep on working, which means users will be able to take a look at
 dmesg instead of a black screen (since that's what you get from BUG_ON in
 driver load often).
 
 Working modeset with broken gpu is infinitely better than a black screen,
 so no BUG_ON here please. At least if the error code isn't much more than
 a simply return, we always have to balance bugs in the error code with
 better abilities to limp on long enough for a useful bug report.

Ok. I sent a v2 and in that I have i915_cmd_parser_init_ring() pass the
error returned by init_hash_table() back to the caller, who passes it up,
etc. When I looked more closely at Chris' change, the wedge behavior only
happens if the return is -EIO, which it will not be in this case. The only
failure from init_hash_table() is -ENOMEM so if we specifically want the
wedging behavior (vs whatever would normally happen if ring init fails with
-ENOMEM) then we need to modify the return from i915_cmd_parser_init_ring()
to be -EIO. Probably also add a comment explaining why we do the modification.

I'm fine either way. The feedback earlier seemed to be that if we have an
allocation failure we probably can't do much else anyhow, hence just returning
the -ENOMEM.

Brad
 
 -Daniel
 -- 
 Daniel Vetter
 Software Engineer, Intel Corporation
 +41 (0) 79 365 57 48 - http://blog.ffwll.ch
___
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx


Re: [Intel-gfx] [PATCH] drm/i915: Use hash tables for the command parser

2014-05-12 Thread Daniel Vetter
On Mon, May 12, 2014 at 03:49:02PM +0100, Tvrtko Ursulin wrote:
 
 On 05/10/2014 10:10 PM, bradley.d.vol...@intel.com wrote:
 From: Brad Volkin bradley.d.vol...@intel.com
 
 For clients that submit large batch buffers the command parser has
 a substantial impact on performance. On my HSW ULT system performance
 drops as much as ~20% on some tests. Most of the time is spent in the
 command lookup code. Converting that from the current naive search to
 a hash table lookup reduces the performance drop to ~10%.
 
 The choice of value for I915_CMD_HASH_ORDER allows all commands
 currently used in the parser tables to hash to their own bucket (except
 for one collision on the render ring). The tradeoff is that it wastes
 memory. Because the opcodes for the commands in the tables are not
 particularly well distributed, reducing the order still leaves many
 buckets empty. The increased collisions don't seem to have a huge
 impact on the performance gain, but for now anyhow, the parser trades
 memory for performance.
 
 NB: Ville noticed that the error paths through the ring init code
 will leak memory. I've not addressed that here. We can do a follow
 up pass to handle all of the leaks.
 
 v2: improved comment describing selection of hash key mask (Damien)
 replace a BUG_ON() with an error return (Tvrtko, Ville)
 commit message improvements
 
 Signed-off-by: Brad Volkin bradley.d.vol...@intel.com
 
 Reviewed-by: Tvrtko Ursulin tvrtko.ursu...@intel.com

Nice.

Queued for -next, thanks for the patch.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch
___
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx


Re: [Intel-gfx] [PATCH] drm/i915: Use hash tables for the command parser

2014-05-12 Thread Daniel Vetter
On Mon, May 12, 2014 at 09:41:02AM -0700, Volkin, Bradley D wrote:
 On Mon, May 12, 2014 at 09:24:06AM -0700, Daniel Vetter wrote:
  On Thu, May 08, 2014 at 09:02:18AM -0700, Volkin, Bradley D wrote:
   On Thu, May 08, 2014 at 08:45:07AM -0700, Ville Syrjälä wrote:
On Thu, May 08, 2014 at 08:27:16AM -0700, Volkin, Bradley D wrote:
 On Thu, May 08, 2014 at 02:56:05AM -0700, Tvrtko Ursulin wrote:
  
  Hi Brad,
  
  On 04/28/2014 04:22 PM, bradley.d.vol...@intel.com wrote:
  [snip]
   - BUG_ON(!validate_cmds_sorted(ring));
   + BUG_ON(!validate_cmds_sorted(ring, cmd_tables, 
   cmd_table_count));
 BUG_ON(!validate_regs_sorted(ring));
   +
   + BUG_ON(init_hash_table(ring, cmd_tables, cmd_table_count));
  
  Is a BUG_ON a bit harsh since the above fails only on ENOMEM 
  condition?
  
  If the concern is not allowing any command execution if parser 
  setup has 
  failed, it would be nicer to the system as whole to just keep 
  rejecting 
  everything, but let the rest of the kernel live to enable debug or 
  whatever?
  
  I know it won't happen almost ever so it's a minor point really. I 
  just 
  dislike actively hosing the whole system if it is avoidable.
 
 Hi Tvrtko,
 
 I agree that a BUG_ON might be harsh here. I suppose we could log an
 error and disable the command parser. Most command buffers would
 still go through fine but HW parsing would reject some that the SW
 parser might otherwise allow. That could be a bit odd if we ever did
 get a failure - apps/functionality that worked the last time I booted
 suddenly don't this time. The issue would be in the log though.
 
 I don't have a strong preference on this. Whatever people prefer.

If the memory allocation fails there's probably not much point in
trying to limp along and continue the driver init. So just pass error
up and let the caller deal with it.

Looking at the error paths up from ring init, we probably leak a ton
of junk but at least the kernel should remain otherwise operational.
   
   Passing the error up is probably a good option. There was a recent change
   to mark the GPU as wedged if ring init fails, which seems like a 
   reasonable
   enough response to parser init failure. I'll have to look at the cleanup
   paths. The per-ring parser init can probably move around to avoid 
   excessive
   additional cleanup code on failure paths.
  
  With the latest ring init rework from Chris we should just fail the gem
  side of things and declare the gpu terminally wedged. But the modeset side
  should keep on working, which means users will be able to take a look at
  dmesg instead of a black screen (since that's what you get from BUG_ON in
  driver load often).
  
  Working modeset with broken gpu is infinitely better than a black screen,
  so no BUG_ON here please. At least if the error code isn't much more than
  a simply return, we always have to balance bugs in the error code with
  better abilities to limp on long enough for a useful bug report.
 
 Ok. I sent a v2 and in that I have i915_cmd_parser_init_ring() pass the
 error returned by init_hash_table() back to the caller, who passes it up,
 etc. When I looked more closely at Chris' change, the wedge behavior only
 happens if the return is -EIO, which it will not be in this case. The only
 failure from init_hash_table() is -ENOMEM so if we specifically want the
 wedging behavior (vs whatever would normally happen if ring init fails with
 -ENOMEM) then we need to modify the return from i915_cmd_parser_init_ring()
 to be -EIO. Probably also add a comment explaining why we do the modification.
 
 I'm fine either way. The feedback earlier seemed to be that if we have an
 allocation failure we probably can't do much else anyhow, hence just returning
 the -ENOMEM.

Oh right. Since -ENOMEM should really be extremely exceptional I guess the
current code is ok. I've mixed things up a bit ;-)

But generally if the hw goes bananas we should continue to limp alone, if
it makes sense.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch
___
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx


[Intel-gfx] [PATCH] drm/i915: Use hash tables for the command parser

2014-05-10 Thread bradley . d . volkin
From: Brad Volkin bradley.d.vol...@intel.com

For clients that submit large batch buffers the command parser has
a substantial impact on performance. On my HSW ULT system performance
drops as much as ~20% on some tests. Most of the time is spent in the
command lookup code. Converting that from the current naive search to
a hash table lookup reduces the performance drop to ~10%.

The choice of value for I915_CMD_HASH_ORDER allows all commands
currently used in the parser tables to hash to their own bucket (except
for one collision on the render ring). The tradeoff is that it wastes
memory. Because the opcodes for the commands in the tables are not
particularly well distributed, reducing the order still leaves many
buckets empty. The increased collisions don't seem to have a huge
impact on the performance gain, but for now anyhow, the parser trades
memory for performance.

NB: Ville noticed that the error paths through the ring init code
will leak memory. I've not addressed that here. We can do a follow
up pass to handle all of the leaks.

v2: improved comment describing selection of hash key mask (Damien)
replace a BUG_ON() with an error return (Tvrtko, Ville)
commit message improvements

Signed-off-by: Brad Volkin bradley.d.vol...@intel.com
---
 drivers/gpu/drm/i915/i915_cmd_parser.c  | 158 +---
 drivers/gpu/drm/i915/i915_drv.h |   3 +-
 drivers/gpu/drm/i915/intel_ringbuffer.c |   6 +-
 drivers/gpu/drm/i915/intel_ringbuffer.h |  11 ++-
 4 files changed, 140 insertions(+), 38 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_cmd_parser.c 
b/drivers/gpu/drm/i915/i915_cmd_parser.c
index 69d34e4..d3a5b74 100644
--- a/drivers/gpu/drm/i915/i915_cmd_parser.c
+++ b/drivers/gpu/drm/i915/i915_cmd_parser.c
@@ -498,16 +498,18 @@ static u32 gen7_blt_get_cmd_length_mask(u32 cmd_header)
return 0;
 }
 
-static bool validate_cmds_sorted(struct intel_ring_buffer *ring)
+static bool validate_cmds_sorted(struct intel_ring_buffer *ring,
+const struct drm_i915_cmd_table *cmd_tables,
+int cmd_table_count)
 {
int i;
bool ret = true;
 
-   if (!ring-cmd_tables || ring-cmd_table_count == 0)
+   if (!cmd_tables || cmd_table_count == 0)
return true;
 
-   for (i = 0; i  ring-cmd_table_count; i++) {
-   const struct drm_i915_cmd_table *table = ring-cmd_tables[i];
+   for (i = 0; i  cmd_table_count; i++) {
+   const struct drm_i915_cmd_table *table = cmd_tables[i];
u32 previous = 0;
int j;
 
@@ -557,6 +559,68 @@ static bool validate_regs_sorted(struct intel_ring_buffer 
*ring)
 ring-master_reg_count);
 }
 
+struct cmd_node {
+   const struct drm_i915_cmd_descriptor *desc;
+   struct hlist_node node;
+};
+
+/*
+ * Different command ranges have different numbers of bits for the opcode. For
+ * example, MI commands use bits 31:23 while 3D commands use bits 31:16. The
+ * problem is that, for example, MI commands use bits 22:16 for other fields
+ * such as GGTT vs PPGTT bits. If we include those bits in the mask then when
+ * we mask a command from a batch it could hash to the wrong bucket due to
+ * non-opcode bits being set. But if we don't include those bits, some 3D
+ * commands may hash to the same bucket due to not including opcode bits that
+ * make the command unique. For now, we will risk hashing to the same bucket.
+ *
+ * If we attempt to generate a perfect hash, we should be able to look at bits
+ * 31:29 of a command from a batch buffer and use the full mask for that
+ * client. The existing INSTR_CLIENT_MASK/SHIFT defines can be used for this.
+ */
+#define CMD_HASH_MASK STD_MI_OPCODE_MASK
+
+static int init_hash_table(struct intel_ring_buffer *ring,
+  const struct drm_i915_cmd_table *cmd_tables,
+  int cmd_table_count)
+{
+   int i, j;
+
+   hash_init(ring-cmd_hash);
+
+   for (i = 0; i  cmd_table_count; i++) {
+   const struct drm_i915_cmd_table *table = cmd_tables[i];
+
+   for (j = 0; j  table-count; j++) {
+   const struct drm_i915_cmd_descriptor *desc =
+   table-table[j];
+   struct cmd_node *desc_node =
+   kmalloc(sizeof(*desc_node), GFP_KERNEL);
+
+   if (!desc_node)
+   return -ENOMEM;
+
+   desc_node-desc = desc;
+   hash_add(ring-cmd_hash, desc_node-node,
+desc-cmd.value  CMD_HASH_MASK);
+   }
+   }
+
+   return 0;
+}
+
+static void fini_hash_table(struct intel_ring_buffer *ring)
+{
+   struct hlist_node *tmp;
+   struct cmd_node *desc_node;
+   int i;
+
+   hash_for_each_safe(ring-cmd_hash, i, tmp, desc_node, node) {
+   

Re: [Intel-gfx] [PATCH] drm/i915: Use hash tables for the command parser

2014-05-08 Thread Tvrtko Ursulin


Hi Brad,

On 04/28/2014 04:22 PM, bradley.d.vol...@intel.com wrote:
[snip]

-   BUG_ON(!validate_cmds_sorted(ring));
+   BUG_ON(!validate_cmds_sorted(ring, cmd_tables, cmd_table_count));
BUG_ON(!validate_regs_sorted(ring));
+
+   BUG_ON(init_hash_table(ring, cmd_tables, cmd_table_count));


Is a BUG_ON a bit harsh since the above fails only on ENOMEM condition?

If the concern is not allowing any command execution if parser setup has 
failed, it would be nicer to the system as whole to just keep rejecting 
everything, but let the rest of the kernel live to enable debug or whatever?


I know it won't happen almost ever so it's a minor point really. I just 
dislike actively hosing the whole system if it is avoidable.


Regards,

Tvrtko
___
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx


Re: [Intel-gfx] [PATCH] drm/i915: Use hash tables for the command parser

2014-05-08 Thread Damien Lespiau
On Thu, May 08, 2014 at 10:56:05AM +0100, Tvrtko Ursulin wrote:
 
 Hi Brad,
 
 On 04/28/2014 04:22 PM, bradley.d.vol...@intel.com wrote:
 [snip]
 -BUG_ON(!validate_cmds_sorted(ring));
 +BUG_ON(!validate_cmds_sorted(ring, cmd_tables, cmd_table_count));
  BUG_ON(!validate_regs_sorted(ring));
 +
 +BUG_ON(init_hash_table(ring, cmd_tables, cmd_table_count));
 
 Is a BUG_ON a bit harsh since the above fails only on ENOMEM condition?
 
 If the concern is not allowing any command execution if parser setup
 has failed, it would be nicer to the system as whole to just keep
 rejecting everything, but let the rest of the kernel live to enable
 debug or whatever?

Those number_of_cmds allocations are a bit awkward though, couldn't we
just embed the hlist_node into the desciptor struct?

I was hoping we could compute a (near) minimal perfect hash function
though. Let me try to dig a bit.

-- 
Damien
___
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx


Re: [Intel-gfx] [PATCH] drm/i915: Use hash tables for the command parser

2014-05-08 Thread Tvrtko Ursulin


On 05/08/2014 12:44 PM, Damien Lespiau wrote:

On Thu, May 08, 2014 at 10:56:05AM +0100, Tvrtko Ursulin wrote:


Hi Brad,

On 04/28/2014 04:22 PM, bradley.d.vol...@intel.com wrote:
[snip]

-   BUG_ON(!validate_cmds_sorted(ring));
+   BUG_ON(!validate_cmds_sorted(ring, cmd_tables, cmd_table_count));
BUG_ON(!validate_regs_sorted(ring));
+
+   BUG_ON(init_hash_table(ring, cmd_tables, cmd_table_count));


Is a BUG_ON a bit harsh since the above fails only on ENOMEM condition?

If the concern is not allowing any command execution if parser setup
has failed, it would be nicer to the system as whole to just keep
rejecting everything, but let the rest of the kernel live to enable
debug or whatever?


Those number_of_cmds allocations are a bit awkward though, couldn't we
just embed the hlist_node into the desciptor struct?


Until Brad comes online, I think that is because command descriptors to 
hash table entries are not 1-to-1.


Tvrtko
___
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx


Re: [Intel-gfx] [PATCH] drm/i915: Use hash tables for the command parser

2014-05-08 Thread Damien Lespiau
On Thu, May 08, 2014 at 01:25:33PM +0100, Tvrtko Ursulin wrote:
 
 On 05/08/2014 12:44 PM, Damien Lespiau wrote:
 On Thu, May 08, 2014 at 10:56:05AM +0100, Tvrtko Ursulin wrote:
 
 Hi Brad,
 
 On 04/28/2014 04:22 PM, bradley.d.vol...@intel.com wrote:
 [snip]
 -  BUG_ON(!validate_cmds_sorted(ring));
 +  BUG_ON(!validate_cmds_sorted(ring, cmd_tables, cmd_table_count));
BUG_ON(!validate_regs_sorted(ring));
 +
 +  BUG_ON(init_hash_table(ring, cmd_tables, cmd_table_count));
 
 Is a BUG_ON a bit harsh since the above fails only on ENOMEM condition?
 
 If the concern is not allowing any command execution if parser setup
 has failed, it would be nicer to the system as whole to just keep
 rejecting everything, but let the rest of the kernel live to enable
 debug or whatever?
 
 Those number_of_cmds allocations are a bit awkward though, couldn't we
 just embed the hlist_node into the desciptor struct?
 
 Until Brad comes online, I think that is because command descriptors
 to hash table entries are not 1-to-1.

Ah, I guess the common cmds are part of several hash tables. We could at
least turn the multiple allocations into one big allocation though.

-- 
Damien
___
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx


Re: [Intel-gfx] [PATCH] drm/i915: Use hash tables for the command parser

2014-05-08 Thread Damien Lespiau
On Mon, Apr 28, 2014 at 08:22:08AM -0700, bradley.d.vol...@intel.com wrote:
 From: Brad Volkin bradley.d.vol...@intel.com
 +/*
 + * Different command ranges have different numbers of bits for the opcode.
 + * In order to use the opcode bits, and only the opcode bits, for the hash 
 key
 + * we should use the MI_* command opcode mask (since those commands use the
 + * fewest bits for the opcode.)
 + */
 +#define CMD_HASH_MASK STD_MI_OPCODE_MASK

This is not very convicing (but could well be correct).

#define STD_MI_OPCODE_MASK  0xFF80
#define STD_3D_OPCODE_MASK  0x

So it only works if the 3D opcodes have the top 9 bits all distinct?

-- 
Damien
___
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx


Re: [Intel-gfx] [PATCH] drm/i915: Use hash tables for the command parser

2014-05-08 Thread Damien Lespiau
On Thu, May 08, 2014 at 02:05:07PM +0100, Damien Lespiau wrote:
 On Mon, Apr 28, 2014 at 08:22:08AM -0700, bradley.d.vol...@intel.com wrote:
  From: Brad Volkin bradley.d.vol...@intel.com
  +/*
  + * Different command ranges have different numbers of bits for the opcode.
  + * In order to use the opcode bits, and only the opcode bits, for the hash 
  key
  + * we should use the MI_* command opcode mask (since those commands use the
  + * fewest bits for the opcode.)
  + */
  +#define CMD_HASH_MASK STD_MI_OPCODE_MASK
 
 This is not very convicing (but could well be correct).
 
 #define STD_MI_OPCODE_MASK  0xFF80
 #define STD_3D_OPCODE_MASK  0x
 
 So it only works if the 3D opcodes have the top 9 bits all distinct?

To expand on that, with the attached program:

$ ./minimal-hash-hsw-render  | wc -l
44

$ ./minimal-hash-hsw-render  | sort -u | wc -l
37

-- 
Damien
#include stdint.h
#include stdio.h

#define ARRAY_SIZE(x) (sizeof(x) / sizeof(*(x)))

#define MI_INSTR(opcode, flags) (((opcode)  23) | (flags))

#define MI_NOOP MI_INSTR(0, 0)
#define MI_USER_INTERRUPT   MI_INSTR(0x02, 0)
#define MI_WAIT_FOR_EVENT   MI_INSTR(0x03, 0)
#define MI_ARB_CHECKMI_INSTR(0x05, 0)
#define MI_REPORT_HEAD  MI_INSTR(0x07, 0)
#define MI_ARB_ON_OFF   MI_INSTR(0x08, 0)
#define MI_BATCH_BUFFER_END MI_INSTR(0x0a, 0)
#define MI_SUSPEND_FLUSHMI_INSTR(0x0b, 0)
#define MI_OVERLAY_FLIP MI_INSTR(0x11, 0)
#define MI_SET_PREDICATEMI_INSTR(0x01, 0)
#define MI_ARB_CHECKMI_INSTR(0x05, 0)
#define MI_RS_CONTROL   MI_INSTR(0x06, 0)
#define MI_URB_ATOMIC_ALLOC MI_INSTR(0x09, 0)
#define MI_PREDICATEMI_INSTR(0x0C, 0)
#define MI_RS_CONTEXT   MI_INSTR(0x0F, 0)
#define MI_TOPOLOGY_FILTER  MI_INSTR(0x0D, 0)
#define MI_LOAD_SCAN_LINES_EXCL MI_INSTR(0x13, 0)
#define MI_URB_CLEARMI_INSTR(0x19, 0)
#define MI_UPDATE_GTT   MI_INSTR(0x23, 0)
#define MI_CLFLUSH  MI_INSTR(0x27, 0)
#define MI_REPORT_PERF_COUNTMI_INSTR(0x28, 0)
#define MI_LOAD_REGISTER_MEMMI_INSTR(0x29, 0)
#define MI_LOAD_REGISTER_REGMI_INSTR(0x2A, 0)
#define MI_RS_STORE_DATA_IMMMI_INSTR(0x2B, 0)
#define MI_LOAD_URB_MEM MI_INSTR(0x2C, 0)
#define MI_STORE_URB_MEMMI_INSTR(0x2D, 0)
#define MI_CONDITIONAL_BATCH_BUFFER_END MI_INSTR(0x36, 0)
#define MI_SEMAPHORE_MBOX   MI_INSTR(0x16, 1) /* gen6+ */
#define MI_STORE_DWORD_IMM  MI_INSTR(0x20, 1)
#define MI_STORE_DWORD_INDEXMI_INSTR(0x21, 1)
#define MI_SET_CONTEXT  MI_INSTR(0x18, 0)
#define MI_LOAD_REGISTER_IMM(x) MI_INSTR(0x22, 2*(x)-1)
#define MI_STORE_REGISTER_MEM(x) MI_INSTR(0x24, 2*(x)-1)
#define MI_BATCH_BUFFER_START   MI_INSTR(0x31, 0)
#define MI_FLUSHMI_INSTR(0x04, 0)
#define MI_DISPLAY_FLIP MI_INSTR(0x14, 2)
#define MI_LOAD_SCAN_LINES_INCL MI_INSTR(0x12, 0)

#define GFX_OP_PIPE_CONTROL(len)((0x329)|(0x327)|(0x224)|(len-2))
#define PIPELINE_SELECT((0x329)|(0x127)|(0x124)|(0x416))
#define GFX_OP_3DSTATE_VF_STATISTICS   ((0x329)|(0x127)|(0x024)|(0xB16))
#define MEDIA_VFE_STATE((0x329)|(0x227)|(0x024)|(0x016))
#define  MEDIA_VFE_STATE_MMIO_ACCESS_MASK (0x18)
#define GPGPU_OBJECT   ((0x329)|(0x227)|(0x124)|(0x416))
#define GPGPU_WALKER   ((0x329)|(0x227)|(0x124)|(0x516))
#define GFX_OP_3DSTATE_DX9_CONSTANTF_VS \
((0x329)|(0x327)|(0x024)|(0x3916))
#define GFX_OP_3DSTATE_DX9_CONSTANTF_PS \
((0x329)|(0x327)|(0x024)|(0x3A16))
#define GFX_OP_3DSTATE_SO_DECL_LIST \
((0x329)|(0x327)|(0x124)|(0x1716))

#define GFX_OP_3DSTATE_BINDING_TABLE_EDIT_VS \
((0x329)|(0x327)|(0x024)|(0x4316))
#define GFX_OP_3DSTATE_BINDING_TABLE_EDIT_GS \
((0x329)|(0x327)|(0x024)|(0x4416))
#define GFX_OP_3DSTATE_BINDING_TABLE_EDIT_HS \
((0x329)|(0x327)|(0x024)|(0x4516))
#define GFX_OP_3DSTATE_BINDING_TABLE_EDIT_DS \
((0x329)|(0x327)|(0x024)|(0x4616))
#define GFX_OP_3DSTATE_BINDING_TABLE_EDIT_PS \
((0x329)|(0x327)|(0x024)|(0x4716))

#define MFX_WAIT  ((0x329)|(0x127)|(0x016))

#define COLOR_BLT ((0x229)|(0x4022))
#define SRC_COPY_BLT  ((0x229)|(0x4322))

uint32_t opcodes[] = {
MI_NOOP,
MI_USER_INTERRUPT,
MI_WAIT_FOR_EVENT,
MI_ARB_CHECK,
MI_REPORT_HEAD,
MI_SUSPEND_FLUSH,
MI_SEMAPHORE_MBOX,
MI_STORE_DWORD_INDEX,
MI_LOAD_REGISTER_IMM(1),
MI_STORE_REGISTER_MEM(1),
MI_LOAD_REGISTER_MEM,
MI_BATCH_BUFFER_START,
MI_FLUSH,
MI_ARB_ON_OFF,
MI_PREDICATE,
MI_TOPOLOGY_FILTER,
MI_DISPLAY_FLIP,
MI_SET_CONTEXT,
MI_URB_CLEAR,
MI_STORE_DWORD_IMM,
MI_UPDATE_GTT,
MI_CLFLUSH,
MI_REPORT_PERF_COUNT,
MI_CONDITIONAL_BATCH_BUFFER_END,
GFX_OP_3DSTATE_VF_STATISTICS,
PIPELINE_SELECT,
MEDIA_VFE_STATE,
GPGPU_OBJECT,
  

Re: [Intel-gfx] [PATCH] drm/i915: Use hash tables for the command parser

2014-05-08 Thread Tvrtko Ursulin


On 05/08/2014 02:02 PM, Damien Lespiau wrote:

On Thu, May 08, 2014 at 01:25:33PM +0100, Tvrtko Ursulin wrote:


On 05/08/2014 12:44 PM, Damien Lespiau wrote:

On Thu, May 08, 2014 at 10:56:05AM +0100, Tvrtko Ursulin wrote:


Hi Brad,

On 04/28/2014 04:22 PM, bradley.d.vol...@intel.com wrote:
[snip]

-   BUG_ON(!validate_cmds_sorted(ring));
+   BUG_ON(!validate_cmds_sorted(ring, cmd_tables, cmd_table_count));
BUG_ON(!validate_regs_sorted(ring));
+
+   BUG_ON(init_hash_table(ring, cmd_tables, cmd_table_count));


Is a BUG_ON a bit harsh since the above fails only on ENOMEM condition?

If the concern is not allowing any command execution if parser setup
has failed, it would be nicer to the system as whole to just keep
rejecting everything, but let the rest of the kernel live to enable
debug or whatever?


Those number_of_cmds allocations are a bit awkward though, couldn't we
just embed the hlist_node into the desciptor struct?


Until Brad comes online, I think that is because command descriptors
to hash table entries are not 1-to-1.


Ah, I guess the common cmds are part of several hash tables. We could at
least turn the multiple allocations into one big allocation though.


Probably a problem since commands can be added dynamically.

Also from the memory use point of view, single allocations are 12/24 
bytes so fit into 16/32 byte slabs. That's some wastage on the face of 
it, but bunching them up would make... say common plus render commands 
are ~30 in total, so 360/720 bytes. Which would waste 152/304 bytes (512 
and 1024 slabs) vs. 120/240 bytes for individual allocations. So for 30 
command array current approach is even better. :)


Maybe make a dedicated cache for struct cmd_node?

Tvrtko
___
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx


Re: [Intel-gfx] [PATCH] drm/i915: Use hash tables for the command parser

2014-05-08 Thread Tvrtko Ursulin


On 04/28/2014 04:22 PM, bradley.d.vol...@intel.com wrote:

From: Brad Volkin bradley.d.vol...@intel.com

For clients that submit large batch buffers the command parser has
a substantial impact on performance. On my HSW ULT system performance
drops as much as ~20% on some tests. Most of the time is spent in the
command lookup code. Converting that from the current naive search to
a hash table lookup reduces the performance impact by as much as ~10%.


I have another question - what about register lookups, do they come up 
in profiling as significant?


Regards,

Tvrtko
___
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx


Re: [Intel-gfx] [PATCH] drm/i915: Use hash tables for the command parser

2014-05-08 Thread Damien Lespiau
On Thu, May 08, 2014 at 12:44:57PM +0100, Damien Lespiau wrote:
 I was hoping we could compute a (near) minimal perfect hash function
 though. Let me try to dig a bit.

So, I went a bit further here and we can actually generate a minimal
perfect hash function. I took the 44 HSW render opcodes and fed them
into:

  http://burtleburtle.net/bob/hash/perfect.html
  https://github.com/driedfruit/jenkins-minimal-perfect-hash

The resulting hash function is:

static uint8_t mph_tab[] = {
0,0,10,0,51,0,34,0,12,22,0,0,28,0,9,0,
9,0,19,0,0,0,11,0,10,61,20,50,49,60,45,55,
};

static uint32_t mph_hash(uint32_t val)
{
uint32_t a, b, rsl;

val += 0xa195a44b;
val += (val  8);
val ^= (val  4);
b = (val  18)  0x1f;
a = val  27;
rsl = (a^mph_tab[b]);

return rsl;
}

When used the 44 HSW opcodes (opcode  0x) it produces distinct numbers
(perfect hash) from 0 to 43 (minimal) that could then be used to address an 
array.

You can play with the joined test:

$ ./minimal-hash-hsw-render  | sort -nk2

So now, how useful is that is an interesting question. We do have static tables
upstream (IIUC), but Tvrtko reminded me that it may not match other trees,
which is sad.

-- 
Damien
#include stdint.h
#include stdio.h

#define ARRAY_SIZE(x) (sizeof(x) / sizeof(*(x)))

#define MI_INSTR(opcode, flags) (((opcode)  23) | (flags))

#define MI_NOOP MI_INSTR(0, 0)
#define MI_USER_INTERRUPT   MI_INSTR(0x02, 0)
#define MI_WAIT_FOR_EVENT   MI_INSTR(0x03, 0)
#define MI_ARB_CHECKMI_INSTR(0x05, 0)
#define MI_REPORT_HEAD  MI_INSTR(0x07, 0)
#define MI_ARB_ON_OFF   MI_INSTR(0x08, 0)
#define MI_BATCH_BUFFER_END MI_INSTR(0x0a, 0)
#define MI_SUSPEND_FLUSHMI_INSTR(0x0b, 0)
#define MI_OVERLAY_FLIP MI_INSTR(0x11, 0)
#define MI_SET_PREDICATEMI_INSTR(0x01, 0)
#define MI_ARB_CHECKMI_INSTR(0x05, 0)
#define MI_RS_CONTROL   MI_INSTR(0x06, 0)
#define MI_URB_ATOMIC_ALLOC MI_INSTR(0x09, 0)
#define MI_PREDICATEMI_INSTR(0x0C, 0)
#define MI_RS_CONTEXT   MI_INSTR(0x0F, 0)
#define MI_TOPOLOGY_FILTER  MI_INSTR(0x0D, 0)
#define MI_LOAD_SCAN_LINES_EXCL MI_INSTR(0x13, 0)
#define MI_URB_CLEARMI_INSTR(0x19, 0)
#define MI_UPDATE_GTT   MI_INSTR(0x23, 0)
#define MI_CLFLUSH  MI_INSTR(0x27, 0)
#define MI_REPORT_PERF_COUNTMI_INSTR(0x28, 0)
#define MI_LOAD_REGISTER_MEMMI_INSTR(0x29, 0)
#define MI_LOAD_REGISTER_REGMI_INSTR(0x2A, 0)
#define MI_RS_STORE_DATA_IMMMI_INSTR(0x2B, 0)
#define MI_LOAD_URB_MEM MI_INSTR(0x2C, 0)
#define MI_STORE_URB_MEMMI_INSTR(0x2D, 0)
#define MI_CONDITIONAL_BATCH_BUFFER_END MI_INSTR(0x36, 0)
#define MI_SEMAPHORE_MBOX   MI_INSTR(0x16, 1) /* gen6+ */
#define MI_STORE_DWORD_IMM  MI_INSTR(0x20, 1)
#define MI_STORE_DWORD_INDEXMI_INSTR(0x21, 1)
#define MI_SET_CONTEXT  MI_INSTR(0x18, 0)
#define MI_LOAD_REGISTER_IMM(x) MI_INSTR(0x22, 2*(x)-1)
#define MI_STORE_REGISTER_MEM(x) MI_INSTR(0x24, 2*(x)-1)
#define MI_BATCH_BUFFER_START   MI_INSTR(0x31, 0)
#define MI_FLUSHMI_INSTR(0x04, 0)
#define MI_DISPLAY_FLIP MI_INSTR(0x14, 2)
#define MI_LOAD_SCAN_LINES_INCL MI_INSTR(0x12, 0)

#define GFX_OP_PIPE_CONTROL(len)((0x329)|(0x327)|(0x224)|(len-2))
#define PIPELINE_SELECT((0x329)|(0x127)|(0x124)|(0x416))
#define GFX_OP_3DSTATE_VF_STATISTICS   ((0x329)|(0x127)|(0x024)|(0xB16))
#define MEDIA_VFE_STATE((0x329)|(0x227)|(0x024)|(0x016))
#define  MEDIA_VFE_STATE_MMIO_ACCESS_MASK (0x18)
#define GPGPU_OBJECT   ((0x329)|(0x227)|(0x124)|(0x416))
#define GPGPU_WALKER   ((0x329)|(0x227)|(0x124)|(0x516))
#define GFX_OP_3DSTATE_DX9_CONSTANTF_VS \
((0x329)|(0x327)|(0x024)|(0x3916))
#define GFX_OP_3DSTATE_DX9_CONSTANTF_PS \
((0x329)|(0x327)|(0x024)|(0x3A16))
#define GFX_OP_3DSTATE_SO_DECL_LIST \
((0x329)|(0x327)|(0x124)|(0x1716))

#define GFX_OP_3DSTATE_BINDING_TABLE_EDIT_VS \
((0x329)|(0x327)|(0x024)|(0x4316))
#define GFX_OP_3DSTATE_BINDING_TABLE_EDIT_GS \
((0x329)|(0x327)|(0x024)|(0x4416))
#define GFX_OP_3DSTATE_BINDING_TABLE_EDIT_HS \
((0x329)|(0x327)|(0x024)|(0x4516))
#define GFX_OP_3DSTATE_BINDING_TABLE_EDIT_DS \
((0x329)|(0x327)|(0x024)|(0x4616))
#define GFX_OP_3DSTATE_BINDING_TABLE_EDIT_PS \
((0x329)|(0x327)|(0x024)|(0x4716))

#define MFX_WAIT  ((0x329)|(0x127)|(0x016))

#define COLOR_BLT ((0x229)|(0x4022))
#define SRC_COPY_BLT  ((0x229)|(0x4322))

uint32_t opcodes[] = {
MI_NOOP,
MI_USER_INTERRUPT,
MI_WAIT_FOR_EVENT,
MI_ARB_CHECK,
MI_REPORT_HEAD,
MI_SUSPEND_FLUSH,
MI_SEMAPHORE_MBOX,
MI_STORE_DWORD_INDEX,
MI_LOAD_REGISTER_IMM(1),
MI_STORE_REGISTER_MEM(1),
MI_LOAD_REGISTER_MEM,
MI_BATCH_BUFFER_START,
MI_FLUSH,

Re: [Intel-gfx] [PATCH] drm/i915: Use hash tables for the command parser

2014-05-08 Thread Volkin, Bradley D
On Thu, May 08, 2014 at 02:56:05AM -0700, Tvrtko Ursulin wrote:
 
 Hi Brad,
 
 On 04/28/2014 04:22 PM, bradley.d.vol...@intel.com wrote:
 [snip]
  -   BUG_ON(!validate_cmds_sorted(ring));
  +   BUG_ON(!validate_cmds_sorted(ring, cmd_tables, cmd_table_count));
  BUG_ON(!validate_regs_sorted(ring));
  +
  +   BUG_ON(init_hash_table(ring, cmd_tables, cmd_table_count));
 
 Is a BUG_ON a bit harsh since the above fails only on ENOMEM condition?
 
 If the concern is not allowing any command execution if parser setup has 
 failed, it would be nicer to the system as whole to just keep rejecting 
 everything, but let the rest of the kernel live to enable debug or whatever?
 
 I know it won't happen almost ever so it's a minor point really. I just 
 dislike actively hosing the whole system if it is avoidable.

Hi Tvrtko,

I agree that a BUG_ON might be harsh here. I suppose we could log an
error and disable the command parser. Most command buffers would
still go through fine but HW parsing would reject some that the SW
parser might otherwise allow. That could be a bit odd if we ever did
get a failure - apps/functionality that worked the last time I booted
suddenly don't this time. The issue would be in the log though.

I don't have a strong preference on this. Whatever people prefer.

Thanks,
Brad

 
 Regards,
 
 Tvrtko
___
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx


Re: [Intel-gfx] [PATCH] drm/i915: Use hash tables for the command parser

2014-05-08 Thread Volkin, Bradley D
On Thu, May 08, 2014 at 06:42:16AM -0700, Tvrtko Ursulin wrote:
 
 On 04/28/2014 04:22 PM, bradley.d.vol...@intel.com wrote:
  From: Brad Volkin bradley.d.vol...@intel.com
 
  For clients that submit large batch buffers the command parser has
  a substantial impact on performance. On my HSW ULT system performance
  drops as much as ~20% on some tests. Most of the time is spent in the
  command lookup code. Converting that from the current naive search to
  a hash table lookup reduces the performance impact by as much as ~10%.
 
 I have another question - what about register lookups, do they come up 
 in profiling as significant?

So far no. The command lookup happens for every command in every batch.
The register lookup only happens for certain commands. I think most
batches don't include such commands and those that do may only have a
few instances. Ken might have a better sense of the fequency.

Brad

 
 Regards,
 
 Tvrtko
___
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx


Re: [Intel-gfx] [PATCH] drm/i915: Use hash tables for the command parser

2014-05-08 Thread Ville Syrjälä
On Thu, May 08, 2014 at 08:27:16AM -0700, Volkin, Bradley D wrote:
 On Thu, May 08, 2014 at 02:56:05AM -0700, Tvrtko Ursulin wrote:
  
  Hi Brad,
  
  On 04/28/2014 04:22 PM, bradley.d.vol...@intel.com wrote:
  [snip]
   - BUG_ON(!validate_cmds_sorted(ring));
   + BUG_ON(!validate_cmds_sorted(ring, cmd_tables, cmd_table_count));
 BUG_ON(!validate_regs_sorted(ring));
   +
   + BUG_ON(init_hash_table(ring, cmd_tables, cmd_table_count));
  
  Is a BUG_ON a bit harsh since the above fails only on ENOMEM condition?
  
  If the concern is not allowing any command execution if parser setup has 
  failed, it would be nicer to the system as whole to just keep rejecting 
  everything, but let the rest of the kernel live to enable debug or whatever?
  
  I know it won't happen almost ever so it's a minor point really. I just 
  dislike actively hosing the whole system if it is avoidable.
 
 Hi Tvrtko,
 
 I agree that a BUG_ON might be harsh here. I suppose we could log an
 error and disable the command parser. Most command buffers would
 still go through fine but HW parsing would reject some that the SW
 parser might otherwise allow. That could be a bit odd if we ever did
 get a failure - apps/functionality that worked the last time I booted
 suddenly don't this time. The issue would be in the log though.
 
 I don't have a strong preference on this. Whatever people prefer.

If the memory allocation fails there's probably not much point in
trying to limp along and continue the driver init. So just pass error
up and let the caller deal with it.

Looking at the error paths up from ring init, we probably leak a ton
of junk but at least the kernel should remain otherwise operational.

-- 
Ville Syrjälä
Intel OTC
___
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx


Re: [Intel-gfx] [PATCH] drm/i915: Use hash tables for the command parser

2014-05-08 Thread Tvrtko Ursulin


On 05/08/2014 04:27 PM, Volkin, Bradley D wrote:

On Thu, May 08, 2014 at 02:56:05AM -0700, Tvrtko Ursulin wrote:


Hi Brad,

On 04/28/2014 04:22 PM, bradley.d.vol...@intel.com wrote:
[snip]

-   BUG_ON(!validate_cmds_sorted(ring));
+   BUG_ON(!validate_cmds_sorted(ring, cmd_tables, cmd_table_count));
BUG_ON(!validate_regs_sorted(ring));
+
+   BUG_ON(init_hash_table(ring, cmd_tables, cmd_table_count));


Is a BUG_ON a bit harsh since the above fails only on ENOMEM condition?

If the concern is not allowing any command execution if parser setup has
failed, it would be nicer to the system as whole to just keep rejecting
everything, but let the rest of the kernel live to enable debug or whatever?

I know it won't happen almost ever so it's a minor point really. I just
dislike actively hosing the whole system if it is avoidable.


Hi Tvrtko,

I agree that a BUG_ON might be harsh here. I suppose we could log an
error and disable the command parser. Most command buffers would
still go through fine but HW parsing would reject some that the SW
parser might otherwise allow. That could be a bit odd if we ever did
get a failure - apps/functionality that worked the last time I booted
suddenly don't this time. The issue would be in the log though.


That would indeed be weird.

So command parser is just a more permissive whitelist, a super set of 
hardware parser?



I don't have a strong preference on this. Whatever people prefer.


Initially I was thinking of just rejecting all batch buffers if such 
fundamental initialisation fails. It beats BUG_ON in the sense that it 
leaves the system alive and it doesn't suffer from the random subtle 
behaviour like if it would just disable the command parser.


It is a bit theoretical because it is likely system won't be too healthy 
anyway if such a tiny allocation fails but you never know. It's better 
not to crash it if we don't have to.


Regards,

Tvrtko
___
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx


Re: [Intel-gfx] [PATCH] drm/i915: Use hash tables for the command parser

2014-05-08 Thread Volkin, Bradley D
On Thu, May 08, 2014 at 06:15:44AM -0700, Lespiau, Damien wrote:
 On Thu, May 08, 2014 at 02:05:07PM +0100, Damien Lespiau wrote:
  On Mon, Apr 28, 2014 at 08:22:08AM -0700, bradley.d.vol...@intel.com wrote:
   From: Brad Volkin bradley.d.vol...@intel.com
   +/*
   + * Different command ranges have different numbers of bits for the 
   opcode.
   + * In order to use the opcode bits, and only the opcode bits, for the 
   hash key
   + * we should use the MI_* command opcode mask (since those commands use 
   the
   + * fewest bits for the opcode.)
   + */
   +#define CMD_HASH_MASK STD_MI_OPCODE_MASK
  
  This is not very convicing (but could well be correct).
  
  #define STD_MI_OPCODE_MASK  0xFF80
  #define STD_3D_OPCODE_MASK  0x
  
  So it only works if the 3D opcodes have the top 9 bits all distinct?
 
 To expand on that, with the attached program:
 
 $ ./minimal-hash-hsw-render  | wc -l
 44
 
 $ ./minimal-hash-hsw-render  | sort -u | wc -l
 37

Yes, as it's currently written, some commands may hash to the same
bucket. The per-bucket search will use the full mask from the cmd
descriptor to get an exact match.

The issue is that, for example, MI commands in a batch may use bits
22:16 for something other than the opcode (e.g. GGTT vs PPGTT). If we
mask a command from a batch with 0x then an MI command may hash
to the wrong bucket. If we want a perfect hash then I suppose we should
look at bits 31:29 and mask with the exact STD_xx_OPCODE_MASK for the
client. The existing INSTR_CLIENT_MASK/SHIFT defines could be reused
for that.

Brad

 
 -- 
 Damien

 #include stdint.h
 #include stdio.h
 
 #define ARRAY_SIZE(x) (sizeof(x) / sizeof(*(x)))
 
 #define MI_INSTR(opcode, flags) (((opcode)  23) | (flags))
 
 #define MI_NOOP   MI_INSTR(0, 0)
 #define MI_USER_INTERRUPT MI_INSTR(0x02, 0)
 #define MI_WAIT_FOR_EVENT   MI_INSTR(0x03, 0)
 #define MI_ARB_CHECKMI_INSTR(0x05, 0)
 #define MI_REPORT_HEADMI_INSTR(0x07, 0)
 #define MI_ARB_ON_OFF MI_INSTR(0x08, 0)
 #define MI_BATCH_BUFFER_END   MI_INSTR(0x0a, 0)
 #define MI_SUSPEND_FLUSH  MI_INSTR(0x0b, 0)
 #define MI_OVERLAY_FLIP   MI_INSTR(0x11, 0)
 #define MI_SET_PREDICATEMI_INSTR(0x01, 0)
 #define MI_ARB_CHECKMI_INSTR(0x05, 0)
 #define MI_RS_CONTROL   MI_INSTR(0x06, 0)
 #define MI_URB_ATOMIC_ALLOC MI_INSTR(0x09, 0)
 #define MI_PREDICATEMI_INSTR(0x0C, 0)
 #define MI_RS_CONTEXT   MI_INSTR(0x0F, 0)
 #define MI_TOPOLOGY_FILTER  MI_INSTR(0x0D, 0)
 #define MI_LOAD_SCAN_LINES_EXCL MI_INSTR(0x13, 0)
 #define MI_URB_CLEARMI_INSTR(0x19, 0)
 #define MI_UPDATE_GTT   MI_INSTR(0x23, 0)
 #define MI_CLFLUSH  MI_INSTR(0x27, 0)
 #define MI_REPORT_PERF_COUNTMI_INSTR(0x28, 0)
 #define MI_LOAD_REGISTER_MEMMI_INSTR(0x29, 0)
 #define MI_LOAD_REGISTER_REGMI_INSTR(0x2A, 0)
 #define MI_RS_STORE_DATA_IMMMI_INSTR(0x2B, 0)
 #define MI_LOAD_URB_MEM MI_INSTR(0x2C, 0)
 #define MI_STORE_URB_MEMMI_INSTR(0x2D, 0)
 #define MI_CONDITIONAL_BATCH_BUFFER_END MI_INSTR(0x36, 0)
 #define MI_SEMAPHORE_MBOX MI_INSTR(0x16, 1) /* gen6+ */
 #define MI_STORE_DWORD_IMMMI_INSTR(0x20, 1)
 #define MI_STORE_DWORD_INDEX  MI_INSTR(0x21, 1)
 #define MI_SET_CONTEXTMI_INSTR(0x18, 0)
 #define MI_LOAD_REGISTER_IMM(x)   MI_INSTR(0x22, 2*(x)-1)
 #define MI_STORE_REGISTER_MEM(x) MI_INSTR(0x24, 2*(x)-1)
 #define MI_BATCH_BUFFER_START MI_INSTR(0x31, 0)
 #define MI_FLUSH  MI_INSTR(0x04, 0)
 #define MI_DISPLAY_FLIP   MI_INSTR(0x14, 2)
 #define MI_LOAD_SCAN_LINES_INCL MI_INSTR(0x12, 0)
 
 #define GFX_OP_PIPE_CONTROL(len)  ((0x329)|(0x327)|(0x224)|(len-2))
 #define PIPELINE_SELECT
 ((0x329)|(0x127)|(0x124)|(0x416))
 #define GFX_OP_3DSTATE_VF_STATISTICS   
 ((0x329)|(0x127)|(0x024)|(0xB16))
 #define MEDIA_VFE_STATE
 ((0x329)|(0x227)|(0x024)|(0x016))
 #define  MEDIA_VFE_STATE_MMIO_ACCESS_MASK (0x18)
 #define GPGPU_OBJECT   
 ((0x329)|(0x227)|(0x124)|(0x416))
 #define GPGPU_WALKER   
 ((0x329)|(0x227)|(0x124)|(0x516))
 #define GFX_OP_3DSTATE_DX9_CONSTANTF_VS \
   ((0x329)|(0x327)|(0x024)|(0x3916))
 #define GFX_OP_3DSTATE_DX9_CONSTANTF_PS \
   ((0x329)|(0x327)|(0x024)|(0x3A16))
 #define GFX_OP_3DSTATE_SO_DECL_LIST \
   ((0x329)|(0x327)|(0x124)|(0x1716))
 
 #define GFX_OP_3DSTATE_BINDING_TABLE_EDIT_VS \
   ((0x329)|(0x327)|(0x024)|(0x4316))
 #define GFX_OP_3DSTATE_BINDING_TABLE_EDIT_GS \
   ((0x329)|(0x327)|(0x024)|(0x4416))
 #define GFX_OP_3DSTATE_BINDING_TABLE_EDIT_HS \
   ((0x329)|(0x327)|(0x024)|(0x4516))
 #define GFX_OP_3DSTATE_BINDING_TABLE_EDIT_DS \
   ((0x329)|(0x327)|(0x024)|(0x4616))
 #define GFX_OP_3DSTATE_BINDING_TABLE_EDIT_PS \
   ((0x329)|(0x327)|(0x024)|(0x4716))
 
 #define MFX_WAIT  ((0x329)|(0x127)|(0x016))
 
 #define COLOR_BLT ((0x229)|(0x4022))
 #define 

Re: [Intel-gfx] [PATCH] drm/i915: Use hash tables for the command parser

2014-05-08 Thread Damien Lespiau
On Thu, May 08, 2014 at 08:53:38AM -0700, Volkin, Bradley D wrote:
 On Thu, May 08, 2014 at 06:15:44AM -0700, Lespiau, Damien wrote:
  On Thu, May 08, 2014 at 02:05:07PM +0100, Damien Lespiau wrote:
   On Mon, Apr 28, 2014 at 08:22:08AM -0700, bradley.d.vol...@intel.com 
   wrote:
From: Brad Volkin bradley.d.vol...@intel.com
+/*
+ * Different command ranges have different numbers of bits for the 
opcode.
+ * In order to use the opcode bits, and only the opcode bits, for the 
hash key
+ * we should use the MI_* command opcode mask (since those commands 
use the
+ * fewest bits for the opcode.)
+ */
+#define CMD_HASH_MASK STD_MI_OPCODE_MASK
   
   This is not very convicing (but could well be correct).
   
   #define STD_MI_OPCODE_MASK  0xFF80
   #define STD_3D_OPCODE_MASK  0x
   
   So it only works if the 3D opcodes have the top 9 bits all distinct?
  
  To expand on that, with the attached program:
  
  $ ./minimal-hash-hsw-render  | wc -l
  44
  
  $ ./minimal-hash-hsw-render  | sort -u | wc -l
  37
 
 Yes, as it's currently written, some commands may hash to the same
 bucket. The per-bucket search will use the full mask from the cmd
 descriptor to get an exact match.
 
 The issue is that, for example, MI commands in a batch may use bits
 22:16 for something other than the opcode (e.g. GGTT vs PPGTT). If we
 mask a command from a batch with 0x then an MI command may hash
 to the wrong bucket. If we want a perfect hash then I suppose we should
 look at bits 31:29 and mask with the exact STD_xx_OPCODE_MASK for the
 client. The existing INSTR_CLIENT_MASK/SHIFT defines could be reused
 for that.

Ah that works then, not super convinced that's the best way we can do
it, but it seems to be an improvement, so well... can't argue with that.

-- 
Damien
___
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx


Re: [Intel-gfx] [PATCH] drm/i915: Use hash tables for the command parser

2014-05-08 Thread Volkin, Bradley D
On Thu, May 08, 2014 at 08:45:07AM -0700, Ville Syrjälä wrote:
 On Thu, May 08, 2014 at 08:27:16AM -0700, Volkin, Bradley D wrote:
  On Thu, May 08, 2014 at 02:56:05AM -0700, Tvrtko Ursulin wrote:
   
   Hi Brad,
   
   On 04/28/2014 04:22 PM, bradley.d.vol...@intel.com wrote:
   [snip]
-   BUG_ON(!validate_cmds_sorted(ring));
+   BUG_ON(!validate_cmds_sorted(ring, cmd_tables, 
cmd_table_count));
BUG_ON(!validate_regs_sorted(ring));
+
+   BUG_ON(init_hash_table(ring, cmd_tables, cmd_table_count));
   
   Is a BUG_ON a bit harsh since the above fails only on ENOMEM condition?
   
   If the concern is not allowing any command execution if parser setup has 
   failed, it would be nicer to the system as whole to just keep rejecting 
   everything, but let the rest of the kernel live to enable debug or 
   whatever?
   
   I know it won't happen almost ever so it's a minor point really. I just 
   dislike actively hosing the whole system if it is avoidable.
  
  Hi Tvrtko,
  
  I agree that a BUG_ON might be harsh here. I suppose we could log an
  error and disable the command parser. Most command buffers would
  still go through fine but HW parsing would reject some that the SW
  parser might otherwise allow. That could be a bit odd if we ever did
  get a failure - apps/functionality that worked the last time I booted
  suddenly don't this time. The issue would be in the log though.
  
  I don't have a strong preference on this. Whatever people prefer.
 
 If the memory allocation fails there's probably not much point in
 trying to limp along and continue the driver init. So just pass error
 up and let the caller deal with it.
 
 Looking at the error paths up from ring init, we probably leak a ton
 of junk but at least the kernel should remain otherwise operational.

Passing the error up is probably a good option. There was a recent change
to mark the GPU as wedged if ring init fails, which seems like a reasonable
enough response to parser init failure. I'll have to look at the cleanup
paths. The per-ring parser init can probably move around to avoid excessive
additional cleanup code on failure paths.

Brad

 
 -- 
 Ville Syrjälä
 Intel OTC
___
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx


Re: [Intel-gfx] [PATCH] drm/i915: Use hash tables for the command parser

2014-05-08 Thread Volkin, Bradley D
On Thu, May 08, 2014 at 08:50:40AM -0700, Tvrtko Ursulin wrote:
 
 On 05/08/2014 04:27 PM, Volkin, Bradley D wrote:
  On Thu, May 08, 2014 at 02:56:05AM -0700, Tvrtko Ursulin wrote:
 
  Hi Brad,
 
  On 04/28/2014 04:22 PM, bradley.d.vol...@intel.com wrote:
  [snip]
  - BUG_ON(!validate_cmds_sorted(ring));
  + BUG_ON(!validate_cmds_sorted(ring, cmd_tables, cmd_table_count));
BUG_ON(!validate_regs_sorted(ring));
  +
  + BUG_ON(init_hash_table(ring, cmd_tables, cmd_table_count));
 
  Is a BUG_ON a bit harsh since the above fails only on ENOMEM condition?
 
  If the concern is not allowing any command execution if parser setup has
  failed, it would be nicer to the system as whole to just keep rejecting
  everything, but let the rest of the kernel live to enable debug or 
  whatever?
 
  I know it won't happen almost ever so it's a minor point really. I just
  dislike actively hosing the whole system if it is avoidable.
 
  Hi Tvrtko,
 
  I agree that a BUG_ON might be harsh here. I suppose we could log an
  error and disable the command parser. Most command buffers would
  still go through fine but HW parsing would reject some that the SW
  parser might otherwise allow. That could be a bit odd if we ever did
  get a failure - apps/functionality that worked the last time I booted
  suddenly don't this time. The issue would be in the log though.
 
 That would indeed be weird.
 
 So command parser is just a more permissive whitelist, a super set of 
 hardware parser?

Yes. The thing we're trying to accomplish is that HW rejects all
MI_LOAD/STORE_REGISTER_* commands, but it's safe and required to access
some registers from a batch in order to implement GL functionality.

 
  I don't have a strong preference on this. Whatever people prefer.
 
 Initially I was thinking of just rejecting all batch buffers if such 
 fundamental initialisation fails. It beats BUG_ON in the sense that it 
 leaves the system alive and it doesn't suffer from the random subtle 
 behaviour like if it would just disable the command parser.
 
 It is a bit theoretical because it is likely system won't be too healthy 
 anyway if such a tiny allocation fails but you never know. It's better 
 not to crash it if we don't have to.

Yeah, see my response to Ville. I think there's a good solution there.

Brad

 
 Regards,
 
 Tvrtko
___
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx


Re: [Intel-gfx] [PATCH] drm/i915: Use hash tables for the command parser

2014-05-07 Thread Volkin, Bradley D
Could someone help to review this patch please? It provides a nice
improvement to the command parser's performance, so I'd like to get
this one in.

Thanks,
Brad

On Mon, Apr 28, 2014 at 08:22:08AM -0700, Volkin, Bradley D wrote:
 From: Brad Volkin bradley.d.vol...@intel.com
 
 For clients that submit large batch buffers the command parser has
 a substantial impact on performance. On my HSW ULT system performance
 drops as much as ~20% on some tests. Most of the time is spent in the
 command lookup code. Converting that from the current naive search to
 a hash table lookup reduces the performance impact by as much as ~10%.
 
 The choice of value for I915_CMD_HASH_ORDER allows all commands
 currently used in the parser tables to hash to their own bucket (except
 for one collision on the render ring). The tradeoff is that it wastes
 memory. Because the opcodes for the commands in the tables are not
 particularly well distributed, reducing the order still leaves many
 buckets empty. The increased collisions don't seem to have a huge
 impact on the performance gain, but for now anyhow, the parser trades
 memory for performance.
 
 Signed-off-by: Brad Volkin bradley.d.vol...@intel.com
 ---
  drivers/gpu/drm/i915/i915_cmd_parser.c  | 136 
 
  drivers/gpu/drm/i915/i915_drv.h |   1 +
  drivers/gpu/drm/i915/intel_ringbuffer.c |   2 +
  drivers/gpu/drm/i915/intel_ringbuffer.h |  11 ++-
  4 files changed, 116 insertions(+), 34 deletions(-)
 
 diff --git a/drivers/gpu/drm/i915/i915_cmd_parser.c 
 b/drivers/gpu/drm/i915/i915_cmd_parser.c
 index 9bac097..9dca899 100644
 --- a/drivers/gpu/drm/i915/i915_cmd_parser.c
 +++ b/drivers/gpu/drm/i915/i915_cmd_parser.c
 @@ -498,16 +498,18 @@ static u32 gen7_blt_get_cmd_length_mask(u32 cmd_header)
   return 0;
  }
  
 -static bool validate_cmds_sorted(struct intel_ring_buffer *ring)
 +static bool validate_cmds_sorted(struct intel_ring_buffer *ring,
 +  const struct drm_i915_cmd_table *cmd_tables,
 +  int cmd_table_count)
  {
   int i;
   bool ret = true;
  
 - if (!ring-cmd_tables || ring-cmd_table_count == 0)
 + if (!cmd_tables || cmd_table_count == 0)
   return true;
  
 - for (i = 0; i  ring-cmd_table_count; i++) {
 - const struct drm_i915_cmd_table *table = ring-cmd_tables[i];
 + for (i = 0; i  cmd_table_count; i++) {
 + const struct drm_i915_cmd_table *table = cmd_tables[i];
   u32 previous = 0;
   int j;
  
 @@ -557,6 +559,60 @@ static bool validate_regs_sorted(struct 
 intel_ring_buffer *ring)
ring-master_reg_count);
  }
  
 +struct cmd_node {
 + const struct drm_i915_cmd_descriptor *desc;
 + struct hlist_node node;
 +};
 +
 +/*
 + * Different command ranges have different numbers of bits for the opcode.
 + * In order to use the opcode bits, and only the opcode bits, for the hash 
 key
 + * we should use the MI_* command opcode mask (since those commands use the
 + * fewest bits for the opcode.)
 + */
 +#define CMD_HASH_MASK STD_MI_OPCODE_MASK
 +
 +static int init_hash_table(struct intel_ring_buffer *ring,
 +const struct drm_i915_cmd_table *cmd_tables,
 +int cmd_table_count)
 +{
 + int i, j;
 +
 + hash_init(ring-cmd_hash);
 +
 + for (i = 0; i  cmd_table_count; i++) {
 + const struct drm_i915_cmd_table *table = cmd_tables[i];
 +
 + for (j = 0; j  table-count; j++) {
 + const struct drm_i915_cmd_descriptor *desc =
 + table-table[j];
 + struct cmd_node *desc_node =
 + kmalloc(sizeof(*desc_node), GFP_KERNEL);
 +
 + if (!desc_node)
 + return -ENOMEM;
 +
 + desc_node-desc = desc;
 + hash_add(ring-cmd_hash, desc_node-node,
 +  desc-cmd.value  CMD_HASH_MASK);
 + }
 + }
 +
 + return 0;
 +}
 +
 +static void fini_hash_table(struct intel_ring_buffer *ring)
 +{
 + struct hlist_node *tmp;
 + struct cmd_node *desc_node;
 + int i;
 +
 + hash_for_each_safe(ring-cmd_hash, i, tmp, desc_node, node) {
 + hash_del(desc_node-node);
 + kfree(desc_node);
 + }
 +}
 +
  /**
   * i915_cmd_parser_init_ring() - set cmd parser related fields for a 
 ringbuffer
   * @ring: the ringbuffer to initialize
 @@ -567,18 +623,21 @@ static bool validate_regs_sorted(struct 
 intel_ring_buffer *ring)
   */
  void i915_cmd_parser_init_ring(struct intel_ring_buffer *ring)
  {
 + const struct drm_i915_cmd_table *cmd_tables;
 + int cmd_table_count;
 +
   if (!IS_GEN7(ring-dev))
   return;
  
   switch (ring-id) {
   case RCS:
   if (IS_HASWELL(ring-dev)) {
 - ring-cmd_tables = 

[Intel-gfx] [PATCH] drm/i915: Use hash tables for the command parser

2014-04-28 Thread bradley . d . volkin
From: Brad Volkin bradley.d.vol...@intel.com

For clients that submit large batch buffers the command parser has
a substantial impact on performance. On my HSW ULT system performance
drops as much as ~20% on some tests. Most of the time is spent in the
command lookup code. Converting that from the current naive search to
a hash table lookup reduces the performance impact by as much as ~10%.

The choice of value for I915_CMD_HASH_ORDER allows all commands
currently used in the parser tables to hash to their own bucket (except
for one collision on the render ring). The tradeoff is that it wastes
memory. Because the opcodes for the commands in the tables are not
particularly well distributed, reducing the order still leaves many
buckets empty. The increased collisions don't seem to have a huge
impact on the performance gain, but for now anyhow, the parser trades
memory for performance.

Signed-off-by: Brad Volkin bradley.d.vol...@intel.com
---
 drivers/gpu/drm/i915/i915_cmd_parser.c  | 136 
 drivers/gpu/drm/i915/i915_drv.h |   1 +
 drivers/gpu/drm/i915/intel_ringbuffer.c |   2 +
 drivers/gpu/drm/i915/intel_ringbuffer.h |  11 ++-
 4 files changed, 116 insertions(+), 34 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_cmd_parser.c 
b/drivers/gpu/drm/i915/i915_cmd_parser.c
index 9bac097..9dca899 100644
--- a/drivers/gpu/drm/i915/i915_cmd_parser.c
+++ b/drivers/gpu/drm/i915/i915_cmd_parser.c
@@ -498,16 +498,18 @@ static u32 gen7_blt_get_cmd_length_mask(u32 cmd_header)
return 0;
 }
 
-static bool validate_cmds_sorted(struct intel_ring_buffer *ring)
+static bool validate_cmds_sorted(struct intel_ring_buffer *ring,
+const struct drm_i915_cmd_table *cmd_tables,
+int cmd_table_count)
 {
int i;
bool ret = true;
 
-   if (!ring-cmd_tables || ring-cmd_table_count == 0)
+   if (!cmd_tables || cmd_table_count == 0)
return true;
 
-   for (i = 0; i  ring-cmd_table_count; i++) {
-   const struct drm_i915_cmd_table *table = ring-cmd_tables[i];
+   for (i = 0; i  cmd_table_count; i++) {
+   const struct drm_i915_cmd_table *table = cmd_tables[i];
u32 previous = 0;
int j;
 
@@ -557,6 +559,60 @@ static bool validate_regs_sorted(struct intel_ring_buffer 
*ring)
 ring-master_reg_count);
 }
 
+struct cmd_node {
+   const struct drm_i915_cmd_descriptor *desc;
+   struct hlist_node node;
+};
+
+/*
+ * Different command ranges have different numbers of bits for the opcode.
+ * In order to use the opcode bits, and only the opcode bits, for the hash key
+ * we should use the MI_* command opcode mask (since those commands use the
+ * fewest bits for the opcode.)
+ */
+#define CMD_HASH_MASK STD_MI_OPCODE_MASK
+
+static int init_hash_table(struct intel_ring_buffer *ring,
+  const struct drm_i915_cmd_table *cmd_tables,
+  int cmd_table_count)
+{
+   int i, j;
+
+   hash_init(ring-cmd_hash);
+
+   for (i = 0; i  cmd_table_count; i++) {
+   const struct drm_i915_cmd_table *table = cmd_tables[i];
+
+   for (j = 0; j  table-count; j++) {
+   const struct drm_i915_cmd_descriptor *desc =
+   table-table[j];
+   struct cmd_node *desc_node =
+   kmalloc(sizeof(*desc_node), GFP_KERNEL);
+
+   if (!desc_node)
+   return -ENOMEM;
+
+   desc_node-desc = desc;
+   hash_add(ring-cmd_hash, desc_node-node,
+desc-cmd.value  CMD_HASH_MASK);
+   }
+   }
+
+   return 0;
+}
+
+static void fini_hash_table(struct intel_ring_buffer *ring)
+{
+   struct hlist_node *tmp;
+   struct cmd_node *desc_node;
+   int i;
+
+   hash_for_each_safe(ring-cmd_hash, i, tmp, desc_node, node) {
+   hash_del(desc_node-node);
+   kfree(desc_node);
+   }
+}
+
 /**
  * i915_cmd_parser_init_ring() - set cmd parser related fields for a ringbuffer
  * @ring: the ringbuffer to initialize
@@ -567,18 +623,21 @@ static bool validate_regs_sorted(struct intel_ring_buffer 
*ring)
  */
 void i915_cmd_parser_init_ring(struct intel_ring_buffer *ring)
 {
+   const struct drm_i915_cmd_table *cmd_tables;
+   int cmd_table_count;
+
if (!IS_GEN7(ring-dev))
return;
 
switch (ring-id) {
case RCS:
if (IS_HASWELL(ring-dev)) {
-   ring-cmd_tables = hsw_render_ring_cmds;
-   ring-cmd_table_count =
+   cmd_tables = hsw_render_ring_cmds;
+   cmd_table_count =
ARRAY_SIZE(hsw_render_ring_cmds);
} else {
- 

Re: [Intel-gfx] [PATCH] drm/i915: Use hash tables for the command parser

2014-04-28 Thread Daniel Vetter
On Mon, Apr 28, 2014 at 08:22:08AM -0700, bradley.d.vol...@intel.com wrote:
 From: Brad Volkin bradley.d.vol...@intel.com
 
 For clients that submit large batch buffers the command parser has
 a substantial impact on performance. On my HSW ULT system performance
 drops as much as ~20% on some tests. Most of the time is spent in the
 command lookup code. Converting that from the current naive search to
 a hash table lookup reduces the performance impact by as much as ~10%.
 
 The choice of value for I915_CMD_HASH_ORDER allows all commands
 currently used in the parser tables to hash to their own bucket (except
 for one collision on the render ring). The tradeoff is that it wastes
 memory. Because the opcodes for the commands in the tables are not
 particularly well distributed, reducing the order still leaves many
 buckets empty. The increased collisions don't seem to have a huge
 impact on the performance gain, but for now anyhow, the parser trades
 memory for performance.
 
 Signed-off-by: Brad Volkin bradley.d.vol...@intel.com

Nice. One idea on top which could be worth a shot is a bloomfilter to
handle all the non-special cases without a (likely) cache miss in the
hashtable. The per-ring bloomfilter would be only loaded once (and if we
place it nearby other stuff the cmdparser needs anyway even that is
amortized).

Also Chris mentioned that blitter loads under X are about the worst case
wrt impact of the cmdparser. Benchmarking x11perf might be a useful
extreme testcase for optimizing this. I guess Chris will jump in with more
ideas?

Thanks, Daniel

 ---
  drivers/gpu/drm/i915/i915_cmd_parser.c  | 136 
 
  drivers/gpu/drm/i915/i915_drv.h |   1 +
  drivers/gpu/drm/i915/intel_ringbuffer.c |   2 +
  drivers/gpu/drm/i915/intel_ringbuffer.h |  11 ++-
  4 files changed, 116 insertions(+), 34 deletions(-)
 
 diff --git a/drivers/gpu/drm/i915/i915_cmd_parser.c 
 b/drivers/gpu/drm/i915/i915_cmd_parser.c
 index 9bac097..9dca899 100644
 --- a/drivers/gpu/drm/i915/i915_cmd_parser.c
 +++ b/drivers/gpu/drm/i915/i915_cmd_parser.c
 @@ -498,16 +498,18 @@ static u32 gen7_blt_get_cmd_length_mask(u32 cmd_header)
   return 0;
  }
  
 -static bool validate_cmds_sorted(struct intel_ring_buffer *ring)
 +static bool validate_cmds_sorted(struct intel_ring_buffer *ring,
 +  const struct drm_i915_cmd_table *cmd_tables,
 +  int cmd_table_count)
  {
   int i;
   bool ret = true;
  
 - if (!ring-cmd_tables || ring-cmd_table_count == 0)
 + if (!cmd_tables || cmd_table_count == 0)
   return true;
  
 - for (i = 0; i  ring-cmd_table_count; i++) {
 - const struct drm_i915_cmd_table *table = ring-cmd_tables[i];
 + for (i = 0; i  cmd_table_count; i++) {
 + const struct drm_i915_cmd_table *table = cmd_tables[i];
   u32 previous = 0;
   int j;
  
 @@ -557,6 +559,60 @@ static bool validate_regs_sorted(struct 
 intel_ring_buffer *ring)
ring-master_reg_count);
  }
  
 +struct cmd_node {
 + const struct drm_i915_cmd_descriptor *desc;
 + struct hlist_node node;
 +};
 +
 +/*
 + * Different command ranges have different numbers of bits for the opcode.
 + * In order to use the opcode bits, and only the opcode bits, for the hash 
 key
 + * we should use the MI_* command opcode mask (since those commands use the
 + * fewest bits for the opcode.)
 + */
 +#define CMD_HASH_MASK STD_MI_OPCODE_MASK
 +
 +static int init_hash_table(struct intel_ring_buffer *ring,
 +const struct drm_i915_cmd_table *cmd_tables,
 +int cmd_table_count)
 +{
 + int i, j;
 +
 + hash_init(ring-cmd_hash);
 +
 + for (i = 0; i  cmd_table_count; i++) {
 + const struct drm_i915_cmd_table *table = cmd_tables[i];
 +
 + for (j = 0; j  table-count; j++) {
 + const struct drm_i915_cmd_descriptor *desc =
 + table-table[j];
 + struct cmd_node *desc_node =
 + kmalloc(sizeof(*desc_node), GFP_KERNEL);
 +
 + if (!desc_node)
 + return -ENOMEM;
 +
 + desc_node-desc = desc;
 + hash_add(ring-cmd_hash, desc_node-node,
 +  desc-cmd.value  CMD_HASH_MASK);
 + }
 + }
 +
 + return 0;
 +}
 +
 +static void fini_hash_table(struct intel_ring_buffer *ring)
 +{
 + struct hlist_node *tmp;
 + struct cmd_node *desc_node;
 + int i;
 +
 + hash_for_each_safe(ring-cmd_hash, i, tmp, desc_node, node) {
 + hash_del(desc_node-node);
 + kfree(desc_node);
 + }
 +}
 +
  /**
   * i915_cmd_parser_init_ring() - set cmd parser related fields for a 
 ringbuffer
   * @ring: the ringbuffer to initialize
 @@ -567,18 +623,21 @@ static bool 

Re: [Intel-gfx] [PATCH] drm/i915: Use hash tables for the command parser

2014-04-28 Thread Volkin, Bradley D
On Mon, Apr 28, 2014 at 08:22:08AM -0700, Volkin, Bradley D wrote:
 From: Brad Volkin bradley.d.vol...@intel.com
 
 For clients that submit large batch buffers the command parser has
 a substantial impact on performance. On my HSW ULT system performance
 drops as much as ~20% on some tests. Most of the time is spent in the
 command lookup code. Converting that from the current naive search to
 a hash table lookup reduces the performance impact by as much as ~10%.

Tvrtko pointed out that what I wrote here is a bit ambiguous. To clarify:
Without the patch, perf drops 20%
With the patch, perf drops 10%

Brad

 
 The choice of value for I915_CMD_HASH_ORDER allows all commands
 currently used in the parser tables to hash to their own bucket (except
 for one collision on the render ring). The tradeoff is that it wastes
 memory. Because the opcodes for the commands in the tables are not
 particularly well distributed, reducing the order still leaves many
 buckets empty. The increased collisions don't seem to have a huge
 impact on the performance gain, but for now anyhow, the parser trades
 memory for performance.
 
 Signed-off-by: Brad Volkin bradley.d.vol...@intel.com
 ---
  drivers/gpu/drm/i915/i915_cmd_parser.c  | 136 
 
  drivers/gpu/drm/i915/i915_drv.h |   1 +
  drivers/gpu/drm/i915/intel_ringbuffer.c |   2 +
  drivers/gpu/drm/i915/intel_ringbuffer.h |  11 ++-
  4 files changed, 116 insertions(+), 34 deletions(-)
 
 diff --git a/drivers/gpu/drm/i915/i915_cmd_parser.c 
 b/drivers/gpu/drm/i915/i915_cmd_parser.c
 index 9bac097..9dca899 100644
 --- a/drivers/gpu/drm/i915/i915_cmd_parser.c
 +++ b/drivers/gpu/drm/i915/i915_cmd_parser.c
 @@ -498,16 +498,18 @@ static u32 gen7_blt_get_cmd_length_mask(u32 cmd_header)
   return 0;
  }
  
 -static bool validate_cmds_sorted(struct intel_ring_buffer *ring)
 +static bool validate_cmds_sorted(struct intel_ring_buffer *ring,
 +  const struct drm_i915_cmd_table *cmd_tables,
 +  int cmd_table_count)
  {
   int i;
   bool ret = true;
  
 - if (!ring-cmd_tables || ring-cmd_table_count == 0)
 + if (!cmd_tables || cmd_table_count == 0)
   return true;
  
 - for (i = 0; i  ring-cmd_table_count; i++) {
 - const struct drm_i915_cmd_table *table = ring-cmd_tables[i];
 + for (i = 0; i  cmd_table_count; i++) {
 + const struct drm_i915_cmd_table *table = cmd_tables[i];
   u32 previous = 0;
   int j;
  
 @@ -557,6 +559,60 @@ static bool validate_regs_sorted(struct 
 intel_ring_buffer *ring)
ring-master_reg_count);
  }
  
 +struct cmd_node {
 + const struct drm_i915_cmd_descriptor *desc;
 + struct hlist_node node;
 +};
 +
 +/*
 + * Different command ranges have different numbers of bits for the opcode.
 + * In order to use the opcode bits, and only the opcode bits, for the hash 
 key
 + * we should use the MI_* command opcode mask (since those commands use the
 + * fewest bits for the opcode.)
 + */
 +#define CMD_HASH_MASK STD_MI_OPCODE_MASK
 +
 +static int init_hash_table(struct intel_ring_buffer *ring,
 +const struct drm_i915_cmd_table *cmd_tables,
 +int cmd_table_count)
 +{
 + int i, j;
 +
 + hash_init(ring-cmd_hash);
 +
 + for (i = 0; i  cmd_table_count; i++) {
 + const struct drm_i915_cmd_table *table = cmd_tables[i];
 +
 + for (j = 0; j  table-count; j++) {
 + const struct drm_i915_cmd_descriptor *desc =
 + table-table[j];
 + struct cmd_node *desc_node =
 + kmalloc(sizeof(*desc_node), GFP_KERNEL);
 +
 + if (!desc_node)
 + return -ENOMEM;
 +
 + desc_node-desc = desc;
 + hash_add(ring-cmd_hash, desc_node-node,
 +  desc-cmd.value  CMD_HASH_MASK);
 + }
 + }
 +
 + return 0;
 +}
 +
 +static void fini_hash_table(struct intel_ring_buffer *ring)
 +{
 + struct hlist_node *tmp;
 + struct cmd_node *desc_node;
 + int i;
 +
 + hash_for_each_safe(ring-cmd_hash, i, tmp, desc_node, node) {
 + hash_del(desc_node-node);
 + kfree(desc_node);
 + }
 +}
 +
  /**
   * i915_cmd_parser_init_ring() - set cmd parser related fields for a 
 ringbuffer
   * @ring: the ringbuffer to initialize
 @@ -567,18 +623,21 @@ static bool validate_regs_sorted(struct 
 intel_ring_buffer *ring)
   */
  void i915_cmd_parser_init_ring(struct intel_ring_buffer *ring)
  {
 + const struct drm_i915_cmd_table *cmd_tables;
 + int cmd_table_count;
 +
   if (!IS_GEN7(ring-dev))
   return;
  
   switch (ring-id) {
   case RCS:
   if (IS_HASWELL(ring-dev)) {
 - ring-cmd_tables = hsw_render_ring_cmds;
 -   

Re: [Intel-gfx] [PATCH] drm/i915: Use hash tables for the command parser

2014-04-28 Thread Daniel Vetter
On Mon, Apr 28, 2014 at 08:22:08AM -0700, bradley.d.vol...@intel.com wrote:
 From: Brad Volkin bradley.d.vol...@intel.com
 
 For clients that submit large batch buffers the command parser has
 a substantial impact on performance. On my HSW ULT system performance
 drops as much as ~20% on some tests. Most of the time is spent in the
 command lookup code. Converting that from the current naive search to
 a hash table lookup reduces the performance impact by as much as ~10%.
 
 The choice of value for I915_CMD_HASH_ORDER allows all commands
 currently used in the parser tables to hash to their own bucket (except
 for one collision on the render ring). The tradeoff is that it wastes
 memory. Because the opcodes for the commands in the tables are not
 particularly well distributed, reducing the order still leaves many
 buckets empty. The increased collisions don't seem to have a huge
 impact on the performance gain, but for now anyhow, the parser trades
 memory for performance.

For the collisions have you looked into pre-munging the key a bit so that
we use more bits? A few shifts and xors shouldn't affect perf much really.

Also since the tables are mostly empty we could just overflow to the next
hashtable entry, but unfortunately that would require a bit of custom
insert and lookup code.

Finally if we manage to get 0 collisions a WARN_ON would be good for
that, to make sure we don't accidentally regress.

Anyway just a few more ideas.
-Daniel

Finally if we manage to get 0 collisions a WARN_ON would be good for
that, to make sure we don't accidentally regress.

Anyway just a few more ideas.
-Daniel

 
 Signed-off-by: Brad Volkin bradley.d.vol...@intel.com
 ---
  drivers/gpu/drm/i915/i915_cmd_parser.c  | 136 
 
  drivers/gpu/drm/i915/i915_drv.h |   1 +
  drivers/gpu/drm/i915/intel_ringbuffer.c |   2 +
  drivers/gpu/drm/i915/intel_ringbuffer.h |  11 ++-
  4 files changed, 116 insertions(+), 34 deletions(-)
 
 diff --git a/drivers/gpu/drm/i915/i915_cmd_parser.c 
 b/drivers/gpu/drm/i915/i915_cmd_parser.c
 index 9bac097..9dca899 100644
 --- a/drivers/gpu/drm/i915/i915_cmd_parser.c
 +++ b/drivers/gpu/drm/i915/i915_cmd_parser.c
 @@ -498,16 +498,18 @@ static u32 gen7_blt_get_cmd_length_mask(u32 cmd_header)
   return 0;
  }
  
 -static bool validate_cmds_sorted(struct intel_ring_buffer *ring)
 +static bool validate_cmds_sorted(struct intel_ring_buffer *ring,
 +  const struct drm_i915_cmd_table *cmd_tables,
 +  int cmd_table_count)
  {
   int i;
   bool ret = true;
  
 - if (!ring-cmd_tables || ring-cmd_table_count == 0)
 + if (!cmd_tables || cmd_table_count == 0)
   return true;
  
 - for (i = 0; i  ring-cmd_table_count; i++) {
 - const struct drm_i915_cmd_table *table = ring-cmd_tables[i];
 + for (i = 0; i  cmd_table_count; i++) {
 + const struct drm_i915_cmd_table *table = cmd_tables[i];
   u32 previous = 0;
   int j;
  
 @@ -557,6 +559,60 @@ static bool validate_regs_sorted(struct 
 intel_ring_buffer *ring)
ring-master_reg_count);
  }
  
 +struct cmd_node {
 + const struct drm_i915_cmd_descriptor *desc;
 + struct hlist_node node;
 +};
 +
 +/*
 + * Different command ranges have different numbers of bits for the opcode.
 + * In order to use the opcode bits, and only the opcode bits, for the hash 
 key
 + * we should use the MI_* command opcode mask (since those commands use the
 + * fewest bits for the opcode.)
 + */
 +#define CMD_HASH_MASK STD_MI_OPCODE_MASK
 +
 +static int init_hash_table(struct intel_ring_buffer *ring,
 +const struct drm_i915_cmd_table *cmd_tables,
 +int cmd_table_count)
 +{
 + int i, j;
 +
 + hash_init(ring-cmd_hash);
 +
 + for (i = 0; i  cmd_table_count; i++) {
 + const struct drm_i915_cmd_table *table = cmd_tables[i];
 +
 + for (j = 0; j  table-count; j++) {
 + const struct drm_i915_cmd_descriptor *desc =
 + table-table[j];
 + struct cmd_node *desc_node =
 + kmalloc(sizeof(*desc_node), GFP_KERNEL);
 +
 + if (!desc_node)
 + return -ENOMEM;
 +
 + desc_node-desc = desc;
 + hash_add(ring-cmd_hash, desc_node-node,
 +  desc-cmd.value  CMD_HASH_MASK);
 + }
 + }
 +
 + return 0;
 +}
 +
 +static void fini_hash_table(struct intel_ring_buffer *ring)
 +{
 + struct hlist_node *tmp;
 + struct cmd_node *desc_node;
 + int i;
 +
 + hash_for_each_safe(ring-cmd_hash, i, tmp, desc_node, node) {
 + hash_del(desc_node-node);
 + kfree(desc_node);
 + }
 +}
 +
  /**
   * i915_cmd_parser_init_ring() - set cmd parser related fields for a 
 ringbuffer

Re: [Intel-gfx] [PATCH] drm/i915: Use hash tables for the command parser

2014-04-28 Thread Volkin, Bradley D
On Mon, Apr 28, 2014 at 08:42:56AM -0700, Daniel Vetter wrote:
 On Mon, Apr 28, 2014 at 08:22:08AM -0700, bradley.d.vol...@intel.com wrote:
  From: Brad Volkin bradley.d.vol...@intel.com
  
  For clients that submit large batch buffers the command parser has
  a substantial impact on performance. On my HSW ULT system performance
  drops as much as ~20% on some tests. Most of the time is spent in the
  command lookup code. Converting that from the current naive search to
  a hash table lookup reduces the performance impact by as much as ~10%.
  
  The choice of value for I915_CMD_HASH_ORDER allows all commands
  currently used in the parser tables to hash to their own bucket (except
  for one collision on the render ring). The tradeoff is that it wastes
  memory. Because the opcodes for the commands in the tables are not
  particularly well distributed, reducing the order still leaves many
  buckets empty. The increased collisions don't seem to have a huge
  impact on the performance gain, but for now anyhow, the parser trades
  memory for performance.
  
  Signed-off-by: Brad Volkin bradley.d.vol...@intel.com
 
 Nice. One idea on top which could be worth a shot is a bloomfilter to
 handle all the non-special cases without a (likely) cache miss in the
 hashtable. The per-ring bloomfilter would be only loaded once (and if we
 place it nearby other stuff the cmdparser needs anyway even that is
 amortized).

Good suggestion. Noted.

 
 Also Chris mentioned that blitter loads under X are about the worst case
 wrt impact of the cmdparser. Benchmarking x11perf might be a useful
 extreme testcase for optimizing this. I guess Chris will jump in with more
 ideas?

Ok, I'll see how x11perf looks with and without this patch as a starting
point.

Brad

 
 Thanks, Daniel
 
  ---
   drivers/gpu/drm/i915/i915_cmd_parser.c  | 136 
  
   drivers/gpu/drm/i915/i915_drv.h |   1 +
   drivers/gpu/drm/i915/intel_ringbuffer.c |   2 +
   drivers/gpu/drm/i915/intel_ringbuffer.h |  11 ++-
   4 files changed, 116 insertions(+), 34 deletions(-)
  
  diff --git a/drivers/gpu/drm/i915/i915_cmd_parser.c 
  b/drivers/gpu/drm/i915/i915_cmd_parser.c
  index 9bac097..9dca899 100644
  --- a/drivers/gpu/drm/i915/i915_cmd_parser.c
  +++ b/drivers/gpu/drm/i915/i915_cmd_parser.c
  @@ -498,16 +498,18 @@ static u32 gen7_blt_get_cmd_length_mask(u32 
  cmd_header)
  return 0;
   }
   
  -static bool validate_cmds_sorted(struct intel_ring_buffer *ring)
  +static bool validate_cmds_sorted(struct intel_ring_buffer *ring,
  +const struct drm_i915_cmd_table *cmd_tables,
  +int cmd_table_count)
   {
  int i;
  bool ret = true;
   
  -   if (!ring-cmd_tables || ring-cmd_table_count == 0)
  +   if (!cmd_tables || cmd_table_count == 0)
  return true;
   
  -   for (i = 0; i  ring-cmd_table_count; i++) {
  -   const struct drm_i915_cmd_table *table = ring-cmd_tables[i];
  +   for (i = 0; i  cmd_table_count; i++) {
  +   const struct drm_i915_cmd_table *table = cmd_tables[i];
  u32 previous = 0;
  int j;
   
  @@ -557,6 +559,60 @@ static bool validate_regs_sorted(struct 
  intel_ring_buffer *ring)
   ring-master_reg_count);
   }
   
  +struct cmd_node {
  +   const struct drm_i915_cmd_descriptor *desc;
  +   struct hlist_node node;
  +};
  +
  +/*
  + * Different command ranges have different numbers of bits for the opcode.
  + * In order to use the opcode bits, and only the opcode bits, for the hash 
  key
  + * we should use the MI_* command opcode mask (since those commands use the
  + * fewest bits for the opcode.)
  + */
  +#define CMD_HASH_MASK STD_MI_OPCODE_MASK
  +
  +static int init_hash_table(struct intel_ring_buffer *ring,
  +  const struct drm_i915_cmd_table *cmd_tables,
  +  int cmd_table_count)
  +{
  +   int i, j;
  +
  +   hash_init(ring-cmd_hash);
  +
  +   for (i = 0; i  cmd_table_count; i++) {
  +   const struct drm_i915_cmd_table *table = cmd_tables[i];
  +
  +   for (j = 0; j  table-count; j++) {
  +   const struct drm_i915_cmd_descriptor *desc =
  +   table-table[j];
  +   struct cmd_node *desc_node =
  +   kmalloc(sizeof(*desc_node), GFP_KERNEL);
  +
  +   if (!desc_node)
  +   return -ENOMEM;
  +
  +   desc_node-desc = desc;
  +   hash_add(ring-cmd_hash, desc_node-node,
  +desc-cmd.value  CMD_HASH_MASK);
  +   }
  +   }
  +
  +   return 0;
  +}
  +
  +static void fini_hash_table(struct intel_ring_buffer *ring)
  +{
  +   struct hlist_node *tmp;
  +   struct cmd_node *desc_node;
  +   int i;
  +
  +   hash_for_each_safe(ring-cmd_hash, i, tmp, desc_node, node) {
  +   hash_del(desc_node-node);
  +   

Re: [Intel-gfx] [PATCH] drm/i915: Use hash tables for the command parser

2014-04-28 Thread Volkin, Bradley D
On Mon, Apr 28, 2014 at 08:53:30AM -0700, Daniel Vetter wrote:
 On Mon, Apr 28, 2014 at 08:22:08AM -0700, bradley.d.vol...@intel.com wrote:
  From: Brad Volkin bradley.d.vol...@intel.com
  
  For clients that submit large batch buffers the command parser has
  a substantial impact on performance. On my HSW ULT system performance
  drops as much as ~20% on some tests. Most of the time is spent in the
  command lookup code. Converting that from the current naive search to
  a hash table lookup reduces the performance impact by as much as ~10%.
  
  The choice of value for I915_CMD_HASH_ORDER allows all commands
  currently used in the parser tables to hash to their own bucket (except
  for one collision on the render ring). The tradeoff is that it wastes
  memory. Because the opcodes for the commands in the tables are not
  particularly well distributed, reducing the order still leaves many
  buckets empty. The increased collisions don't seem to have a huge
  impact on the performance gain, but for now anyhow, the parser trades
  memory for performance.
 
 For the collisions have you looked into pre-munging the key a bit so that
 we use more bits? A few shifts and xors shouldn't affect perf much really.

I looked at this briefly but didn't find a substantial improvement. The
basic patch improved things enough that I wanted to just get it out. I
can look into this more, but I'd like to think about implementing the
batch buffer copy portion next. I don't want to optimize this, make people
happy, and then introduce another perf drop from the copy. Better to just
take the full hit now and then continue the improvements. Sound reasonable?

Brad

 
 Also since the tables are mostly empty we could just overflow to the next
 hashtable entry, but unfortunately that would require a bit of custom
 insert and lookup code.
 
 Finally if we manage to get 0 collisions a WARN_ON would be good for
 that, to make sure we don't accidentally regress.
 
 Anyway just a few more ideas.
 -Daniel
 
 Finally if we manage to get 0 collisions a WARN_ON would be good for
 that, to make sure we don't accidentally regress.
 
 Anyway just a few more ideas.
 -Daniel
 
  
  Signed-off-by: Brad Volkin bradley.d.vol...@intel.com
  ---
   drivers/gpu/drm/i915/i915_cmd_parser.c  | 136 
  
   drivers/gpu/drm/i915/i915_drv.h |   1 +
   drivers/gpu/drm/i915/intel_ringbuffer.c |   2 +
   drivers/gpu/drm/i915/intel_ringbuffer.h |  11 ++-
   4 files changed, 116 insertions(+), 34 deletions(-)
  
  diff --git a/drivers/gpu/drm/i915/i915_cmd_parser.c 
  b/drivers/gpu/drm/i915/i915_cmd_parser.c
  index 9bac097..9dca899 100644
  --- a/drivers/gpu/drm/i915/i915_cmd_parser.c
  +++ b/drivers/gpu/drm/i915/i915_cmd_parser.c
  @@ -498,16 +498,18 @@ static u32 gen7_blt_get_cmd_length_mask(u32 
  cmd_header)
  return 0;
   }
   
  -static bool validate_cmds_sorted(struct intel_ring_buffer *ring)
  +static bool validate_cmds_sorted(struct intel_ring_buffer *ring,
  +const struct drm_i915_cmd_table *cmd_tables,
  +int cmd_table_count)
   {
  int i;
  bool ret = true;
   
  -   if (!ring-cmd_tables || ring-cmd_table_count == 0)
  +   if (!cmd_tables || cmd_table_count == 0)
  return true;
   
  -   for (i = 0; i  ring-cmd_table_count; i++) {
  -   const struct drm_i915_cmd_table *table = ring-cmd_tables[i];
  +   for (i = 0; i  cmd_table_count; i++) {
  +   const struct drm_i915_cmd_table *table = cmd_tables[i];
  u32 previous = 0;
  int j;
   
  @@ -557,6 +559,60 @@ static bool validate_regs_sorted(struct 
  intel_ring_buffer *ring)
   ring-master_reg_count);
   }
   
  +struct cmd_node {
  +   const struct drm_i915_cmd_descriptor *desc;
  +   struct hlist_node node;
  +};
  +
  +/*
  + * Different command ranges have different numbers of bits for the opcode.
  + * In order to use the opcode bits, and only the opcode bits, for the hash 
  key
  + * we should use the MI_* command opcode mask (since those commands use the
  + * fewest bits for the opcode.)
  + */
  +#define CMD_HASH_MASK STD_MI_OPCODE_MASK
  +
  +static int init_hash_table(struct intel_ring_buffer *ring,
  +  const struct drm_i915_cmd_table *cmd_tables,
  +  int cmd_table_count)
  +{
  +   int i, j;
  +
  +   hash_init(ring-cmd_hash);
  +
  +   for (i = 0; i  cmd_table_count; i++) {
  +   const struct drm_i915_cmd_table *table = cmd_tables[i];
  +
  +   for (j = 0; j  table-count; j++) {
  +   const struct drm_i915_cmd_descriptor *desc =
  +   table-table[j];
  +   struct cmd_node *desc_node =
  +   kmalloc(sizeof(*desc_node), GFP_KERNEL);
  +
  +   if (!desc_node)
  +   return -ENOMEM;
  +
  +   desc_node-desc = desc;
  + 

Re: [Intel-gfx] [PATCH] drm/i915: Use hash tables for the command parser

2014-04-28 Thread Daniel Vetter
On Mon, Apr 28, 2014 at 6:07 PM, Volkin, Bradley D
bradley.d.vol...@intel.com wrote:
 On Mon, Apr 28, 2014 at 08:53:30AM -0700, Daniel Vetter wrote:
 On Mon, Apr 28, 2014 at 08:22:08AM -0700, bradley.d.vol...@intel.com wrote:
  From: Brad Volkin bradley.d.vol...@intel.com
 
  For clients that submit large batch buffers the command parser has
  a substantial impact on performance. On my HSW ULT system performance
  drops as much as ~20% on some tests. Most of the time is spent in the
  command lookup code. Converting that from the current naive search to
  a hash table lookup reduces the performance impact by as much as ~10%.
 
  The choice of value for I915_CMD_HASH_ORDER allows all commands
  currently used in the parser tables to hash to their own bucket (except
  for one collision on the render ring). The tradeoff is that it wastes
  memory. Because the opcodes for the commands in the tables are not
  particularly well distributed, reducing the order still leaves many
  buckets empty. The increased collisions don't seem to have a huge
  impact on the performance gain, but for now anyhow, the parser trades
  memory for performance.

 For the collisions have you looked into pre-munging the key a bit so that
 we use more bits? A few shifts and xors shouldn't affect perf much really.

 I looked at this briefly but didn't find a substantial improvement. The
 basic patch improved things enough that I wanted to just get it out. I
 can look into this more, but I'd like to think about implementing the
 batch buffer copy portion next. I don't want to optimize this, make people
 happy, and then introduce another perf drop from the copy. Better to just
 take the full hit now and then continue the improvements. Sound reasonable?

Yeah, makes sense. Like I've said just throwing around ideas.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch
___
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx