Hello;

On 07/25/2017 05:17 PM, Marathe, Yogesh wrote:
Hi Matt, Sorry for late reply, please see below.

-----Original Message-----
From: mesa-dev [mailto:mesa-dev-boun...@lists.freedesktop.org] On Behalf
Of Matt Turner
Sent: Saturday, July 22, 2017 12:12 AM
To: Muthukumar, Aravindan <aravindan.muthuku...@intel.com>
Cc: mesa-dev@lists.freedesktop.org; Marathe, Yogesh
<yogesh.mara...@intel.com>
Subject: Re: [Mesa-dev] [PATCH V3] i965 : Optimize atom state flag checks

On 07/21, aravindan.muthuku...@intel.com wrote:
From: Aravindan Muthukumar <aravindan.muthuku...@intel.com>

This patch improves CPI Rate(Cycles per Instruction) and branch miss
predict for i965. The function check_state() was showing CPI retired rate.

Performance stats with android:
- CPI retired lowered by 28% (lower is better)
- Branch missprediction lowered by 13% (lower is better)
- 3DMark improved by 2%

The dissassembly doesn't show difference, although above results were
observed with patch.


Yes this is true for V3 where we removed the function based on a review comment.

If there's no difference in the assembly then whatever measurements you have
taken must be noise.


No that's not guaranteed either. Lot of things still depend on how instructions 
are
aligned, sometimes even changing linking order gives different results where
disassemblies of individual functions remain same.
I applied the patch and inspected brw_state_upload.o. There are assembly
differences. I can produce the same assembly as this patch by just pulling the 
if
statement out of check_and_emit_atom() and into the caller. The replacement
of check_state() with a macro is completely unnecessary.


diff --git a/src/mesa/drivers/dri/i965/brw_state_upload.c
b/src/mesa/drivers/dri/i965/brw_state_upload.c
index acaa97ee7d..b163e1490d 100644
--- a/src/mesa/drivers/dri/i965/brw_state_upload.c
+++ b/src/mesa/drivers/dri/i965/brw_state_upload.c
@@ -439,14 +439,12 @@ merge_ctx_state(struct brw_context *brw,  }

  static inline void
-check_and_emit_atom(struct brw_context *brw,
-                    struct brw_state_flags *state,
-                    const struct brw_tracked_state *atom)
+emit_atom(struct brw_context *brw,
+          struct brw_state_flags *state,
+          const struct brw_tracked_state *atom)
  {
-   if (check_state(state, &atom->dirty)) {
-      atom->emit(brw);
-      merge_ctx_state(brw, state);
-   }
+   atom->emit(brw);
+   merge_ctx_state(brw, state);
  }

  static inline void
@@ -541,7 +539,9 @@ brw_upload_pipeline_state(struct brw_context *brw,
         const struct brw_tracked_state *atom = &atoms[i];
         struct brw_state_flags generated;

-         check_and_emit_atom(brw, &state, atom);
+         if (check_state(&state, &atom->dirty)) {
+            emit_atom(brw, &state, atom);
+         }

         accumulate_state(&examined, &atom->dirty);

@@ -558,7 +558,9 @@ brw_upload_pipeline_state(struct brw_context *brw,
        for (i = 0; i < num_atoms; i++) {
         const struct brw_tracked_state *atom = &atoms[i];

-         check_and_emit_atom(brw, &state, atom);
+         if (check_state(&state, &atom->dirty)) {
+            emit_atom(brw, &state, atom);
+         }
        }
     }


With that said, the assembly differences are not understandable to me.
Why should extracting an if statement from a static inline function into the 
caller
of that function cause any difference whatsoever?

Agreed, it shouldn't in case of static inline.


I'm viewing the assembly differences with:

wdiff -n \
     -w $'\033[30;41m' -x $'\033[0m' \
     -y $'\033[30;42m' -z $'\033[0m' \
     <(objdump -d brw_state_upload.o.before | sed -e 's/^.*\t//') \
     <(objdump -d brw_state_upload.o.wtf    | sed -e 's/^.*\t//') | less -R

and the only real difference is the movement of some call and jmp instructions.

I cannot take the patch without some reasonable explanation for the change.

Ok I think this has been discussed already and we agree that there is no big 
visible difference
in disassembly which can be pointed out for improvement. Although, as you said 
this movement
of instructions can cause this. If with this patch instructions get cache 
aligned that too can show
improvement.  This is a busy function with bad CPI. Hence chosen for 
optimization. Branch miss
predict is another metric.  Do we want to consider all these or just 
disassembly?

I don't have a reasonable explanation to give but I made some benchmark runs today (3DMark "Ice Storm Unlimited" test ) and I'm getting consistently 1-2% better results with the patch. I also tried the mentioned modification "pulling the if statement out of check_and_emit_atom() and into the caller" and it has same performance benefits. What I can tell from assembly dump is that brw_upload_render_state function becomes slightly shorter, there's movement with call and jmp and dump with the patch has less mov and nopl calls in total.


Let me make one more attempt, we clearly see icache misses for 
brw_upload_pipeline_state also
reduced with patch against without patch. Do you think that too is noise with 
*.o.wtf?
I would be happy to learn if that's the case to understand all this better.

I believe there are two reasons why this should go in
1. It consistently benefits android (we are ok to rename it), helps reduce 
overhead
2. It doesn't harm other platforms

We are ok to drop it if above two claims can be negated with someone else other 
than us testing
this. Otherwise can you please reconsider?
_______________________________________________
mesa-dev mailing list
mesa-dev@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/mesa-dev

_______________________________________________
mesa-dev mailing list
mesa-dev@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/mesa-dev

Reply via email to