Re: [Mesa-dev] [PATCH V3] i965 : Optimize atom state flag checks
Tapani Pälli <tapani.pa...@intel.com> writes: > Hello; > > On 07/25/2017 05:17 PM, Marathe, Yogesh wrote: >> Hi Matt, Sorry for late reply, please see below. >> >>> -Original Message- >>> From: mesa-dev [mailto:mesa-dev-boun...@lists.freedesktop.org] On Behalf >>> Of Matt Turner >>> Sent: Saturday, July 22, 2017 12:12 AM >>> To: Muthukumar, Aravindan <aravindan.muthuku...@intel.com> >>> Cc: mesa-dev@lists.freedesktop.org; Marathe, Yogesh >>> <yogesh.mara...@intel.com> >>> Subject: Re: [Mesa-dev] [PATCH V3] i965 : Optimize atom state flag checks >>> >>> On 07/21, aravindan.muthuku...@intel.com wrote: >>>> From: Aravindan Muthukumar <aravindan.muthuku...@intel.com> >>>> >>>> This patch improves CPI Rate(Cycles per Instruction) and branch miss >>>> predict for i965. The function check_state() was showing CPI retired rate. >>>> >>>> Performance stats with android: >>>> - CPI retired lowered by 28% (lower is better) >>>> - Branch missprediction lowered by 13% (lower is better) >>>> - 3DMark improved by 2% >>>> >>>> The dissassembly doesn't show difference, although above results were >>>> observed with patch. >>> >> >> Yes this is true for V3 where we removed the function based on a review >> comment. >> >>> If there's no difference in the assembly then whatever measurements you have >>> taken must be noise. >>> >> >> No that's not guaranteed either. Lot of things still depend on how >> instructions are >> aligned, sometimes even changing linking order gives different results where >> disassemblies of individual functions remain same. >> >>> I applied the patch and inspected brw_state_upload.o. There are assembly >>> differences. I can produce the same assembly as this patch by just pulling >>> the if >>> statement out of check_and_emit_atom() and into the caller. The replacement >>> of check_state() with a macro is completely unnecessary. >>> >>> >>> diff --git a/src/mesa/drivers/dri/i965/brw_state_upload.c >>> b/src/mesa/drivers/dri/i965/brw_state_upload.c >>> index acaa97ee7d..b163e1490d 100644 >>> --- a/src/mesa/drivers/dri/i965/brw_state_upload.c >>> +++ b/src/mesa/drivers/dri/i965/brw_state_upload.c >>> @@ -439,14 +439,12 @@ merge_ctx_state(struct brw_context *brw, } >>> >>> static inline void >>> -check_and_emit_atom(struct brw_context *brw, >>> -struct brw_state_flags *state, >>> -const struct brw_tracked_state *atom) >>> +emit_atom(struct brw_context *brw, >>> + struct brw_state_flags *state, >>> + const struct brw_tracked_state *atom) >>> { >>> - if (check_state(state, >dirty)) { >>> - atom->emit(brw); >>> - merge_ctx_state(brw, state); >>> - } >>> + atom->emit(brw); >>> + merge_ctx_state(brw, state); >>> } >>> >>> static inline void >>> @@ -541,7 +539,9 @@ brw_upload_pipeline_state(struct brw_context *brw, >>> const struct brw_tracked_state *atom = [i]; >>> struct brw_state_flags generated; >>> >>> - check_and_emit_atom(brw, , atom); >>> + if (check_state(, >dirty)) { >>> +emit_atom(brw, , atom); >>> + } >>> >>> accumulate_state(, >dirty); >>> >>> @@ -558,7 +558,9 @@ brw_upload_pipeline_state(struct brw_context *brw, >>> for (i = 0; i < num_atoms; i++) { >>> const struct brw_tracked_state *atom = [i]; >>> >>> - check_and_emit_atom(brw, , atom); >>> + if (check_state(, >dirty)) { >>> +emit_atom(brw, , atom); >>> + } >>> } >>> } >>> >>> >>> With that said, the assembly differences are not understandable to me. >>> Why should extracting an if statement from a static inline function into >>> the caller >>> of that function cause any difference whatsoever? >> >> Agreed, it shouldn't in case of static inline. >> >>> >>> I'm viewing the assembly differences with: >>> >>> wdiff -n \ >>> -w $'\033[30;41m' -x $'\033[0m' \ >>> -y $'\033[30;42m' -z $'\033[0m' \ >>> <(objdump -d brw_state_upload.
Re: [Mesa-dev] [PATCH V3] i965 : Optimize atom state flag checks
Hello; On 07/25/2017 05:17 PM, Marathe, Yogesh wrote: Hi Matt, Sorry for late reply, please see below. -Original Message- From: mesa-dev [mailto:mesa-dev-boun...@lists.freedesktop.org] On Behalf Of Matt Turner Sent: Saturday, July 22, 2017 12:12 AM To: Muthukumar, Aravindan <aravindan.muthuku...@intel.com> Cc: mesa-dev@lists.freedesktop.org; Marathe, Yogesh <yogesh.mara...@intel.com> Subject: Re: [Mesa-dev] [PATCH V3] i965 : Optimize atom state flag checks On 07/21, aravindan.muthuku...@intel.com wrote: From: Aravindan Muthukumar <aravindan.muthuku...@intel.com> This patch improves CPI Rate(Cycles per Instruction) and branch miss predict for i965. The function check_state() was showing CPI retired rate. Performance stats with android: - CPI retired lowered by 28% (lower is better) - Branch missprediction lowered by 13% (lower is better) - 3DMark improved by 2% The dissassembly doesn't show difference, although above results were observed with patch. Yes this is true for V3 where we removed the function based on a review comment. If there's no difference in the assembly then whatever measurements you have taken must be noise. No that's not guaranteed either. Lot of things still depend on how instructions are aligned, sometimes even changing linking order gives different results where disassemblies of individual functions remain same. I applied the patch and inspected brw_state_upload.o. There are assembly differences. I can produce the same assembly as this patch by just pulling the if statement out of check_and_emit_atom() and into the caller. The replacement of check_state() with a macro is completely unnecessary. diff --git a/src/mesa/drivers/dri/i965/brw_state_upload.c b/src/mesa/drivers/dri/i965/brw_state_upload.c index acaa97ee7d..b163e1490d 100644 --- a/src/mesa/drivers/dri/i965/brw_state_upload.c +++ b/src/mesa/drivers/dri/i965/brw_state_upload.c @@ -439,14 +439,12 @@ merge_ctx_state(struct brw_context *brw, } static inline void -check_and_emit_atom(struct brw_context *brw, -struct brw_state_flags *state, -const struct brw_tracked_state *atom) +emit_atom(struct brw_context *brw, + struct brw_state_flags *state, + const struct brw_tracked_state *atom) { - if (check_state(state, >dirty)) { - atom->emit(brw); - merge_ctx_state(brw, state); - } + atom->emit(brw); + merge_ctx_state(brw, state); } static inline void @@ -541,7 +539,9 @@ brw_upload_pipeline_state(struct brw_context *brw, const struct brw_tracked_state *atom = [i]; struct brw_state_flags generated; - check_and_emit_atom(brw, , atom); + if (check_state(, >dirty)) { +emit_atom(brw, , atom); + } accumulate_state(, >dirty); @@ -558,7 +558,9 @@ brw_upload_pipeline_state(struct brw_context *brw, for (i = 0; i < num_atoms; i++) { const struct brw_tracked_state *atom = [i]; - check_and_emit_atom(brw, , atom); + if (check_state(, >dirty)) { +emit_atom(brw, , atom); + } } } With that said, the assembly differences are not understandable to me. Why should extracting an if statement from a static inline function into the caller of that function cause any difference whatsoever? Agreed, it shouldn't in case of static inline. I'm viewing the assembly differences with: wdiff -n \ -w $'\033[30;41m' -x $'\033[0m' \ -y $'\033[30;42m' -z $'\033[0m' \ <(objdump -d brw_state_upload.o.before | sed -e 's/^.*\t//') \ <(objdump -d brw_state_upload.o.wtf| sed -e 's/^.*\t//') | less -R and the only real difference is the movement of some call and jmp instructions. I cannot take the patch without some reasonable explanation for the change. Ok I think this has been discussed already and we agree that there is no big visible difference in disassembly which can be pointed out for improvement. Although, as you said this movement of instructions can cause this. If with this patch instructions get cache aligned that too can show improvement. This is a busy function with bad CPI. Hence chosen for optimization. Branch miss predict is another metric. Do we want to consider all these or just disassembly? I don't have a reasonable explanation to give but I made some benchmark runs today (3DMark "Ice Storm Unlimited" test ) and I'm getting consistently 1-2% better results with the patch. I also tried the mentioned modification "pulling the if statement out of check_and_emit_atom() and into the caller" and it has same performance benefits. What I can tell from assembly dump is that brw_upload_render_state function becomes slightly shorter, there's movement with call and jmp and dump with the patch has less mov and nopl calls in total. Let me make one more attempt, we clearly see icache misses for brw_upload
Re: [Mesa-dev] [PATCH V3] i965 : Optimize atom state flag checks
Hi Matt, Sorry for late reply, please see below. > -Original Message- > From: mesa-dev [mailto:mesa-dev-boun...@lists.freedesktop.org] On Behalf > Of Matt Turner > Sent: Saturday, July 22, 2017 12:12 AM > To: Muthukumar, Aravindan <aravindan.muthuku...@intel.com> > Cc: mesa-dev@lists.freedesktop.org; Marathe, Yogesh > <yogesh.mara...@intel.com> > Subject: Re: [Mesa-dev] [PATCH V3] i965 : Optimize atom state flag checks > > On 07/21, aravindan.muthuku...@intel.com wrote: > >From: Aravindan Muthukumar <aravindan.muthuku...@intel.com> > > > >This patch improves CPI Rate(Cycles per Instruction) and branch miss > >predict for i965. The function check_state() was showing CPI retired rate. > > > >Performance stats with android: > >- CPI retired lowered by 28% (lower is better) > >- Branch missprediction lowered by 13% (lower is better) > >- 3DMark improved by 2% > > > >The dissassembly doesn't show difference, although above results were > >observed with patch. > Yes this is true for V3 where we removed the function based on a review comment. > If there's no difference in the assembly then whatever measurements you have > taken must be noise. > No that's not guaranteed either. Lot of things still depend on how instructions are aligned, sometimes even changing linking order gives different results where disassemblies of individual functions remain same. > I applied the patch and inspected brw_state_upload.o. There are assembly > differences. I can produce the same assembly as this patch by just pulling > the if > statement out of check_and_emit_atom() and into the caller. The replacement > of check_state() with a macro is completely unnecessary. > > > diff --git a/src/mesa/drivers/dri/i965/brw_state_upload.c > b/src/mesa/drivers/dri/i965/brw_state_upload.c > index acaa97ee7d..b163e1490d 100644 > --- a/src/mesa/drivers/dri/i965/brw_state_upload.c > +++ b/src/mesa/drivers/dri/i965/brw_state_upload.c > @@ -439,14 +439,12 @@ merge_ctx_state(struct brw_context *brw, } > > static inline void > -check_and_emit_atom(struct brw_context *brw, > -struct brw_state_flags *state, > -const struct brw_tracked_state *atom) > +emit_atom(struct brw_context *brw, > + struct brw_state_flags *state, > + const struct brw_tracked_state *atom) > { > - if (check_state(state, >dirty)) { > - atom->emit(brw); > - merge_ctx_state(brw, state); > - } > + atom->emit(brw); > + merge_ctx_state(brw, state); > } > > static inline void > @@ -541,7 +539,9 @@ brw_upload_pipeline_state(struct brw_context *brw, >const struct brw_tracked_state *atom = [i]; >struct brw_state_flags generated; > > - check_and_emit_atom(brw, , atom); > + if (check_state(, >dirty)) { > +emit_atom(brw, , atom); > + } > >accumulate_state(, >dirty); > > @@ -558,7 +558,9 @@ brw_upload_pipeline_state(struct brw_context *brw, >for (i = 0; i < num_atoms; i++) { >const struct brw_tracked_state *atom = [i]; > > - check_and_emit_atom(brw, , atom); > + if (check_state(, >dirty)) { > +emit_atom(brw, , atom); > + } >} > } > > > With that said, the assembly differences are not understandable to me. > Why should extracting an if statement from a static inline function into the > caller > of that function cause any difference whatsoever? Agreed, it shouldn't in case of static inline. > > I'm viewing the assembly differences with: > > wdiff -n \ > -w $'\033[30;41m' -x $'\033[0m' \ > -y $'\033[30;42m' -z $'\033[0m' \ > <(objdump -d brw_state_upload.o.before | sed -e 's/^.*\t//') \ > <(objdump -d brw_state_upload.o.wtf| sed -e 's/^.*\t//') | less -R > > and the only real difference is the movement of some call and jmp > instructions. > > I cannot take the patch without some reasonable explanation for the change. Ok I think this has been discussed already and we agree that there is no big visible difference in disassembly which can be pointed out for improvement. Although, as you said this movement of instructions can cause this. If with this patch instructions get cache aligned that too can show improvement. This is a busy function with bad CPI. Hence chosen for optimization. Branch miss predict is another metric. Do we want to consider all these or just disassembly? Let me make one more attempt, we clearly see icache misses for brw_upload_pipeline_state also reduced with patch against without patch. Do you think that too is noise with *.o.wtf? I would be happy t
Re: [Mesa-dev] [PATCH V3] i965 : Optimize atom state flag checks
On 07/21, aravindan.muthuku...@intel.com wrote: From: Aravindan MuthukumarThis patch improves CPI Rate(Cycles per Instruction) and branch miss predict for i965. The function check_state() was showing CPI retired rate. Performance stats with android: - CPI retired lowered by 28% (lower is better) - Branch missprediction lowered by 13% (lower is better) - 3DMark improved by 2% The dissassembly doesn't show difference, although above results were observed with patch. If there's no difference in the assembly then whatever measurements you have taken must be noise. I applied the patch and inspected brw_state_upload.o. There are assembly differences. I can produce the same assembly as this patch by just pulling the if statement out of check_and_emit_atom() and into the caller. The replacement of check_state() with a macro is completely unnecessary. diff --git a/src/mesa/drivers/dri/i965/brw_state_upload.c b/src/mesa/drivers/dri/i965/brw_state_upload.c index acaa97ee7d..b163e1490d 100644 --- a/src/mesa/drivers/dri/i965/brw_state_upload.c +++ b/src/mesa/drivers/dri/i965/brw_state_upload.c @@ -439,14 +439,12 @@ merge_ctx_state(struct brw_context *brw, } static inline void -check_and_emit_atom(struct brw_context *brw, -struct brw_state_flags *state, -const struct brw_tracked_state *atom) +emit_atom(struct brw_context *brw, + struct brw_state_flags *state, + const struct brw_tracked_state *atom) { - if (check_state(state, >dirty)) { - atom->emit(brw); - merge_ctx_state(brw, state); - } + atom->emit(brw); + merge_ctx_state(brw, state); } static inline void @@ -541,7 +539,9 @@ brw_upload_pipeline_state(struct brw_context *brw, const struct brw_tracked_state *atom = [i]; struct brw_state_flags generated; - check_and_emit_atom(brw, , atom); + if (check_state(, >dirty)) { +emit_atom(brw, , atom); + } accumulate_state(, >dirty); @@ -558,7 +558,9 @@ brw_upload_pipeline_state(struct brw_context *brw, for (i = 0; i < num_atoms; i++) { const struct brw_tracked_state *atom = [i]; - check_and_emit_atom(brw, , atom); + if (check_state(, >dirty)) { +emit_atom(brw, , atom); + } } } With that said, the assembly differences are not understandable to me. Why should extracting an if statement from a static inline function into the caller of that function cause any difference whatsoever? I'm viewing the assembly differences with: wdiff -n \ -w $'\033[30;41m' -x $'\033[0m' \ -y $'\033[30;42m' -z $'\033[0m' \ <(objdump -d brw_state_upload.o.before | sed -e 's/^.*\t//') \ <(objdump -d brw_state_upload.o.wtf| sed -e 's/^.*\t//') | less -R and the only real difference is the movement of some call and jmp instructions. I cannot take the patch without some reasonable explanation for the change. signature.asc Description: Digital signature ___ mesa-dev mailing list mesa-dev@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/mesa-dev
[Mesa-dev] [PATCH V3] i965 : Optimize atom state flag checks
From: Aravindan MuthukumarThis patch improves CPI Rate(Cycles per Instruction) and branch miss predict for i965. The function check_state() was showing CPI retired rate. Performance stats with android: - CPI retired lowered by 28% (lower is better) - Branch missprediction lowered by 13% (lower is better) - 3DMark improved by 2% The dissassembly doesn't show difference, although above results were observed with patch. V2: - Removed memset() change - Changed commit message as per review comments V3: - Indentation and changes to remove check_state as function Signed-off-by: Aravindan Muthukumar Signed-off-by: Yogesh Marathe Tested-by: Asish --- src/mesa/drivers/dri/i965/brw_defines.h | 4 src/mesa/drivers/dri/i965/brw_state_upload.c | 24 +++- 2 files changed, 15 insertions(+), 13 deletions(-) diff --git a/src/mesa/drivers/dri/i965/brw_defines.h b/src/mesa/drivers/dri/i965/brw_defines.h index 2a8dbf8..8c9a510 100644 --- a/src/mesa/drivers/dri/i965/brw_defines.h +++ b/src/mesa/drivers/dri/i965/brw_defines.h @@ -1687,3 +1687,7 @@ enum brw_pixel_shader_coverage_mask_mode { # define CSDBG2_CONSTANT_BUFFER_ADDRESS_OFFSET_DISABLE (1 << 4) #endif + +/* Checking the state of mesa and brw before emitting atoms */ +#define CHECK_BRW_STATE(a,b) ((a.mesa & b.mesa) | (a.brw & b.brw)) + diff --git a/src/mesa/drivers/dri/i965/brw_state_upload.c b/src/mesa/drivers/dri/i965/brw_state_upload.c index acaa97e..1846624 100644 --- a/src/mesa/drivers/dri/i965/brw_state_upload.c +++ b/src/mesa/drivers/dri/i965/brw_state_upload.c @@ -217,12 +217,6 @@ void brw_destroy_state( struct brw_context *brw ) /*** */ -static bool -check_state(const struct brw_state_flags *a, const struct brw_state_flags *b) -{ - return ((a->mesa & b->mesa) | (a->brw & b->brw)) != 0; -} - static void accumulate_state( struct brw_state_flags *a, const struct brw_state_flags *b ) { @@ -443,10 +437,8 @@ check_and_emit_atom(struct brw_context *brw, struct brw_state_flags *state, const struct brw_tracked_state *atom) { - if (check_state(state, >dirty)) { - atom->emit(brw); - merge_ctx_state(brw, state); - } + atom->emit(brw); + merge_ctx_state(brw, state); } static inline void @@ -541,7 +533,10 @@ brw_upload_pipeline_state(struct brw_context *brw, const struct brw_tracked_state *atom = [i]; struct brw_state_flags generated; - check_and_emit_atom(brw, , atom); + /* Checking the state and emitting atoms */ + if (CHECK_BRW_STATE(state, atom->dirty)) { +check_and_emit_atom(brw, , atom); + } accumulate_state(, >dirty); @@ -550,7 +545,7 @@ brw_upload_pipeline_state(struct brw_context *brw, * fail; */ xor_states(, , ); -assert(!check_state(, )); +assert(!CHECK_BRW_STATE(examined, generated)); prev = state; } } @@ -558,7 +553,10 @@ brw_upload_pipeline_state(struct brw_context *brw, for (i = 0; i < num_atoms; i++) { const struct brw_tracked_state *atom = [i]; - check_and_emit_atom(brw, , atom); + /* Checking the state and emitting atoms */ + if (CHECK_BRW_STATE(state, atom->dirty)) { +check_and_emit_atom(brw, , atom); + } } } -- 2.7.4 ___ mesa-dev mailing list mesa-dev@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/mesa-dev
[Mesa-dev] [PATCH V3] i965 : Optimize atom state flag checks
From: Aravindan MuthukumarThis patch improves CPI Rate(Cycles per Instruction) and branch miss predict for i965. The function check_state() was showing CPI retired rate. Performance stats with android: - CPI retired lowered by 28% (lower is better) - Branch missprediction lowered by 13% (lower is better) - 3DMark improved by 2% The dissassembly doesn't show difference, although above results were observed with patch. V2: - Removed memset() change - Changed commit message as per review comments V3: - Indentation in check_and_emit_atom and commit msg corrected Signed-off-by: Aravindan Muthukumar Signed-off-by: Yogesh Marathe Tested-by: Asish --- src/mesa/drivers/dri/i965/brw_defines.h | 4 src/mesa/drivers/dri/i965/brw_state_upload.c | 16 ++-- 2 files changed, 14 insertions(+), 6 deletions(-) diff --git a/src/mesa/drivers/dri/i965/brw_defines.h b/src/mesa/drivers/dri/i965/brw_defines.h index 2a8dbf8..8c9a510 100644 --- a/src/mesa/drivers/dri/i965/brw_defines.h +++ b/src/mesa/drivers/dri/i965/brw_defines.h @@ -1687,3 +1687,7 @@ enum brw_pixel_shader_coverage_mask_mode { # define CSDBG2_CONSTANT_BUFFER_ADDRESS_OFFSET_DISABLE (1 << 4) #endif + +/* Checking the state of mesa and brw before emitting atoms */ +#define CHECK_BRW_STATE(a,b) ((a.mesa & b.mesa) | (a.brw & b.brw)) + diff --git a/src/mesa/drivers/dri/i965/brw_state_upload.c b/src/mesa/drivers/dri/i965/brw_state_upload.c index acaa97e..57ac394 100644 --- a/src/mesa/drivers/dri/i965/brw_state_upload.c +++ b/src/mesa/drivers/dri/i965/brw_state_upload.c @@ -443,10 +443,8 @@ check_and_emit_atom(struct brw_context *brw, struct brw_state_flags *state, const struct brw_tracked_state *atom) { - if (check_state(state, >dirty)) { - atom->emit(brw); - merge_ctx_state(brw, state); - } + atom->emit(brw); + merge_ctx_state(brw, state); } static inline void @@ -541,7 +539,10 @@ brw_upload_pipeline_state(struct brw_context *brw, const struct brw_tracked_state *atom = [i]; struct brw_state_flags generated; - check_and_emit_atom(brw, , atom); + /* Checking the state and emitting atoms */ + if (CHECK_BRW_STATE(state, atom->dirty)) { +check_and_emit_atom(brw, , atom); + } accumulate_state(, >dirty); @@ -558,7 +559,10 @@ brw_upload_pipeline_state(struct brw_context *brw, for (i = 0; i < num_atoms; i++) { const struct brw_tracked_state *atom = [i]; - check_and_emit_atom(brw, , atom); + /* Checking the state and emitting atoms */ + if (CHECK_BRW_STATE(state, atom->dirty)) { +check_and_emit_atom(brw, , atom); + } } } -- 2.7.4 ___ mesa-dev mailing list mesa-dev@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/mesa-dev