On Wed, Jul 8, 2015 at 4:53 PM, Chris Wilson <ch...@chris-wilson.co.uk> wrote: > On Wed, Jul 08, 2015 at 03:33:17PM -0700, Matt Turner wrote: >> On Wed, Jul 8, 2015 at 2:07 PM, Chris Wilson <ch...@chris-wilson.co.uk> >> wrote: >> > On Wed, Jul 08, 2015 at 02:00:02PM -0700, Matt Turner wrote: >> >> By keeping a pointer to the next available location, we reduce the >> >> number of memory accesses needed to write to the batchbuffer. >> >> >> >> A net ~7k reduction of .text size, 7.5k of which is from the change to >> >> intel_batchbuffer_emit_dword(). >> >> >> >> text data bss dec hex filename >> >> 4943740 195152 26192 5165084 4ed01c i965_dri.so before >> >> 4936804 195152 26192 5158148 4eb504 i965_dri.so after >> >> >> >> Combined with the previous patch, improves performance of Synmark >> >> OglBatch7 by 4.05914% +/- 1.49373% (n=270) on Haswell. >> >> --- >> >> Full disclosure: when testing on an IVB desktop, I measured a >> >> regression in the same benchmark of -4.19005% +/- 1.15188% (n=30). >> >> I don't have any explanation. >> > >> > The problem is that it seems to generate worse code with multiple >> > adjacent emit_dwords. I have seen similar regressions when doing the >> > same batch[index] to *batch++ elsewhere. >> > -Chris >> >> That is in conflict with the data I've provided. In fact, I started by >> noticing that if I added intel_batchbuffer_emit_dword* functions that >> took multiple dwords that there was a reduction in .text size, so it's >> something that I've considered. > > This is part of a disassembly of the index version of > gen6_viewport_state.c::upload_viewport_state_pointers() > (ivb with -march=native -Ofast) > > 0x0000000000000287 <+103>: lea 0x1(%rax),%esi > 0x000000000000028a <+106>: mov %si,0x22f24(%rdi) > 0x0000000000000291 <+113>: mov %ecx,(%rdx,%rax,4) > 0x0000000000000294 <+116>: movzwl 0x22f24(%rdi),%eax > 0x000000000000029b <+123>: mov 0x24330(%rdi),%ecx > 0x00000000000002a1 <+129>: mov 0x22f08(%rdi),%rdx > 0x00000000000002a8 <+136>: lea 0x1(%rax),%esi > 0x00000000000002ab <+139>: mov %si,0x22f24(%rdi) > 0x00000000000002b2 <+146>: mov %ecx,(%rdx,%rax,4) > 0x00000000000002b5 <+149>: movzwl 0x22f24(%rdi),%eax > 0x00000000000002bc <+156>: mov 0x2480c(%rdi),%ecx > 0x00000000000002c2 <+162>: mov 0x22f08(%rdi),%rdx > 0x00000000000002c9 <+169>: lea 0x1(%rax),%esi > 0x00000000000002cc <+172>: mov %si,0x22f24(%rdi) > 0x00000000000002d3 <+179>: mov %ecx,(%rdx,%rax,4) > 0x00000000000002d6 <+182>: pop %rbp > 0x00000000000002d7 <+183>: retq > > and for comparison the pointer version: > > 0x0000000000000280 <+96>: lea 0x4(%rax),%rcx > 0x0000000000000284 <+100>: mov %rcx,0x22f10(%rdi) > 0x000000000000028b <+107>: mov %edx,(%rax) > 0x000000000000028d <+109>: mov 0x22f10(%rdi),%rax > 0x0000000000000294 <+116>: mov 0x24338(%rdi),%edx > 0x000000000000029a <+122>: lea 0x4(%rax),%rcx > 0x000000000000029e <+126>: mov %rcx,0x22f10(%rdi) > 0x00000000000002a5 <+133>: mov %edx,(%rax) > 0x00000000000002a7 <+135>: mov 0x22f10(%rdi),%rax > 0x00000000000002ae <+142>: mov 0x24814(%rdi),%edx > 0x00000000000002b4 <+148>: lea 0x4(%rax),%rcx > 0x00000000000002b8 <+152>: mov %rcx,0x22f10(%rdi) > 0x00000000000002bf <+159>: mov %edx,(%rax) > 0x00000000000002c1 <+161>: pop %rbp > 0x00000000000002c2 <+162>: retq > > So in neither case does gcc avoid incrementing either the index or the > pointer in the struct after emit_dword, and then reloads it for the > next. > > This is what I expected to see > 0x000000000000025e <+62>: movl $0x780d1c02,(%rcx) > 0x0000000000000264 <+68>: mov 0x22f08(%rdi),%rax > 0x000000000000026b <+75>: mov 0x24320(%rdi),%edx > 0x0000000000000271 <+81>: mov %edx,0x4(%rax) > 0x0000000000000274 <+84>: mov 0x22f08(%rdi),%rax > 0x000000000000027b <+91>: mov 0x24338(%rdi),%edx > 0x0000000000000281 <+97>: mov %edx,0x8(%rax) > 0x0000000000000284 <+100>: mov 0x22f08(%rdi),%rax > 0x000000000000028b <+107>: mov 0x24814(%rdi),%edx > 0x0000000000000291 <+113>: mov %edx,0xc(%rax) > 0x0000000000000294 <+116>: addq $0x10,0x22f08(%rdi) > 0x000000000000029c <+124>: pop %rbp > 0x000000000000029d <+125>: retq > with the pointer increments coalesced to the end. Generated by > opencoding emit_dwords as > > static void upload_viewport_state_pointers(struct brw_context *brw) > { > BEGIN_BATCH(4); > brw->batch.map[0] = (_3DSTATE_VIEWPORT_STATE_POINTERS << 16 | (4 - 2) | > GEN6_CC_VIEWPORT_MODIFY | > GEN6_SF_VIEWPORT_MODIFY | > GEN6_CLIP_VIEWPORT_MODIFY); > brw->batch.map[1] = (brw->clip.vp_offset); > brw->batch.map[2] = (brw->sf.vp_offset); > brw->batch.map[3] = (brw->cc.vp_offset); > brw->batch.map += 4; > ADVANCE_BATCH(); > } > -Chris
Ah, thanks. I see. I'll give it another shot. _______________________________________________ mesa-dev mailing list mesa-dev@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/mesa-dev