Hi David,
On Wed, Jan 06, 2016 at 01:23:52PM +0000, David CARLIER wrote:
> Sure, it is mainly gcc 5.2/5.3, sometimes clang 3.6 depending the
> machine I was working on.
I'm back (late) on this patch series.
So at this point I'm seeing that the memmem() and ebmb_lookup() functions
are significantly affected by the change, even more so on machines with
few registers such as 32-bit x86, where it adds extra stack manipulation
because the compiler has no way to know that the pointers are the same.
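For what it's worth, the effect is easy to reproduce with a trivial
pattern like the one below (a hypothetical illustration, not the actual
haproxy code): because a store through an unsigned char pointer may
legally alias any object, the compiler cannot cache *len in a register
and has to re-read it from memory on every iteration, exactly the kind
of extra memory traffic seen here.

```c
#include <stddef.h>

/* Hypothetical example, not the haproxy code: the store through 'dst'
 * may alias *len (char-typed stores alias anything), so the compiler
 * must reload *len from memory on each pass instead of keeping it in
 * a register for the whole loop. */
void fill(unsigned char *dst, unsigned char v, const size_t *len)
{
    for (size_t i = 0; i < *len; i++)   /* *len re-read each pass */
        dst[i] = v;
}
```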
In the case of memmem(), the extra register pressure changed the code to
the point of adding 4 local variables on the stack and causing pointers
to be read from the stack, modified and stored back there, as you can
see in the "diff -y" output below showing the beginning of the function
(the patched version is on the left, the original one on the right).
Stack offsets 0x40, 0x44, 0x48 and 0x4c, which were previously unused,
now hold copies of general purpose registers, and that will definitely
hurt the function's efficiency:
000022e0 <my_memmem>:                               000022e0 <my_memmem>:
    22e0: 55             push %ebp                      22e0: 55             push %ebp
    22e1: 57             push %edi                      22e1: 57             push %edi
    22e2: 56             push %esi                      22e2: 56             push %esi
    22e3: 53             push %ebx                      22e3: 53             push %ebx
    22e4: 83 ec 2c       sub $0x2c,%esp             |   22e4: 83 ec 1c       sub $0x1c,%esp
    22e7: 8b 4c 24 48    mov 0x48(%esp),%ec         |   22e7: 8b 7c 24 38    mov 0x38(%esp),%ed
    22eb: 8b 7c 24 4c    mov 0x4c(%esp),%ed         |   22eb: 8b 74 24 3c    mov 0x3c(%esp),%es
    22ef: 85 c9          test %ecx,%ecx             |   22ef: 85 ff          test %edi,%edi
    22f1: 74 68          je 235b <my_memme          |   22f1: 74 75          je 2368 <my_memme
    22f3: 8b 54 24 40    mov 0x40(%esp),%ed         |   22f3: 8b 4c 24 30    mov 0x30(%esp),%ec
    22f7: 85 d2          test %edx,%edx             |   22f7: 85 c9          test %ecx,%ecx
    22f9: 74 60          je 235b <my_memme          |   22f9: 74 6d          je 2368 <my_memme
    22fb: 39 7c 24 44    cmp %edi,0x44(%esp         |   22fb: 39 74 24 34    cmp %esi,0x34(%esp
    22ff: 72 5a          jb 235b <my_memme          |   22ff: 72 67          jb 2368 <my_memme
    2301: 8b 44 24 48    mov 0x48(%esp),%ea         |   2301: 8a 07          mov (%edi),%al
    2305: 31 d2          xor %edx,%edx              |   2303: 8b 54 24 30    mov 0x30(%esp),%ed
    2307: 8b 5c 24 40    mov 0x40(%esp),%eb         |   2307: 25 ff 00 00 00 and $0xff,%eax
    230b: 31 ed          xor %ebp,%ebp              |   230c: 89 c5          mov %eax,%ebp
    230d: 8a 10          mov (%eax),%dl             |   230e: eb 27          jmp 2337 <my_memme
    230f: 89 54 24 1c    mov %edx,0x1c(%esp         |   2310: 8b 44 24 30    mov 0x30(%esp),%ea
    2313: eb 20          jmp 2335 <my_memme         |   2314: 8b 54 24 34    mov 0x34(%esp),%ed
    2315: 8d 76 00       lea 0x0(%esi),%esi         |   2318: 29 d8          sub %ebx,%eax
Regarding ebmb_lookup(), it's even worse: there is now one load and one
store on the stack in the fast path for each bit of the tree descent,
while the code used to rely exclusively on registers even on such a
machine. I have not run any benchmarks, but the change may well be
measurable on geolocation maps and stick-tables.
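To give an idea of why this matters, a per-bit descent loop boils down
to something like the sketch below (a simplified, hypothetical model,
not the real ebtree code): the whole hot loop only needs 'node', 'key'
and 'bit', which comfortably fit in registers, so one extra stack
load+store per iteration roughly doubles the memory traffic of the
loop body.

```c
#include <stddef.h>

/* Simplified per-bit binary tree node; the real ebtree nodes carry
 * much more state, this only models the descent. */
struct bnode {
    struct bnode *branch[2];
};

/* Descend <levels> levels from the top, picking the branch matching
 * each key bit from the MSB down. Everything lives in registers. */
static struct bnode *descend(struct bnode *node, unsigned char key, int levels)
{
    for (int bit = 7; bit > 7 - levels && node; bit--)
        node = node->branch[(key >> bit) & 1];
    return node;
}
```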
So I'll see what I can come up with using only explicit type casts where
relevant. Normally it should not matter; at worst it would break strict
aliasing, but we already disable that anyway.
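Concretely, the idea would be something along these lines (just a
sketch; read_u8() is a hypothetical helper, not an existing haproxy
function): keep the original pointer type in the function signature so
gcc sees a single pointer, and cast only at the point where a byte is
actually read.

```c
/* Sketch of the "cast at the point of use" approach: the caller keeps
 * one pointer of one type, and the conversion happens only here, so
 * no second pointer variable ever exists to be spilled to the stack.
 * Strictly speaking this can violate aliasing rules, hence the need
 * to build with strict aliasing disabled. */
static inline unsigned char read_u8(const void *p)
{
    return *(const unsigned char *)p;
}
```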
I'll keep you informed.
Best regards,
Willy