Hi, and thanks for the feedback. Indeed, performance will be greatly affected, as you demonstrate, that's for sure. I had completely forgotten the context of my tests back then, and regarding your questions about compilers: I may simply have had a CFLAGS environment variable override in place without noticing it. So if you drop this patchset, I won't be affected :-)
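For the record, here is a minimal sketch of how an environment CFLAGS can silently win over a Makefile default (the file name is made up for the demonstration; haproxy's own Makefile is more involved):

```shell
# A '?=' default in a Makefile only applies when the variable is still
# undefined, so a CFLAGS already present in the environment takes over
# without any warning.
printf 'CFLAGS ?= -O2\nshow:\n\t@echo CFLAGS=$(CFLAGS)\n' > Makefile.cflags-test
env -u CFLAGS make -s -f Makefile.cflags-test show    # CFLAGS=-O2
CFLAGS=-O0   make -s -f Makefile.cflags-test show     # CFLAGS=-O0
```

This is why comparing builds from two machines (or two shells) can quietly compare two different optimization levels.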
Kind regards.

On 16 February 2016 at 16:33, Willy TARREAU <[email protected]> wrote:
> Hi David,
>
> On Wed, Jan 06, 2016 at 01:23:52PM +0000, David CARLIER wrote:
> > Sure, it is mainly gcc 5.2/5.3, sometimes clang 3.6 depending the
> > machine I was working on.
>
> I'm back (late) on this patch series.
>
> So at this point I'm seeing that the memmem() and ebmb_lookup() functions
> are significantly affected by the change, even more on low register count
> machines like 32-bit x86 where this adds extra stack manipulation because
> the compiler has no way to know that the pointers are the same.
>
> In the case of memmem(), the extra register usage changed the code to the
> point of adding 4 local variables to the stack and causing pointers to be
> read from the stack, modified and stored back onto the stack, as you can
> see in the "diff -y" output below showing the beginning of the function,
> the patched one is on the left and the original one on the right. As you
> can see, stack offsets 0x40, 0x44, 0x48 and 0x4c which were previously
> not used are now used to store copies of general purpose registers, which
> will definitely impact the function's efficiency :
>
> 000022e0 <my_memmem>:                              000022e0 <my_memmem>:
>   22e0: 55              push %ebp                    22e0: 55              push %ebp
>   22e1: 57              push %edi                    22e1: 57              push %edi
>   22e2: 56              push %esi                    22e2: 56              push %esi
>   22e3: 53              push %ebx                    22e3: 53              push %ebx
>   22e4: 83 ec 2c        sub $0x2c,%esp            |  22e4: 83 ec 1c        sub $0x1c,%esp
>   22e7: 8b 4c 24 48     mov 0x48(%esp),%ec        |  22e7: 8b 7c 24 38     mov 0x38(%esp),%ed
>   22eb: 8b 7c 24 4c     mov 0x4c(%esp),%ed        |  22eb: 8b 74 24 3c     mov 0x3c(%esp),%es
>   22ef: 85 c9           test %ecx,%ecx            |  22ef: 85 ff           test %edi,%edi
>   22f1: 74 68           je 235b <my_memme         |  22f1: 74 75           je 2368 <my_memme
>   22f3: 8b 54 24 40     mov 0x40(%esp),%ed        |  22f3: 8b 4c 24 30     mov 0x30(%esp),%ec
>   22f7: 85 d2           test %edx,%edx            |  22f7: 85 c9           test %ecx,%ecx
>   22f9: 74 60           je 235b <my_memme         |  22f9: 74 6d           je 2368 <my_memme
>   22fb: 39 7c 24 44     cmp %edi,0x44(%esp        |  22fb: 39 74 24 34     cmp %esi,0x34(%esp
>   22ff: 72 5a           jb 235b <my_memme         |  22ff: 72 67           jb 2368 <my_memme
>   2301: 8b 44 24 48     mov 0x48(%esp),%ea        |  2301: 8a 07           mov (%edi),%al
>   2305: 31 d2           xor %edx,%edx             |  2303: 8b 54 24 30     mov 0x30(%esp),%ed
>   2307: 8b 5c 24 40     mov 0x40(%esp),%eb        |  2307: 25 ff 00 00 00  and $0xff,%eax
>   230b: 31 ed           xor %ebp,%ebp             |  230c: 89 c5           mov %eax,%ebp
>   230d: 8a 10           mov (%eax),%dl            |  230e: eb 27           jmp 2337 <my_memme
>   230f: 89 54 24 1c     mov %edx,0x1c(%esp        |  2310: 8b 44 24 30     mov 0x30(%esp),%ea
>   2313: eb 20           jmp 2335 <my_memme        |  2314: 8b 54 24 34     mov 0x34(%esp),%ed
>   2315: 8d 76 00        lea 0x0(%esi),%esi        |  2318: 29 d8           sub %ebx,%eax
>
> Regarding ebmb_lookup(), it's even worse because there's one load and
> one store on the stack in the fast path for each bit in the tree descent
> while it used to rely exclusively on registers even on such a machine.
> I have not run any benchmarks but it's possible that the change is even
> measurable on geolocation maps and stick-tables.
>
> So I'll see what I can come up with by using only explicit type casts
> where relevant. Normally it should not matter. At worst it would disable
> strict aliasing, but we already disable it anyway.
>
> I'll keep you informed.
>
> Best regards,
> Willy

