On 05/17/2018 03:55 PM, Segher Boessenkool wrote:
On Thu, May 17, 2018 at 12:49:58PM +0200, Christophe Leroy wrote:
In my 8xx configuration, I get 208 calls to memcmp()
Within those 208 calls, about half of them have constant sizes,
46 have a size of 8, 17 have a size of 16, only a few have a
size over 16. Other fixed sizes are mostly 4, 6 and 10.

This patch inlines calls to memcmp() when size
is constant and lower than or equal to 16

In my 8xx configuration, this reduces the number of calls
to memcmp() from 208 to 123

The following table shows the number of TB timeticks to perform
a constant size memcmp() before and after the patch depending on
the size

        Before  After   Improvement
01:      7577    5682   25%
02:     41668    5682   86%
03:     51137   13258   74%
04:     45455    5682   87%
05:     58713   13258   77%
06:     58712   13258   77%
07:     68183   20834   70%
08:     56819   15153   73%
09:     70077   28411   60%
10:     70077   28411   60%
11:     79546   35986   55%
12:     68182   28411   58%
13:     81440   35986   55%
14:     81440   39774   51%
15:     94697   43562   54%
16:     79546   37881   52%

Could you show results with a more recent GCC?  What version was this?

It was with the latest GCC version I have available in my environment, that is GCC 5.4. Is that too old ?

It seems that version inlines memcmp() when length is 1. All other lengths call memcmp()


What is this really measuring?  I doubt it takes 7577 (or 5682) timebase
ticks to do a 1-byte memcmp, which is just 3 instructions after all.

Well I looked again in my tests and it seems some results are wrong, can remember why, I probably did something wrong when I did the tests.

Anyway, the principle is to call a function tstcmpX() 100000 times from a loop, and getting the mftb before and after the loop. Then we remove from the elapsed time the time spent when calling tstcmp0() which is only a blr.
Therefore, we get really the time spent in the comparison only.

Here is the loop:

c06243b0:       7f ac 42 e6     mftb    r29
c06243b4:       3f 60 00 01     lis     r27,1
c06243b8:       63 7b 86 a0     ori     r27,r27,34464
c06243bc:       38 a0 00 02     li      r5,2
c06243c0:       7f c4 f3 78     mr      r4,r30
c06243c4:       7f 83 e3 78     mr      r3,r28
c06243c8:       4b 9e 8c 09     bl      c000cfd0 <tstcmp2>
c06243cc:       2c 1b 00 01     cmpwi   r27,1
c06243d0:       3b 7b ff ff     addi    r27,r27,-1
c06243d4:       40 82 ff e8     bne     c06243bc <setup_arch+0x294>
c06243d8:       7c ac 42 e6     mftb    r5
c06243dc:       7c bd 28 50     subf    r5,r29,r5
c06243e0:       7c bf 28 50     subf    r5,r31,r5
c06243e4:       38 80 00 02     li      r4,2
c06243e8:       7f 43 d3 78     mr      r3,r26
c06243ec:       4b a2 e4 45     bl      c0052830 <printk>

Before the patch:
c000cfc4 <tstcmp0>:
c000cfc4:       4e 80 00 20     blr

c000cfc8 <tstcmp1>:
c000cfc8:       38 a0 00 01     li      r5,1
c000cfcc:       48 00 72 08     b       c00141d4 <__memcmp>

c000cfd0 <tstcmp2>:
c000cfd0:       38 a0 00 02     li      r5,2
c000cfd4:       48 00 72 00     b       c00141d4 <__memcmp>

c000cfd8 <tstcmp3>:
c000cfd8:       38 a0 00 03     li      r5,3
c000cfdc:       48 00 71 f8     b       c00141d4 <__memcmp>

After the patch:
c000cfc4 <tstcmp0>:
c000cfc4:       4e 80 00 20     blr

c000cfd8 <tstcmp1>:
c000cfd8:       88 64 00 00     lbz     r3,0(r4)
c000cfdc:       89 25 00 00     lbz     r9,0(r5)
c000cfe0:       7c 69 18 50     subf    r3,r9,r3
c000cfe4:       4e 80 00 20     blr

c000cfe8 <tstcmp2>:
c000cfe8:       a0 64 00 00     lhz     r3,0(r4)
c000cfec:       a1 25 00 00     lhz     r9,0(r5)
c000cff0:       7c 69 18 50     subf    r3,r9,r3
c000cff4:       4e 80 00 20     blr

c000cff8 <tstcmp3>:
c000cff8:       a1 24 00 00     lhz     r9,0(r4)
c000cffc:       a0 65 00 00     lhz     r3,0(r5)
c000d000:       7c 63 48 51     subf.   r3,r3,r9
c000d004:       4c 82 00 20     bnelr
c000d008:       88 64 00 02     lbz     r3,2(r4)
c000d00c:       89 25 00 02     lbz     r9,2(r5)
c000d010:       7c 69 18 50     subf    r3,r9,r3
c000d014:       4e 80 00 20     blr

c000d018 <tstcmp4>:
c000d018:       80 64 00 00     lwz     r3,0(r4)
c000d01c:       81 25 00 00     lwz     r9,0(r5)
c000d020:       7c 69 18 50     subf    r3,r9,r3
c000d024:       4e 80 00 20     blr

c000d028 <tstcmp5>:
c000d028:       81 24 00 00     lwz     r9,0(r4)
c000d02c:       80 65 00 00     lwz     r3,0(r5)
c000d030:       7c 63 48 51     subf.   r3,r3,r9
c000d034:       4c 82 00 20     bnelr
c000d038:       88 64 00 04     lbz     r3,4(r4)
c000d03c:       89 25 00 04     lbz     r9,4(r5)
c000d040:       7c 69 18 50     subf    r3,r9,r3
c000d044:       4e 80 00 20     blr

c000d048 <tstcmp6>:
c000d048:       81 24 00 00     lwz     r9,0(r4)
c000d04c:       80 65 00 00     lwz     r3,0(r5)
c000d050:       7c 63 48 51     subf.   r3,r3,r9
c000d054:       4c 82 00 20     bnelr
c000d058:       a0 64 00 04     lhz     r3,4(r4)
c000d05c:       a1 25 00 04     lhz     r9,4(r5)
c000d060:       7c 69 18 50     subf    r3,r9,r3
c000d064:       4e 80 00 20     blr

c000d068 <tstcmp7>:
c000d068:       81 24 00 00     lwz     r9,0(r4)
c000d06c:       80 65 00 00     lwz     r3,0(r5)
c000d070:       7d 23 48 51     subf.   r9,r3,r9
c000d074:       40 82 00 20     bne     c000d094 <tstcmp7+0x2c>
c000d078:       a0 64 00 04     lhz     r3,4(r4)
c000d07c:       a1 25 00 04     lhz     r9,4(r5)
c000d080:       7d 29 18 51     subf.   r9,r9,r3
c000d084:       40 82 00 10     bne     c000d094 <tstcmp7+0x2c>
c000d088:       88 64 00 06     lbz     r3,6(r4)
c000d08c:       89 25 00 06     lbz     r9,6(r5)
c000d090:       7d 29 18 50     subf    r9,r9,r3
c000d094:       7d 23 4b 78     mr      r3,r9
c000d098:       4e 80 00 20     blr

c000d09c <tstcmp8>:
c000d09c:       81 25 00 04     lwz     r9,4(r5)
c000d0a0:       80 64 00 04     lwz     r3,4(r4)
c000d0a4:       81 04 00 00     lwz     r8,0(r4)
c000d0a8:       81 45 00 00     lwz     r10,0(r5)
c000d0ac:       7c 69 18 10     subfc   r3,r9,r3
c000d0b0:       7d 2a 41 10     subfe   r9,r10,r8
c000d0b4:       7d 2a fe 70     srawi   r10,r9,31
c000d0b8:       7d 48 4b 79     or.     r8,r10,r9
c000d0bc:       4d a2 00 20     bclr+   12,eq
c000d0c0:       7d 23 4b 78     mr      r3,r9
c000d0c4:       4e 80 00 20     blr

c000d0c8 <tstcmp9>:
c000d0c8:       81 25 00 04     lwz     r9,4(r5)
c000d0cc:       80 64 00 04     lwz     r3,4(r4)
c000d0d0:       81 04 00 00     lwz     r8,0(r4)
c000d0d4:       81 45 00 00     lwz     r10,0(r5)
c000d0d8:       7c 69 18 10     subfc   r3,r9,r3
c000d0dc:       7d 2a 41 10     subfe   r9,r10,r8
c000d0e0:       7d 2a fe 70     srawi   r10,r9,31
c000d0e4:       7d 48 4b 79     or.     r8,r10,r9
c000d0e8:       41 82 00 08     beq     c000d0f0 <tstcmp9+0x28>
c000d0ec:       7d 23 4b 78     mr      r3,r9
c000d0f0:       2f 83 00 00     cmpwi   cr7,r3,0
c000d0f4:       4c 9e 00 20     bnelr   cr7
c000d0f8:       88 64 00 08     lbz     r3,8(r4)
c000d0fc:       89 25 00 08     lbz     r9,8(r5)
c000d100:       7c 69 18 50     subf    r3,r9,r3
c000d104:       4e 80 00 20     blr

This shows that on PPC32, the 8 bytes comparison is not optimal, I will improve it.

We also see in tstcmp7() that GCC is a bit stupid, it should use r3 as result of the sub as he does with all previous ones, then do bnelr instead of bne+mr+blr

Below are the results of the measurement redone today:

    Before After  Improvment
01  24621   5681  77%
02  24621   5681  77%
03  34091  13257  61%
04  28409   5681  80%
05  41667  13257  68%
06  41668  13257  68%
07  51138  22727  56%
08  39772  15151  62%
09  53031  28409  46%
10  53031  28409  46%
11  62501  35986  42%
12  51137  28410  44%
13  64395  35985  44%
14  68182  39774  42%
15  73865  43560  41%
16  62500  37879  39%

We also see here that 08 is not optimal, it should have given same results as 05 and 06. I will keep it as is for PPC64 but will rewrite it as two 04 comparisons for PPC32

Christophe





Segher

Reply via email to