On Wed, 15 Jul 2015 15:33:35 +0300 Oded Gabbay <oded.gab...@gmail.com> wrote:
> On Tue, Jul 14, 2015 at 11:41 AM, Siarhei Siamashka > <siarhei.siamas...@gmail.com> wrote: > > It is a good idea to have at least one benchmark result in the commit > > message. Or provide a convincing explanation why this particular code > > is beneficial. The "no changes were observed" commit message does not > > do a good job justifying the need for this patch. > > > > If none of the traces of cairo applications can show any improvements, > > then we can at least use "lowlevel-blt-bench src_8888_8888". I get > > the following results on my Playstation3: > > > > == before == > > > > src_8888_8888 = L1: 437.68 L2: 277.55 M:159.34 (242.58%) > > HT: 90.31 VT: 50.77 R: 50.67 RT: 17.64 ( 148Kops/s) > > > > == after == > > > > src_8888_8888 = L1: 850.60 L2: 453.91 M:174.26 (265.31%) > > HT:105.68 VT: 54.17 R: 54.88 RT: 18.72 ( 154Kops/s) > > > > > > Assuming that the commit message gets updated, > > Acked-by: Siarhei Siamashka <siarhei.siamas...@gmail.com> > > I went and tested the "lowlevel-blt-bench src_8888_8888" with this > patch and to my surprise, the fast path made the results much worse > (Except from L1): > > reference memcpy speed = 25058.0MB/s (6264.5MP/s for 32bpp fills) > > Before After Change > L1 6475.97 7430.21 +14.74% > L2 6019.82 4284.12 -28.83% > M 3004.82 2901.99 -3.42% > HT 1666.21 1278.17 -23.29% > VT 1719.05 1481.61 -13.81% > R 759.46 581.2 -23.47% > RT 218.86 181.07 -17.27% > Kops/s 1464 1292 -11.75% > > So I thought maybe I will see the improvement only in ppc, so I run it > on POWER7 ppc 64 bit: > > reference memcpy speed = 10651.0MB/s (2662.7MP/s for 32bpp fills) > > Before After Change > L1 4165.37 4228.5 +1.52% > L2 4337.16 4241.26 -2.21% > M 1678.75 1607.69 -4.23% > HT 886.92 808.52 -8.84% > VT 781.87 742.73 -5.01% > R 483.37 445.87 -7.76% > RT 175.08 165.24 -5.62% > Kops/s 1193 1135 -4.86% > > Nope, don't see the difference there as well. > > How did you check it on your machine ? Did you check with only > blt+copy area patches, or with other patches applied as well ? It's the difference between having all your patches applied and having all your patches applied, but with the blt+copy code commented out. > Currently, because both these patches don't show improvement in > lowlevel-blt-bench AND in cairo, I'm willing to drop them from this > patch-set until further investigation. Thanks. Yes, I also think that it's a good idea to drop this patch. My point was that every commit with performance needs to be justified by having benchmark results in the commit message. Now we have more than enough benchmarks and I'm quite happy about this. With this kind of optimization, we are competing with the memcpy implementation in glibc. Which is preferably expected to use hand written assembly there because memcpy performance is rather important. But the performance if memcpy in the standard C library is not guaranteed to be good. For example, pixman MIPS assembly code contains 'pixman_mips_fast_memcpy' function, which is just a general purpose memcpy implementation and we would not need to have it in the first place if the glibc code was good enough. Essentially, this kind of pixman code is a workaround for glibc memcpy performance problems. Of course, unless we are providing some extra tuning. Like for example, assume 4 byte granularity for x8r8g8b8 pixel data and reduce setup overhead, while the standard memcpy has a bit more work to do because it works with 1 byte granularity. Or in the case of ARM NEON code, we do software prefetch which can span to the next scanline too. We already don't accept workarounds for GCC bugs without references to reported issues in GCC bugzilla. Probably we should extend these rules to do the same with blt/memcpy? More detailed benchmark/profiling results from my Playstation3 (the current pixman git e2d211ac491cd9884aae7ccaf18e5b3042469cf2 without vmx patches): $ perf record ./lowlevel-blt-bench -m 500 src_8888_8888 && perf report 59.31% lowlevel-blt-be libc-2.15.so [.] _wordcopy_fwd_aligned 11.32% lowlevel-blt-be libc-2.15.so [.] __random 6.51% lowlevel-blt-be lowlevel-blt-bench [.] bench_L 5.68% lowlevel-blt-be libc-2.15.so [.] memcpy 3.03% lowlevel-blt-be [kernel.kallsyms] [k] .raw_local_irq_restore 1.80% lowlevel-blt-be lowlevel-blt-bench [.] pixman_image_composite32 1.35% lowlevel-blt-be libc-2.15.so [.] __random_r 1.10% lowlevel-blt-be lowlevel-blt-bench [.] analyze_extent 1.02% lowlevel-blt-be lowlevel-blt-bench [.] _pixman_image_validate 1.00% lowlevel-blt-be lowlevel-blt-bench [.] _pixman_compute_composite_region32 0.98% lowlevel-blt-be lowlevel-blt-bench [.] bench_RT 0.93% lowlevel-blt-be lowlevel-blt-bench [.] fast_composite_src_memcpy And here is the disassembly of '_wordcopy_fwd_aligned' (glibc-2.15): 00094118 <_wordcopy_fwd_aligned>: 94118: 94 21 ff e0 stwu r1,-32(r1) 9411c: 7d 88 02 a6 mflr r12 94120: 93 c1 00 18 stw r30,24(r1) 94124: 42 9f 00 05 bcl- 20,4*cr7+so,94128 <_wordcopy_fwd_aligned+0x10> 94128: 54 a9 16 fa rlwinm r9,r5,2,27,29 9412c: 7f c8 02 a6 mflr r30 94130: 3f de 00 0f addis r30,r30,15 94134: 7d 88 03 a6 mtlr r12 94138: 3b de be cc addi r30,r30,-16692 9413c: 81 5e e7 28 lwz r10,-6360(r30) 94140: 7d 2a 48 2e lwzx r9,r10,r9 94144: 7d 49 52 14 add r10,r9,r10 94148: 7d 49 03 a6 mtctr r10 9414c: 4e 80 04 20 bctr 94150: 81 44 00 00 lwz r10,0(r4) 94154: 38 84 ff f8 addi r4,r4,-8 94158: 39 23 ff f4 addi r9,r3,-12 9415c: 38 a5 00 02 addi r5,r5,2 94160: 81 04 00 0c lwz r8,12(r4) 94164: 91 49 00 0c stw r10,12(r9) 94168: 81 44 00 10 lwz r10,16(r4) 9416c: 91 09 00 10 stw r8,16(r9) 94170: 81 04 00 14 lwz r8,20(r4) 94174: 91 49 00 14 stw r10,20(r9) 94178: 34 a5 ff f8 addic. r5,r5,-8 9417c: 81 44 00 18 lwz r10,24(r4) 94180: 91 09 00 18 stw r8,24(r9) 94184: 81 04 00 1c lwz r8,28(r4) 94188: 91 49 00 1c stw r10,28(r9) 9418c: 38 69 00 20 addi r3,r9,32 94190: 41 82 00 90 beq- 94220 <_wordcopy_fwd_aligned+0x108> 94194: 38 84 00 20 addi r4,r4,32 94198: 81 44 00 00 lwz r10,0(r4) 9419c: 91 03 00 00 stw r8,0(r3) 941a0: 7c 69 1b 78 mr r9,r3 941a4: 81 04 00 04 lwz r8,4(r4) 941a8: 91 43 00 04 stw r10,4(r3) 941ac: 81 44 00 08 lwz r10,8(r4) 941b0: 91 09 00 08 stw r8,8(r9) 941b4: 81 04 00 0c lwz r8,12(r4) 941b8: 91 49 00 0c stw r10,12(r9) 941bc: 4b ff ff ac b 94168 <_wordcopy_fwd_aligned+0x50> 941c0: 81 04 00 00 lwz r8,0(r4) 941c4: 39 23 ff f8 addi r9,r3,-8 941c8: 38 84 ff fc addi r4,r4,-4 941cc: 38 a5 00 01 addi r5,r5,1 941d0: 4b ff ff dc b 941ac <_wordcopy_fwd_aligned+0x94> 941d4: 60 00 00 00 nop 941d8: 81 44 00 00 lwz r10,0(r4) 941dc: 38 63 ff fc addi r3,r3,-4 941e0: 4b ff ff c0 b 941a0 <_wordcopy_fwd_aligned+0x88> 941e4: 60 00 00 00 nop 941e8: 81 04 00 00 lwz r8,0(r4) 941ec: 38 a5 ff ff addi r5,r5,-1 941f0: 38 84 00 04 addi r4,r4,4 941f4: 4b ff ff a4 b 94198 <_wordcopy_fwd_aligned+0x80> 941f8: 38 a5 00 06 addi r5,r5,6 941fc: 81 44 00 00 lwz r10,0(r4) 94200: 38 84 ff e8 addi r4,r4,-24 94204: 34 a5 ff f8 addic. r5,r5,-8 94208: 81 04 00 1c lwz r8,28(r4) 9420c: 39 23 ff e4 addi r9,r3,-28 94210: 91 49 00 1c stw r10,28(r9) 94214: 38 69 00 20 addi r3,r9,32 94218: 40 82 ff 7c bne+ 94194 <_wordcopy_fwd_aligned+0x7c> 9421c: 60 00 00 00 nop 94220: 91 09 00 20 stw r8,32(r9) 94224: 83 c1 00 18 lwz r30,24(r1) 94228: 38 21 00 20 addi r1,r1,32 9422c: 4e 80 00 20 blr 94230: 81 04 00 00 lwz r8,0(r4) 94234: 39 23 ff e8 addi r9,r3,-24 94238: 38 84 ff ec addi r4,r4,-20 9423c: 38 a5 00 05 addi r5,r5,5 94240: 4b ff ff 38 b 94178 <_wordcopy_fwd_aligned+0x60> 94244: 60 00 00 00 nop 94248: 81 44 00 00 lwz r10,0(r4) 9424c: 39 23 ff ec addi r9,r3,-20 94250: 38 84 ff f0 addi r4,r4,-16 94254: 38 a5 00 04 addi r5,r5,4 94258: 4b ff ff 18 b 94170 <_wordcopy_fwd_aligned+0x58> 9425c: 60 00 00 00 nop 94260: 81 04 00 00 lwz r8,0(r4) 94264: 39 23 ff f0 addi r9,r3,-16 94268: 38 84 ff f4 addi r4,r4,-12 9426c: 38 a5 00 03 addi r5,r5,3 94270: 4b ff fe f8 b 94168 <_wordcopy_fwd_aligned+0x50> 94274: 60 00 00 00 nop The disassembly of the main loop in the vmx code was provided here: http://lists.freedesktop.org/archives/pixman/2015-July/003798.html b170: 39 69 00 10 addi r11,r9,16 b174: 7c 80 48 0c lvsl v4,r0,r9 b178: 38 a9 00 20 addi r5,r9,32 b17c: 7c a0 48 ce lvx v5,r0,r9 b180: 38 c9 00 30 addi r6,r9,48 b184: 7d 87 48 ce lvx v12,r7,r9 b188: 7c c0 58 0c lvsl v6,r0,r11 b18c: 39 29 00 40 addi r9,r9,64 b190: 7c e0 58 ce lvx v7,r0,r11 b194: 7d a7 58 ce lvx v13,r7,r11 b198: 39 6a 00 10 addi r11,r10,16 b19c: 7d 00 28 0c lvsl v8,r0,r5 b1a0: 7d 20 28 ce lvx v9,r0,r5 b1a4: 7c 27 28 ce lvx v1,r7,r5 b1a8: 38 aa 00 20 addi r5,r10,32 b1ac: 7d 40 30 0c lvsl v10,r0,r6 b1b0: 7d 60 30 ce lvx v11,r0,r6 b1b4: 7c 07 30 ce lvx v0,r7,r6 b1b8: 38 ca 00 30 addi r6,r10,48 b1bc: 11 85 61 2b vperm v12,v5,v12,v4 b1c0: 11 a7 69 ab vperm v13,v7,v13,v6 b1c4: 10 29 0a 2b vperm v1,v9,v1,v8 b1c8: 10 0b 02 ab vperm v0,v11,v0,v10 b1cc: 7d 80 51 ce stvx v12,r0,r10 b1d0: 39 4a 00 40 addi r10,r10,64 b1d4: 7d a0 59 ce stvx v13,r0,r11 b1d8: 7c 20 29 ce stvx v1,r0,r5 b1dc: 7c 00 31 ce stvx v0,r0,r6 b1e0: 42 00 ff 90 bdnz+ b170 <vmx_blt.part.4+0x260> And I already mentioned that I was not particularly happy about how unaligned loads are implemented on big endian powerpc systems. However even such non-perfect VMX code was still faster than non-VMX code in glibc-2.15 on my Plasytation3. I'll try to update glibc to a more recent version 2.20 (this system has not been updated for years) and check if it makes any difference. Either way, Playstation3/Cell is a dead end and I doubt that there are many of them in the whole world still capable of running Linux (this functionality had been officially crippled by firmware updates from Sony). Your results from POWER7/POWER8 are surely a lot more relevant. PS. We might want to revisit the SSE2 code and check if having the blt is justified there (benchmark it against the memcpy from modern glibc). -- Best regards, Siarhei Siamashka _______________________________________________ Pixman mailing list Pixman@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/pixman