Hi,

If I understand correctly, stalled-cycles-backend means the current instruction depends on the result of another instruction. High stalling is not surprising for JIT code, because the most frequent instructions are loads and branches; optimizations eliminate most of the arithmetic part of a program. It seems about 25% of all instructions were branches. Actually, I don't see anything unusual here: the IPC is low, but that is also usual for JIT code.
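If the backend stalls are memory-bound, perf should be able to show it directly. A sketch of a possible invocation (the generic event names below are common perf aliases, not guaranteed on every CPU; "perf list" shows what this particular POWER8 kernel actually exposes, and ./runtest is the benchmark binary from the run quoted below):

```shell
# Count generic cache events alongside the benchmark run.
# Event availability and exact names vary by CPU and kernel;
# check "perf list" first and drop any unsupported event.
perf stat -e cache-references,cache-misses \
          -e L1-dcache-loads,L1-dcache-load-misses \
          -e LLC-loads,LLC-load-misses \
          ./runtest
```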
Is it possible to collect cache hit/miss and memory (not cache) load stall statistics? It would be interesting to compare this to x86, but I am not sure we can draw any conclusions from that. Memory dependency and cache statistics would be more important.

Regards,
Zoltan

Frederic Bonnard <[email protected]> wrote:
>Hi Zoltán,
>I had a look with linux perf to see what it reports about statistical profiling
>and it seems that branch prediction is not our issue (branch-misses seems low).
>Here is a simple run with pcre-jit on the slowest pattern:
>---
>ubuntu@vm10:~/bench/regex-test$ perf stat ./runtest
>'mark.txt' loaded. (Length: 20045118 bytes)
>-----------------
>Regex: '.{0,3}(Tom|Sawyer|Huckleberry|Finn)'
>[pcre-jit] time: 1268 ms (3015 matches)
>
> Performance counter stats for './runtest':
>
>       6410.784480 task-clock (msec)        #    1.000 CPUs utilized
>                24 context-switches         #    0.004 K/sec
>                 0 cpu-migrations           #    0.000 K/sec
>             5,409 page-faults              #    0.844 K/sec
>    21,217,505,532 cycles                   #    3.310 GHz                     [66.69%]
>       347,697,697 stalled-cycles-frontend  #    1.64% frontend cycles idle    [50.03%]
>    12,459,150,875 stalled-cycles-backend   #   58.72% backend cycles idle     [50.02%]
>    28,407,626,434 instructions             #    1.34  insns per cycle
>                                            #    0.44  stalled cycles per insn [66.68%]
>     7,000,623,877 branches                 # 1092.007 M/sec                   [49.97%]
>       181,661,003 branch-misses            #    2.59% of all branches         [50.00%]
>
>       6.411626272 seconds time elapsed
>---
>In my terminal, "58.72%" is in purple :) so maybe that is the source of
>slowness. What do you think of that?
>To use linux perf on jitted code, we would need to instrument the code as you
>mentioned previously, and I found this:
>https://github.com/torvalds/linux/blob/master/tools/perf/Documentation/jit-interface.txt
>
>I'm honestly stuck here: I guess each line corresponds to a jitted function
>with its start, length and symbol, but I don't know how to code that in pcre.
>
>Fred
>
>On Sat, 6 Jun 2015 18:33:28 +0200 (CEST), Zoltán Herczeg
><[email protected]> wrote:
>> Hi Frederic,
>>
>> I just realized that the results on that page are two years old, so I
>> updated the engines to their most recent versions and uploaded new results.
>> These results are overall better for all engines (partly because of a newer
>> gcc). The JIT is also improved overall, e.g. the 3rd pattern from the end
>> went down from 190 ms to 27 ms.
>>
>> Regards,
>> Zoltan
>>
>> "Zoltán Herczeg" <[email protected]> wrote:
>> >Hi Frederic,
>> >
>> >thank you for measuring PCRE on PPC. The results are quite interesting.
>> >
>> >It seems to me that the slower patterns are those which require heavy
>> >backtracking, i.e. where fast-forward (skipping) algorithms cannot be
>> >used (or where they match too frequently). The /[a-zA-Z]+ing/ is a good
>> >example of that. Backtracking engines (PCRE, Oniguruma) suffer much more
>> >on PPC than those that read the input once (TRE, RE2). I suspect branch
>> >prediction on x86 is better, but only statistical profilers can prove
>> >that. OProfile is available everywhere and can profile JIT code. That
>> >part is developed by IBM :)
>> >
>> >http://oprofile.sourceforge.net/doc/devel/index.html
>> >
>> >It needs some extra coding though. If you are interested in working on
>> >that, I can help.
>> >
>> >Btw the Tom.{10,25}river|river.{10,25}Tom pattern is twice as fast on PPC
>> >with JIT, if I understand the numbers correctly.
>> >
>> >Regards,
>> >Zoltan
>> >
>> >Frederic Bonnard <[email protected]> wrote:
>> >>Thanks Zoltan for the quick reply.
>> >>- Ok, I think I got it for SSE2.
>> >>- For SIMD instructions, I fear I don't currently have the knowledge
>> >>  for that, but I would be willing to learn/help.
>> >>- A good start would be that 3rd point, about current code and
>> >>  performance status on PPC vs x86.
>> >>  I reused http://sljit.sourceforge.net/regex_perf.html, I hope it is
>> >>  relevant.
>> >>  The pcre directory has been updated to use the latest 8.37 instead
>> >>  of 8.32. My VMs were:
>> >>  * x86-64 4x2.3GHz, 4G memory, on an x86-64 host
>> >>  * ppc64el 4x3GHz, 4G memory, on a P8 host
>> >>  * ppc64 4x3GHz, 4G memory, on a P8 host
>> >>  All were installed with Ubuntu 14.04 LTS.
>> >>  Note that on Ubuntu for ppc64, the default is a 32-bit binary running
>> >>  on a 64-bit kernel, thus the 'runtest' binary is 32-bit. Maybe I'd
>> >>  need to try with a 64-bit binary.
>> >>  Attached are the results for those 3 environments. The goal is not to
>> >>  find who's the best, but rather to find any odd behaviour. Also, let's
>> >>  focus on pcre/pcre-jit.
>> >>  Any comment from expert eyes is welcome.
>> >>  On my side, I see very comparable results between ppc64/ppc64el, so no
>> >>  major issue on ppc64el. However, between x86 and ppc64el, the results
>> >>  for the latter seem overall weaker, all the more so given that the x86
>> >>  VM has a lower frequency.
>> >>  The results would maybe need more repetitions, and percentages to
>> >>  compare, but I already see some 2x or 3x slower results for pcre-jit:
>> >>  .{0,3}(Tom|Sawyer|Huckleberry|Finn)
>> >>  [a-zA-Z]+ing
>> >>  ^[a-zA-Z]{0,4}ing[^a-zA-Z]
>> >>  [a-zA-Z]+ing$
>> >>  ^[a-zA-Z ]{5,}$
>> >>  ^.{16,20}$
>> >>  "[^"]{0,30}[?!\.]"
>> >>  Tom.{10,25}river|river.{10,25}Tom
>> >>
>> >>  Any special treatment for these that could make the code generated on
>> >>  Power weaker?
>> >>
>> >>  Fred
>> >>
>> >>--
>> >>## List details at https://lists.exim.org/mailman/listinfo/pcre-dev
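P.S. regarding the jit-interface question quoted above: the /tmp/perf-PID.map file is plain text, one "START SIZE symbolname" line per generated code region, with START and SIZE in hex. A minimal sketch of the format only (the addresses and symbol names below are invented for illustration; in PCRE the real values would have to come from the JIT compiler after code generation):

```shell
# Sketch of the perf JIT map format from jit-interface.txt.
# Each line: START SIZE symbolname (START and SIZE in hex, no 0x prefix).
# The addresses and names here are made up for illustration only.
map=/tmp/perf-$$.map
printf '%x %x %s\n' 0x3fff80001000 416 pcre_jit_match       >  "$map"
printf '%x %x %s\n' 0x3fff800011a0 128 pcre_jit_prefix_scan >> "$map"
cat "$map"
```

With such a file in place, perf report should be able to attribute samples in the JIT-compiled regions to the listed symbol names instead of showing unresolved addresses.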
