Re: [pcre-dev] Powerpc optimisation

Frederic Bonnard Wed, 19 Aug 2015 08:33:36 -0700

Hi Zoltán,
I had a look with Linux perf to see what statistical profiling reports,
and it seems that branch prediction is not our issue (branch-misses seems low).
Here is a simple run with pcre-jit on the slowest pattern:
---
ubuntu@vm10:~/bench/regex-test$ perf stat ./runtest 
'mark.txt' loaded. (Length: 20045118 bytes)
-----------------
Regex: '.{0,3}(Tom|Sawyer|Huckleberry|Finn)'
[pcre-jit] time:  1268 ms (3015 matches)


 Performance counter stats for './runtest':

       6410.784480 task-clock (msec)         #    1.000 CPUs utilized
                24 context-switches          #    0.004 K/sec
                 0 cpu-migrations            #    0.000 K/sec
             5,409 page-faults               #    0.844 K/sec
    21,217,505,532 cycles                    #    3.310 GHz                      [66.69%]
       347,697,697 stalled-cycles-frontend   #    1.64% frontend cycles idle     [50.03%]
    12,459,150,875 stalled-cycles-backend    #   58.72% backend  cycles idle     [50.02%]
    28,407,626,434 instructions              #    1.34  insns per cycle
                                             #    0.44  stalled cycles per insn  [66.68%]
     7,000,623,877 branches                  # 1092.007 M/sec                    [49.97%]
       181,661,003 branch-misses             #    2.59% of all branches          [50.00%]

       6.411626272 seconds time elapsed
---
In my terminal, "58.72%" (backend cycles idle) is shown in purple :) so maybe
the backend stalls are the source of the slowness.
What do you think of that?
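
For reference, the ratios perf prints can be re-derived from the raw counters in the perf stat output above. The following is only a small sketch of that arithmetic; the helper names (percent_of, print_derived_metrics) are made up for illustration and are not part of perf or PCRE. Note also that the counters cover the whole ./runtest process (about 6.4 s of task-clock) while the reported match time is 1268 ms, so the ratios presumably describe more than just the jitted matching.

```c
#include <assert.h>
#include <stdio.h>

/* Raw counters copied from the perf stat output above. */
static const double CYCLES         = 21217505532.0;
static const double INSTRUCTIONS   = 28407626434.0;
static const double STALLS_BACKEND = 12459150875.0;
static const double BRANCHES       = 7000623877.0;
static const double BRANCH_MISSES  = 181661003.0;

/* Hypothetical helper: return part/whole as a percentage. */
double percent_of(double part, double whole)
{
    assert(whole > 0.0);
    return 100.0 * part / whole;
}

/* Hypothetical helper: print the derived ratios side by side. */
void print_derived_metrics(void)
{
    printf("insns per cycle      : %.2f\n", INSTRUCTIONS / CYCLES);
    printf("backend cycles idle  : %.2f%%\n",
           percent_of(STALLS_BACKEND, CYCLES));
    printf("branch-misses        : %.2f%%\n",
           percent_of(BRANCH_MISSES, BRANCHES));
}
```

Calling print_derived_metrics() shows the contrast: only a few percent of branches are mispredicted, while more than half of the cycles are counted as backend stalls.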
To use Linux perf on jitted code, we would need to instrument the code as you 
mentioned previously,
and I found this:
https://github.com/torvalds/linux/blob/master/tools/perf/Documentation/jit-interface.txt

I'm honestly stuck here: I guess each line corresponds to a jitted 
function with its start, length
and symbol, but I don't know how to code that in pcre.
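
As a starting point, here is a minimal sketch of writing such a map entry in C, assuming the format described in jit-interface.txt: a file named /tmp/perf-<pid>.map, with one "START SIZE symbolname" line per jitted function, START and SIZE being hex numbers without a 0x prefix. The function names (perf_map_format, perf_map_register) and the symbol name "pcre_jit_0" are hypothetical. How to obtain the address and size of the generated code from PCRE/sljit is not shown here and would need to be hooked in where the JIT code is produced.

```c
#include <assert.h>
#include <inttypes.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: format one map line, "START SIZE symbolname\n".
   Returns the line length, or -1 if the name or buffer is unsuitable. */
int perf_map_format(char *buf, size_t buflen, const void *code,
                    size_t size, const char *name)
{
    int n;

    assert(buf != NULL && name != NULL);
    /* The symbol is the rest of the line, so it must not contain '\n'. */
    if (strchr(name, '\n') != NULL)
        return -1;
    n = snprintf(buf, buflen, "%" PRIxPTR " %zx %s\n",
                 (uintptr_t)code, size, name);
    if (n < 0 || (size_t)n >= buflen)
        return -1;
    return n;
}

/* Hypothetical helper: append one entry to /tmp/perf-<pid>.map.
   Returns 0 on success, -1 on error. */
int perf_map_register(const void *code, size_t size, const char *name)
{
    char path[64];
    char line[256];
    FILE *f;
    int n;

    snprintf(path, sizeof(path), "/tmp/perf-%ld.map", (long)getpid());
    n = perf_map_format(line, sizeof(line), code, size, name);
    if (n < 0)
        return -1;
    f = fopen(path, "a");
    if (f == NULL)
        return -1;
    if (fwrite(line, 1, (size_t)n, f) != (size_t)n) {
        fclose(f);
        return -1;
    }
    return fclose(f) == 0 ? 0 : -1;
}
```

The idea would be to call something like perf_map_register(code_start, code_size, "pcre_jit_0") once per compiled pattern, right after the JIT code is generated, so that perf can attribute samples in that region to a symbol.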

Fred

On Sat, 6 Jun 2015 18:33:28 +0200 (CEST), Zoltán Herczeg <[email protected]> 
wrote:
> Hi Frederic,
> 
> I just realized that results on that page are two years old. So I updated the 
> engines to their most recent versions and uploaded new results. These results 
> are better overall for all engines (partly because of a newer gcc). The JIT 
> has also improved overall, e.g. the time for the third-from-last pattern 
> dropped from 190 ms to 27 ms.
> 
> Regards,
> Zoltan
> 
> "Zoltán Herczeg" <[email protected]> wrote:
> >Hi Frederic,
> >
> >thank you for measuring PCRE on PPC. The results are quite interesting.
> >
> >It seems to me that the slower patterns are those that require heavy 
> >backtracking, i.e. where fast-forward (skipping) algorithms cannot be used 
> >(or they match too frequently). The /[a-zA-Z]+ing/ pattern is a good example 
> >of that. Backtracking engines (PCRE, Oniguruma) suffer much more on PPC than 
> >those that read the input once (TRE, RE2). I suspect branch prediction on x86 
> >is better, but only statistical profilers can prove that. OProfile is available 
> >everywhere, and can profile JIT code. That part is developed by IBM :)
> >
> >http://oprofile.sourceforge.net/doc/devel/index.html
> >
> >It needs some extra coding though. If you are interested in working on that, 
> >I can help.
> >
> >Btw the Tom.{10,25}river|river.{10,25}Tom pattern is twice as fast on PPC 
> >with JIT if I understand the numbers correctly.
> >
> >Regards,
> >Zoltan
> >
> >Frederic Bonnard <[email protected]> wrote:
> >>Thanks Zoltan for the quick reply.
> >>- Ok I think I got it for SSE2.
> >>- For SIMD instructions, I fear I don't currently have the knowledge for
> >>that, but I would be willing to learn/help.
> >>- A good start would be that 3rd point, about current code and performance
> >>  status on PPC vs x86.
> >>  I reused http://sljit.sourceforge.net/regex_perf.html, I hope it is 
> >> relevant.
> >>  pcre directory has been updated to use latest 8.37 instead of 8.32.
> >>  My VMs were :
> >>  * x86-64 4x2.3GHz 4G memory on a x86-64 host
> >>  * ppc64el 4x3GHz 4G memory on a P8 host
> >>  * ppc64 4x3GHz 4G memory on a P8 host
> >>  All were installed with Ubuntu 14.04 LTS.
> >>  Note that on Ubuntu for ppc64, the default is to have 32-bit binaries
> >>  running on a 64-bit kernel, thus the binary 'runtest' is 32-bit. Maybe I
> >>  need to try with a 64-bit binary.
> >>  Attached are the results for those 3 environments. The goal is not to
> >>  find who's the best but rather to find any odd behaviour. Also let's focus
> >>  on pcre/pcre-jit.
> >>  Any comments from expert eyes are welcome.
> >>  On my side, I see very comparable results between ppc64 and ppc64el, so no
> >>  major issue on ppc64el. Now, between x86 and ppc64el, the results for the
> >>  latter seem weaker overall, all the more so since the x86 VM has a lower
> >>  frequency. The results would probably need more repetitions, and percentages
> >>  to compare, but I already see some results that are 2x or 3x slower for
> >>  pcre-jit:
> >>  .{0,3}(Tom|Sawyer|Huckleberry|Finn)
> >>  [a-zA-Z]+ing
> >>  ^[a-zA-Z]{0,4}ing[^a-zA-Z]
> >>  [a-zA-Z]+ing$
> >>  ^[a-zA-Z ]{5,}$
> >>  ^.{16,20}$
> >>  "[^"]{0,30}[?!\.]"
> >>  Tom.{10,25}river|river.{10,25}Tom
> >>
> >>  Is there any special treatment for these that could make the code
> >>  generated on Power weaker?
> >>
> >>  Fred
> >>
> >>-- 
> >>## List details at https://lists.exim.org/mailman/listinfo/pcre-dev 
> >
> >
> >-- 
> >## List details at https://lists.exim.org/mailman/listinfo/pcre-dev 
> 
> 


-- 
## List details at https://lists.exim.org/mailman/listinfo/pcre-dev
