Hi,

If I understand correctly, stalled-cycles-backend means the current instruction depends on the result of another instruction. High stalling is not surprising for JIT code, because the most frequent instructions are loads and branches; optimizations eliminate most of the arithmetic part of a program. It seems about 25% of all instructions were branches. Actually, I don't see anything unusual here: the IPC is low, but that is also usual for JIT code.
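If the backend stalls are memory-bound, perf should be able to show it directly. A sketch of a possible invocation (the generic event names below are common perf aliases, not guaranteed on every CPU; "perf list" shows what this particular POWER8 kernel actually exposes, and ./runtest is the benchmark binary from the run quoted below):

```shell
# Count generic cache events alongside the benchmark run.
# Event availability and exact names vary by CPU and kernel;
# check "perf list" first and drop any unsupported event.
perf stat -e cache-references,cache-misses \
          -e L1-dcache-loads,L1-dcache-load-misses \
          -e LLC-loads,LLC-load-misses \
          ./runtest
```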
Is it possible to collect cache hit/miss and memory (not cache) load stall statistics? It would be interesting to compare this to x86, but I am not sure we can draw any conclusions from that. Memory dependency and cache statistics would be more important.

Regards,
Zoltan

Frederic Bonnard <[email protected]> wrote:
>Hi Zoltán,
>I had a look with linux perf to see what it reports about statistical profiling
>and it seems that branch prediction is not our issue (branch-misses seems low).
>Here is a simple run with pcre-jit on the slowest pattern:
>---
>ubuntu@vm10:~/bench/regex-test$ perf stat ./runtest
>'mark.txt' loaded. (Length: 20045118 bytes)
>-----------------
>Regex: '.{0,3}(Tom|Sawyer|Huckleberry|Finn)'
>[pcre-jit] time: 1268 ms (3015 matches)
>
> Performance counter stats for './runtest':
>
>       6410.784480 task-clock (msec)        #    1.000 CPUs utilized
>                24 context-switches         #    0.004 K/sec
>                 0 cpu-migrations           #    0.000 K/sec
>             5,409 page-faults              #    0.844 K/sec
>    21,217,505,532 cycles                   #    3.310 GHz                     [66.69%]
>       347,697,697 stalled-cycles-frontend  #    1.64% frontend cycles idle    [50.03%]
>    12,459,150,875 stalled-cycles-backend   #   58.72% backend cycles idle     [50.02%]
>    28,407,626,434 instructions             #    1.34  insns per cycle
>                                            #    0.44  stalled cycles per insn [66.68%]
>     7,000,623,877 branches                 # 1092.007 M/sec                   [49.97%]
>       181,661,003 branch-misses            #    2.59% of all branches         [50.00%]
>
>       6.411626272 seconds time elapsed
>---
>In my terminal, "58.72%" is in purple :) so maybe that is the source of
>slowness. What do you think of that?
>To use linux perf on jitted code, we would need to instrument the code as you
>mentioned previously, and I found this:
>https://github.com/torvalds/linux/blob/master/tools/perf/Documentation/jit-interface.txt
>
>I'm honestly stuck here: I guess each line corresponds to a jitted function
>with its start, length and symbol, but I don't know how to code that in pcre.
>
>Fred
>
>On Sat, 6 Jun 2015 18:33:28 +0200 (CEST), Zoltán Herczeg
><[email protected]> wrote:
>> Hi Frederic,
>>
>> I just realized that the results on that page are two years old, so I
>> updated the engines to their most recent versions and uploaded new results.
>> These results are overall better for all engines (partly because of a newer
>> gcc). The JIT is also improved overall, e.g. the 3rd pattern from the end
>> went down from 190 ms to 27 ms.
>>
>> Regards,
>> Zoltan
>>
>> "Zoltán Herczeg" <[email protected]> wrote:
>> >Hi Frederic,
>> >
>> >thank you for measuring PCRE on PPC. The results are quite interesting.
>> >
>> >It seems to me that the slower patterns are those which require heavy
>> >backtracking, i.e. where fast-forward (skipping) algorithms cannot be
>> >used (or where they match too frequently). The /[a-zA-Z]+ing/ is a good
>> >example of that. Backtracking engines (PCRE, Oniguruma) suffer much more
>> >on PPC than those that read the input once (TRE, RE2). I suspect branch
>> >prediction on x86 is better, but only statistical profilers can prove
>> >that. OProfile is available everywhere and can profile JIT code. That
>> >part is developed by IBM :)
>> >
>> >http://oprofile.sourceforge.net/doc/devel/index.html
>> >
>> >It needs some extra coding though. If you are interested in working on
>> >that, I can help.
>> >
>> >Btw the Tom.{10,25}river|river.{10,25}Tom pattern is twice as fast on PPC
>> >with JIT, if I understand the numbers correctly.
>> >
>> >Regards,
>> >Zoltan
>> >
>> >Frederic Bonnard <[email protected]> wrote:
>> >>Thanks Zoltan for the quick reply.
>> >>- Ok, I think I got it for SSE2.
>> >>- For SIMD instructions, I fear I don't currently have the knowledge
>> >>  for that, but I would be willing to learn/help.
>> >>- A good start would be that 3rd point, about current code and
>> >>  performance status on PPC vs x86.
>> >>  I reused http://sljit.sourceforge.net/regex_perf.html, I hope it is
>> >>  relevant.
>> >>  The pcre directory has been updated to use the latest 8.37 instead
>> >>  of 8.32. My VMs were:
>> >>  * x86-64 4x2.3GHz, 4G memory, on an x86-64 host
>> >>  * ppc64el 4x3GHz, 4G memory, on a P8 host
>> >>  * ppc64 4x3GHz, 4G memory, on a P8 host
>> >>  All were installed with Ubuntu 14.04 LTS.
>> >>  Note that on Ubuntu for ppc64, the default is a 32-bit binary running
>> >>  on a 64-bit kernel, thus the 'runtest' binary is 32-bit. Maybe I'd
>> >>  need to try with a 64-bit binary.
>> >>  Attached are the results for those 3 environments. The goal is not to
>> >>  find who's the best, but rather to find any odd behaviour. Also, let's
>> >>  focus on pcre/pcre-jit.
>> >>  Any comment from expert eyes is welcome.
>> >>  On my side, I see very comparable results between ppc64/ppc64el, so no
>> >>  major issue on ppc64el. However, between x86 and ppc64el, the results
>> >>  for the latter seem overall weaker, all the more so given that the x86
>> >>  VM has a lower frequency.
>> >>  The results would maybe need more repetitions, and percentages to
>> >>  compare, but I already see some 2x or 3x slower results for pcre-jit:
>> >>  .{0,3}(Tom|Sawyer|Huckleberry|Finn)
>> >>  [a-zA-Z]+ing
>> >>  ^[a-zA-Z]{0,4}ing[^a-zA-Z]
>> >>  [a-zA-Z]+ing$
>> >>  ^[a-zA-Z ]{5,}$
>> >>  ^.{16,20}$
>> >>  "[^"]{0,30}[?!\.]"
>> >>  Tom.{10,25}river|river.{10,25}Tom
>> >>
>> >>  Any special treatment for these that could make the code generated on
>> >>  Power weaker?
>> >>
>> >>  Fred
>> >>
>> >>--
>> >>## List details at https://lists.exim.org/mailman/listinfo/pcre-dev
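P.S. regarding the jit-interface question quoted above: the /tmp/perf-PID.map file is plain text, one "START SIZE symbolname" line per generated code region, with START and SIZE in hex. A minimal sketch of the format only (the addresses and symbol names below are invented for illustration; in PCRE the real values would have to come from the JIT compiler after code generation):

```shell
# Sketch of the perf JIT map format from jit-interface.txt.
# Each line: START SIZE symbolname (START and SIZE in hex, no 0x prefix).
# The addresses and names here are made up for illustration only.
map=/tmp/perf-$$.map
printf '%x %x %s\n' 0x3fff80001000 416 pcre_jit_match       >  "$map"
printf '%x %x %s\n' 0x3fff800011a0 128 pcre_jit_prefix_scan >> "$map"
cat "$map"
```

With such a file in place, perf report should be able to attribute samples in the JIT-compiled regions to the listed symbol names instead of showing unresolved addresses.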
