(You replied to me directly, Kirinn, not to the mail list) Thanks Kirinn. The loop count will likely be only 6 iterations, at least in my test case which is stage 1 of the peephole optimizer in the i386 and x86-64 build of the compiler, consisting of (currently) 36 branches that call identically-structured functions.
In the plan I have currently, I won't need an extra comparison because it would be something like this: cmp (tableaddr), %value je @MatchFound cmovle %midindex, %loindex cmovge %midindex, %hiindex I can see it maybe being a waste on the last iteration, maybe the second-to-last too. It would require some empirical tests though. Gareth aka. Kit On Fri 03/08/18 23:07 , "Kirinn" kir...@mooncore.eu sent: > Each comparison and conditional jump takes 2 instructions immediately, > which is reduced to 1 instruction through macro-op fusion, which means a > single loop iteration becomes maybe 12% more costly (assuming simple > instructions)... but every processor varies, of course. > > Forward jumps are normally predicted to not be taken, so there is no > further execution penalty if it's not an exact match. > > The additional instructions do take up a little bit of space in the > instruction cache, and the conditional jumps may mess up the existing > branch prediction table, causing a presumably inconsequential penalty > slightly afterward. > > The penalty for a branch misprediction is a pipeline flush. According to > Agner Fog's research (https://www.agner.org/optimize/microarchi tecture.pdf > [1]), the penalty varies by processor. Pentium MMX 4-5 cycles, PPro/P2/P3 > 10-20 cycles, P4 25-100+ cycles, Core2 about 15 cycles, Nehalem 17+ cycles, > Sandy/Ivy Bridge 15+ cycles, Haswell+ 15-20 cycles, AMD Bulldozer etc ~20 > cycles, Ryzen ~18 cycles. Evidently, processors are designed to execute 2-5 > instructions per cycle, complicated a lot by micro-op splitting and > pipelining, so we might guesstimate a misprediction penalty to be worth > around 30-100 instructions? > > Assuming case blocks tend to be relatively small, running through without > the conditional checks is probably usually faster. But the only way to be > sure is to test on a bunch of processors... > > ~Kirinn > > On 08/03/2018 08:57 PM, J. Gareth Moreton wrote: > Hi everyone, > > So I'm pondering about attempting to implement the binary search system I > devised for case blocks. To find the correct case label, you need to > iterate a search loop a fixed maximum of ceil(log2(n)) times, where n is > the number of individual case values (excluding "else"). Given that this > is quite a low number and I managed to get the loop itself down to just 8 > instructions (not including the comparison and conditional jump at the > end), this could be unwrapped for low values of ceil(log2(n))... possibly > up to 7 or 8, I'd say. However, if an exact match is found early, how > serious is the penalty of having a conditional jump forward on each > unrolled iteration for x86-64, whether it's taken or not? (This would be > the equivalent of a Break command) > > Gareth aka. Kit > > __________________________________________ _____ > fpc-devel maillist - fpc- de...@lists.freepascal.org > http://lists.freepascal.org/cgi- bin/mailman/listinfo/fpc-devel > > > > Links: > ------ > [1] https://www.agner.org/optimize/microarchit ecture.pdf > > _______________________________________________ fpc-devel maillist - fpc-devel@lists.freepascal.org http://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel