From ChangeLog:
- "Kevin P. Lawton" <[EMAIL PROTECTED]>: Sun Jan 14 21:36:48 EST 2001
More enhancements to dt-testbed/proto2, and more notes in the README.
Modeled computed branches.
[Read only if you're interested in DT stuff]
From dt-testbed/proto2/README:
Dynamic branches. I hacked some guest code which resembles
a dense switch statement like:
  for (macro_loops=DT_MacroLoops; macro_loops>=0; macro_loops--) {
    for (s=31; s>0; s--) {
      switch (s) {
        case 0: WORKLOAD(); break;
        case 1: WORKLOAD(); break;
        case 2: WORKLOAD(); break;
        ...
        case 31: WORKLOAD(); break;
      }
    }
  }
The inner loop races through the case targets. The outer loop just
repeats the inner loop. WORKLOAD() can be either a single NOP
instruction or an add cascade code block, repeated DT_MicroLoops
times, to keep the CPU busy.
My hand-coded guest uses a branch table lookup, like a compiler
would for a dense set of case targets. Such computed branches are
worse for DT than static targets, since the target has to be
computed every time the branch executes.
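For readers who have not seen one, here is a minimal C sketch of a branch table dispatch. It is not the hand-coded guest itself (that is assembly in the testbed); it uses only 4 cases instead of 32 to stay short, and the names hits, case_fn, branch_table, dispatch and run_loops are made up for illustration. The workload is replaced by a counter so the effect is observable.

```c
#include <assert.h>

#define NUM_CASES 4   /* assumption: 4 cases here, the guest code uses 32 */

/* Hypothetical stand-in for WORKLOAD(): count how often each case ran. */
unsigned long hits[NUM_CASES];

typedef void (*case_fn)(void);

/* One small handler per case target. */
#define DEFINE_CASE(n) static void case_##n(void) { hits[n]++; }
DEFINE_CASE(0)
DEFINE_CASE(1)
DEFINE_CASE(2)
DEFINE_CASE(3)

/* The branch table: one code address per case value. */
static const case_fn branch_table[NUM_CASES] = {
    case_0, case_1, case_2, case_3
};

/* Computed branch: the target is read from the table at run time,
 * so it cannot be resolved once and backpatched like a static branch. */
void dispatch(unsigned s)
{
    assert(s < NUM_CASES);
    branch_table[s]();
}

/* Simplified version of the two nested loops: visit every case
 * target once per outer iteration. */
void run_loops(int macro_loops)
{
    for (; macro_loops > 0; macro_loops--)
        for (int s = NUM_CASES - 1; s >= 0; s--)
            dispatch((unsigned)s);
}
```

Calling run_loops(3) takes the indirect call 12 times and leaves every entry of hits at 3. Each of those indirect calls is the kind of branch the DT code has to resolve at run time.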
My first effort generates DT code which always calls the branch
handler assembly routine, which in turn always saves all the guest
state, calls the C function, and restores guest state. This is far
from optimal; an initial hash table lookup could be coded inline
(with the downside of code bloat), or in the assembly shim before
all state is saved. The static branch case didn't force me to do
that yet, because once the target was found, the direct address was
backpatched. So the suboptimal handler case was not used enough
to matter.
Here are the results of the first effort:
workload      microloops   native      DT   factor (DT/native)
================================================================
NOP                    -     0.52    9.69   18.6
add cascade            5     1.87   11.08    5.9
add cascade           10     3.59   12.83    3.6
add cascade          100    27.24   36.43    1.3
As you can see, always diverting branch lookups through the
C handler code is not very efficient. It takes too many non-overhead
instructions (workload) to average out the cost of the expensive computed
branch handling.
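The arithmetic behind that observation can be checked directly from the table. Subtracting native from DT gives nearly the same number in every row (9.17, 9.21, 9.24, 9.19), which suggests the handler cost is a roughly fixed amount added on top of the workload. The sketch below just encodes that arithmetic; the function names are made up for illustration, and the constant-overhead model is an inference from these four rows, not something the README states.

```c
#include <assert.h>

/* Overhead implied by one table row: DT time minus native time. */
double dt_overhead(double native, double dt)
{
    return dt - native;
}

/* Slowdown factor as reported in the table: DT / native. */
double dt_factor(double native, double dt)
{
    assert(native > 0.0);
    return dt / native;
}

/* Factor predicted if the overhead were a constant that does not
 * depend on the workload: (native + overhead) / native. */
double predicted_factor(double native, double overhead)
{
    assert(native > 0.0);
    return 1.0 + overhead / native;
}
```

With an overhead of about 9.2, predicted_factor(27.24, 9.2) comes out near 1.34, in line with the 1.3 measured for 100 microloops, while the same overhead against a native time of 0.52 gives a factor above 18. Only a much heavier workload, or a cheaper handler, brings the factor down.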
Fortunately, the initial hash table lookup and single cache line
oriented search can be done in an assembly shim quite simply.
If there is a miss, then the C code can be called. I'll try
that next for a second effort. I won't be able to get the
factor down as low as with static branches; computed branches
are more of a worst-case scenario.
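The fast-path/slow-path split described above can be sketched in C as follows. This is only an illustration of the idea, not the planned shim (which would be assembly): the table size, the 4 entries per bucket, the hash function, and every name here (entry_t, bucket_t, cache, lookup, slow_lookup, slow_calls) are assumptions. The "translation" done on a miss is a placeholder value, and a translated address of 0 is taken to mean an empty slot.

```c
#include <stddef.h>
#include <stdint.h>

#define BUCKETS 256   /* assumed table size, must be a power of two */
#define WAYS    4     /* assumed entries per bucket (4 x 8 bytes) */

typedef struct {
    uint32_t guest;   /* guest branch target address */
    uint32_t tcode;   /* translated code address, 0 = empty slot */
} entry_t;

typedef struct {
    entry_t way[WAYS];   /* small enough to search within one bucket */
} bucket_t;

bucket_t cache[BUCKETS];
unsigned long slow_calls;   /* how often the expensive path ran */

static unsigned hash(uint32_t guest)
{
    return (guest ^ (guest >> 8)) & (BUCKETS - 1);
}

/* Slow path: stands in for the full C handler, which in the real
 * design runs only after all guest state has been saved. */
uint32_t slow_lookup(uint32_t guest)
{
    bucket_t *b = &cache[hash(guest)];
    uint32_t tcode = guest | 0x80000000u;   /* placeholder, never 0 */
    size_t i;

    slow_calls++;
    for (i = 0; i < WAYS; i++)              /* first free slot... */
        if (b->way[i].tcode == 0)
            break;
    if (i == WAYS)                          /* ...else evict slot 0 */
        i = 0;
    b->way[i].guest = guest;
    b->way[i].tcode = tcode;
    return tcode;
}

/* Fast path: hash, then a short linear search of one bucket.
 * Only a miss falls through to the slow path. */
uint32_t lookup(uint32_t guest)
{
    const bucket_t *b = &cache[hash(guest)];

    for (size_t i = 0; i < WAYS; i++)
        if (b->way[i].tcode != 0 && b->way[i].guest == guest)
            return b->way[i].tcode;
    return slow_lookup(guest);
}
```

The first lookup() of a given guest address goes through slow_lookup(); repeated lookups of the same address are answered from the bucket and leave slow_calls unchanged. For the 32 case targets of the test loop, that would mean the expensive path runs a handful of times instead of on every iteration.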
-Kevin
--
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Kevin Lawton [EMAIL PROTECTED]
MandrakeSoft, Inc. Plex86 developer
http://www.linux-mandrake.com/ http://www.plex86.org/