Oh, and BTW,
I already tried your fastest example last week and got a 50x speedup, but
that works only for mops, so ...

Daniel Grunblatt.


On Mon, 24 Dec 2001, Nicholas Clark wrote:

> On Fri, Dec 21, 2001 at 12:03:51AM +0000, Tom Hughes wrote:
>
> > It looks like it is going to need some work before it can work for
> > other instruction sets though, at least for RISC systems where the
> > operands are typically encoded with the opcode as part of a single
> > word and the range of immediate constants is often restricted.
> >
> > I'm thinking it will need some way of indicating field widths and
> > shifts for the operands and opcode so they can be merged into an
> > instruction word and also some way of handling a constant pool so
> > that arbitrary addresses can be loaded using PC relative loads.
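
To make the "field widths and shifts" idea concrete, here is a minimal sketch in Python. The field layout (a 6-bit opcode, two 5-bit register fields and a 16-bit immediate in a 32-bit word) and the names FIELDS and encode are made up for illustration; they are not taken from Parrot's JIT or from any particular chip.

```python
# Hypothetical layout of a 32-bit RISC instruction word.
# Each field is described by (shift, width); all names are illustrative only.
FIELDS = {
    "opcode": (26, 6),
    "rd": (21, 5),
    "rs": (16, 5),
    "imm": (0, 16),
}


def encode(**values):
    """Merge the given field values into one instruction word."""
    word = 0
    for name, value in values.items():
        shift, width = FIELDS[name]
        # Restricted immediate range: reject anything that does not fit.
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name}={value} does not fit in {width} bits")
        word |= value << shift
    return word


if __name__ == "__main__":
    print(hex(encode(opcode=8, rd=1, rs=2, imm=100)))
```

A value that fails the range check (a full 32-bit address in the 16-bit immediate, say) is exactly the case where something like the constant pool mentioned above would be needed.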
>
> Another thing that struck me on reading it was:
>
> =item C<B<&IR>>I<n>
>
> Place the address of the C<INTVAL> register specified in the I<n>th argument.
>
>
> RISC chips have lots of general purpose registers. It's likely that there
> will be enough spare that several can be used to map to parrot registers.
> Say 4 are available, it would be useful to be able to say that an op
> requires the value of rN and rM, and modifies rD. The JIT compiler would make
> a sandwich with the code to read in N and M into two of the real CPU registers,
> the op filling, and then some more code to write D back to memory.
> However, if the JIT can see that N is already in a CPU register from the
> previous op, or that D is going to be used and modified by the next op, it
> can skip or defer some of the memory reads and writes.
>
> [And provided the descriptions are this helpful it doesn't have to do it
> immediately. It becomes possible to write a better optimising JIT that makes
> sandwiches with multiple fillings or even Scooby Snacks, while the initial
> JIT insists that the only recipe available is bread, 1 filling, bread]
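
The sandwich idea can be sketched roughly as follows, in Python. Everything here is an assumption made for illustration: the op description format (name, registers read, registers written), the emit function, and the simplification that there are always enough CPU registers, so nothing is ever spilled. Stores are always written through; only redundant loads are skipped.

```python
# Each op says which Parrot registers it reads and which it writes.
# This description format is hypothetical, not Parrot's actual one.
MOPS_LOOP = [
    ("sub I4, I4, I3", ["I4", "I3"], ["I4"]),
    ("if I4, REDO", ["I4"], []),
]


def emit(ops, cache=False):
    """Return pseudo-instructions: bread (loads), filling (op), bread (stores).

    With cache=True, a load is skipped when the value is already known
    to be sitting in a CPU register from an earlier op.
    """
    out = []
    in_cpu = set()  # Parrot registers currently held in CPU registers
    for name, reads, writes in ops:
        for reg in reads:
            if cache and reg in in_cpu:
                continue  # already loaded or just computed: skip the read
            out.append(f"load {reg}")
            in_cpu.add(reg)
        out.append(name)
        for reg in writes:
            out.append(f"store {reg}")
            in_cpu.add(reg)
    return out


if __name__ == "__main__":
    print(emit(MOPS_LOOP))
    print(emit(MOPS_LOOP, cache=True))
```

With cache=False every op gets its own bread on both sides; with cache=True the second load of I4 disappears because the sub has just left it in a CPU register.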
>
> mops will be fast if
>
> REDO:   sub    I4, I4, I3
>         if     I4, REDO
>
> maps to
>
> REDO:
>         load I4 from memory (which will be in the L1 cache)
>         load I3 from memory
>         I4 = I4 - I3
>         store I4 to memory
>
>         load I4 from memory
>         is it nonzero?
>         goto REDO if so
>
>
> it will be slightly faster if it maps to
>
> REDO:
>         load I4 from memory (which will be in the L1 cache)
>         load I3 from memory
>         I4 = I4 - I3
>         store I4 to memory
>
>         # I4 still in a CPU register
>         is it nonzero?
>         goto REDO if so
>
> and faster still if the JIT can see how to push things out of the loop:
>
>         load I4 from memory
>         load I3 from memory
> REDO:
>         I4 = I4 - I3
>
>         is it nonzero?
>         goto REDO if so
>
>         store I4 to memory
>
> (does threading mess this idea up?)
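
To get a feel for how much memory traffic each of the three mappings saves, here is a small Python simulation. It is only a sketch: the run function and the variant names are invented for this example, it assumes that "if I4, REDO" branches while I4 is nonzero, and it assumes I4 starts as a positive multiple of I3 so that the loop terminates.

```python
def run(variant, i4, i3):
    """Simulate one mapping of the mops loop; return (final I4, loads, stores)."""
    mem = {"I4": i4, "I3": i3}  # Parrot registers living in memory
    loads = stores = 0

    if variant == "hoisted":
        # Loads before the loop, store after it.
        a, b = mem["I4"], mem["I3"]
        loads += 2
        while True:
            a -= b
            if a == 0:
                break
        mem["I4"] = a
        stores += 1
        return mem["I4"], loads, stores

    while True:
        a, b = mem["I4"], mem["I3"]
        loads += 2
        a -= b
        mem["I4"] = a
        stores += 1
        if variant == "naive":
            a = mem["I4"]  # reload I4 just for the test
            loads += 1
        # variant == "reuse": I4 is still in a CPU register
        if a == 0:
            break
    return mem["I4"], loads, stores


if __name__ == "__main__":
    for variant in ("naive", "reuse", "hoisted"):
        print(variant, run(variant, 10, 1))
```

For I4 = 10 and I3 = 1 this counts 30 loads and 10 stores for the first mapping, 20 loads and 10 stores for the second, and 2 loads and 1 store for the hoisted one. It says nothing about the threading question, since the simulation is single-threaded.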
>
> Nicholas Clark
>
