Oh, and BTW, I already tried your fastest example last week and got a 50x speedup, but that works only for mops, so ...

Daniel Grunblatt.

On Mon, 24 Dec 2001, Nicholas Clark wrote:

> On Fri, Dec 21, 2001 at 12:03:51AM +0000, Tom Hughes wrote:
>
> > It looks like it is going to need some work before it can work for
> > other instruction sets though, at least for RISC systems where the
> > operands are typically encoded with the opcode as part of a single
> > word and the range of immediate constants is often restricted.
> >
> > I'm thinking it will need some way of indicating field widths and
> > shifts for the operands and opcode so they can be merged into an
> > instruction word and also some way of handling a constant pool so
> > that arbitrary addresses can be loaded using PC relative loads.
>
> Another thing that struck me on reading it was:
>
> =item C<B<&IR>>I<n>
>
> Place the address of the C<INTVAL> register specified in the I<n>th argument.
>
>
> RISC chips have lots of general purpose registers. It's likely that there
> will be enough spare that several can be used to map to parrot registers.
> Say 4 are available, it would be useful to be able to say that an op
> requires the value of rN and rM, and modifies rD. The JIT compiler would make
> a sandwich with the code to read in N and M into two of the real CPU registers,
> the op filling, and then some more code to write D back to memory.
> However, if the JIT can see that N is already in memory from the previous
> OP, or D is going to be used and modified by the next op, it can skip, defer
> or whatever some of the memory reads and writes.
>
> [And provided the descriptions are this helpful it doesn't have to do it
> immediately. It becomes possible to write a better optimising JIT that makes
> sandwiches with multiple fillings or even Scooby Snacks, while the initial
> JIT insists that the only recipe available is bread, 1 filling, bread]
>
> mops will be fast if
>
> REDO: sub I4, I4, I3
>       if I4, REDO
>
> maps to
>
> REDO:
>   load I4 from memory (which will be in the L1 cache)
>   load I3 from memory
>   I4 = I4 - I3
>   store I4 to memory
>
>   load I4 from memory
>   is it 0?
>   goto REDO if true
>
>
> it will be slightly faster if it maps to
>
> REDO:
>   load I4 from memory (which will be in the L1 cache)
>   load I3 from memory
>   I4 = I4 - I3
>   store I4 to memory
>
>   # I4 still in a CPU register
>   is it 0?
>   goto REDO if so
>
> and faster still if the JIT can see how to push things out of the loop:
>
>   load I4 from memory
>   load I3 from memory
> REDO:
>   I4 = I4 - I3
>
>   is it 0?
>   goto REDO if so
>
>   store I4 to memory
>
> (does threading mess this idea up?)
>
> Nicholas Clark
>
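
To put rough numbers on the three mappings quoted above, here is a minimal Python sketch that simulates each one and counts the memory traffic. It is not Parrot's JIT: `CountingMemory`, `naive`, `reuse_register`, `hoisted` and `measure` are hypothetical names made up for this illustration, and it assumes that `if I4, REDO` branches back while I4 is non-zero.

```python
class CountingMemory:
    """Stand-in for the interpreter's register file in memory (hypothetical)."""

    def __init__(self, **regs):
        self.regs = dict(regs)
        self.loads = 0
        self.stores = 0

    def load(self, name):
        self.loads += 1
        return self.regs[name]

    def store(self, name, value):
        self.stores += 1
        self.regs[name] = value


def naive(mem):
    # Each op is its own sandwich: load, compute, store, then reload for the test.
    while True:
        i4 = mem.load("I4")
        i3 = mem.load("I3")
        i4 = i4 - i3
        mem.store("I4", i4)
        i4 = mem.load("I4")      # the `if` op reloads I4
        if i4 == 0:              # assumption: loop continues while I4 != 0
            break


def reuse_register(mem):
    # The test reuses the CPU register that still holds I4.
    while True:
        i4 = mem.load("I4")
        i3 = mem.load("I3")
        i4 = i4 - i3
        mem.store("I4", i4)
        if i4 == 0:
            break


def hoisted(mem):
    # Loads are pushed above the loop and the store below it.
    i4 = mem.load("I4")
    i3 = mem.load("I3")
    while True:
        i4 = i4 - i3
        if i4 == 0:
            break
    mem.store("I4", i4)


def measure(strategy, start=5, step=1):
    """Run one strategy; return (final I4, loads, stores)."""
    mem = CountingMemory(I4=start, I3=step)
    strategy(mem)
    return mem.regs["I4"], mem.loads, mem.stores


if __name__ == "__main__":
    for strategy in (naive, reuse_register, hoisted):
        print(strategy.__name__, measure(strategy))
```

For a loop of n iterations the sketch performs 3n loads and n stores in the first mapping, 2n loads and n stores in the second, and 2 loads and 1 store in the third. It says nothing about the threading question, which is exactly the case where keeping I4 only in a CPU register could become visible to another thread.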