Hi Steve and all,

| The approach of using C Compiler generated code rather than writing a
| full compiler appeals to me:
| http://www.csc.uvic.ca/~csc586a/papers/ertlgregg04.pdf

> From: Steve Blackburn <[EMAIL PROTECTED]>
> Date: Tue, 24 May 2005 21:08:05 +1000

> >>They automatically build themselves
> >>simple JIT backends (by extracting fragments produced by the ahead of
> >>time compiler).  This sounds like a great way to achieve portability
> >>while achiving better performance than a conventional interpreter.

> >I guess it's a bit better or just comparable with a good interpreter.
>
> They say it is a lot better: "speedups of up to 1.87 over the fastest
> previous interpreter based technique, and performance comparable to
> simple native code compilers.

Their technique may be reasonable in cases,
but it is better to be aware of:

- On an Athlon processor, speedup (gforth-native over gforth-super) is
  less than those on PowerPC.  It is between 1.06 and 1.49.

- The machine code concatinating technique consumes much memory.
  In my experience, generated machine code is about 10 times larger
  than the original instructions in Java bytecode.

In the paper, the authors have not mentioned memory consumption of the
technique.  We cannot guess how much it is precisely, but it is
possible to be a big drawback.  Yes, we can say the same for the
approach taking a baseline compiler instead of an interpreter (like
Jikes RVM).  Memory consumption of the baseline compiler of Jike RVM
is very interesting.

And the next one does not reduce the value of the technique, but
better to know:

- The compared interpreter (gforth-super) could be improved
  by other techniques including stack caching.
  This means that their machine code concatinating technique may
  benefit from those remaining techniques.


> >In 1998, I have written such a JIT compiler concatinate code fragments
> >generated by GCC for each JVM instruction.
>
> Very interesting!
>
> >Unfortunately, the JIT was
> >slightly slower than an interpreter in Sun's Classic VM. The
> >interpreter was written in x86 assembly language and implements
> >dynamic stack caching with 2 registers and 3 states. It performs much
> >better than the previous interpreter written in C.
> >
> >Then I rewrote the JIT.

In reality, my old JIT compiler in C cannot be strictly compared to
the JDK interpreter in assembly language.

I try clarifying the situation.

                               inst. execution            stack handling
(1)Ertl&Gregg's gforth-super   direct threading      <2>  TOS in a register <2>
(2)Ertl&Gregg's gforth-native  concat'd machine code <1>  TOS in a register <2>
(3)JDK interpreter in C        switch threading      <3>  memory            <3>
(4)JDK interpreter in asm      direct threading      <2>  stack caching     <1>
(5)Shudo's first JIT           concat'd machine code <1>  memory            <3>
(6)Shudo's lightweight JIT     concat'd macihne code <1>  stack caching     <1>

Gforth-super (1) is the Ertl&Gregg's fastest interpreter compared to
the machine code concatinating technique (2).
(3) is the only interpreter which an ancient JDK 1.0.2 had.
(4) is an interpreter of JDK 1.1 written in assembly language.
(5) is my first JIT compiler which concatinates machine code
for each VM instruction.
(6) is shuJIT, the JIT compiler re-written after (5).

<1>-<3> mean the order of expected performance.
Machine code concatination proposed in the Ertl&Gregg's paper is
marked with <1> because it is the best in those in this table
for VM instruction execution.
(4) and (6) uses 2 machine registers for stack caching and
Ertl&Gregg's implementations uses one register
to cache the top of stack (TOS).

Ertl&Gregg's paper says that gforth-native (2) provided speedups up to
1.87 over gforth-super (1).

I wrote in a previous posting that (5) was slower than (4), but (5)
was inferior to (4) in stack handling. It is possible to be the reason
of the slowness and then I cannot say that the machine code
concatinating technique is not necessarily faster than a good
interpreter like (4).


The followings are other facts I experienced:

- Shudo's lightweight JIT (6) generates about 10 times larger native code
  compared with the original Java bytecode instructions.
  This is possible to be a drawback of an approach taking a baseline
  compiler instead of an interpreter.

  Does the Ertl&Gregg's machine concatinating technique suffer it?
  How about a baseline compiler of Jikes RVM?

- An interpreter of JDK 1.1 written in assembly (4) executed
  a numerical benchmark, Linpack about twice as fast as
  the previous interpreter in C (3).

  A lesson here is that an interpreter is often regarded as just slow
  but a faster one and a slower one are very different in performance.

- Register utilization is still important even for an interpreter.
  Interpreters (1) and (4), and lightweight JIT compilers (2) and (6)
  utilizes one or two machine registers to cache values around TOS.
  It is also important to map VM registers (e.g. PC) onto machine registers.

  This means a good interpreter cannot be implemented in a completely
  portable way.

  Note that I do not know how Ertl&Gregg's implementations (1) and (2)
  use a machine register to keep TOS.  I guess they have a little
  assembly code.


  Kazuyuki Shudo        [EMAIL PROTECTED]       http://www.shudo.net/

Reply via email to