Re: Other interesting papers and research

2005-06-06 Thread David P Grove
[EMAIL PROTECTED] wrote on 06/05/2005 10:48:29 PM:


 - The machine code concatinating technique consumes much memory.
   In my experience, generated machine code is about 10 times larger
   than the original instructions in Java bytecode.
 
 In the paper, the authors have not mentioned memory consumption of the
 technique.  We cannot guess how much it is precisely, but it is
 possible to be a big drawback.  Yes, we can say the same for the
 approach taking a baseline compiler instead of an interpreter (like
 Jikes RVM).  Memory consumption of the baseline compiler of Jike RVM
 is very interesting.

It's platform dependent of course, but on IA32 isn't too horrible.  For 
example, running SPECjvm98 we see a 6.23x expansion from the Jikes RVM 
baseline compiler machine code bytes over bytecode bytes.  One thing to 
note is that a threaded interpreter would see something like a 2-4x 
expansion over normal bytecodes when it converts from bytecodes to its 
internal form (arrays of function pointers). So, a 6x expansion is 
probably only roughly 2x worse than some interpreted systems would 
actually see in practice. You can get this data out of Jikes RVM for your 
platform/program with -X:vm:measureCompilation=true.

--dave




Re: Other interesting papers and research

2005-06-06 Thread shudo
Hi Dave,

 From: David P Grove [EMAIL PROTECTED]

 [EMAIL PROTECTED] wrote on 06/05/2005 10:48:29 PM:

  - The machine code concatinating technique consumes much memory.
In my experience, generated machine code is about 10 times larger
than the original instructions in Java bytecode.
 
  In the paper, the authors have not mentioned memory consumption of the
  technique.  We cannot guess how much it is precisely, but it is
  possible to be a big drawback.  Yes, we can say the same for the
  approach taking a baseline compiler instead of an interpreter (like
  Jikes RVM).  Memory consumption of the baseline compiler of Jike RVM
  is very interesting.

 It's platform dependent of course, but on IA32 isn't too horrible.  For
 example, running SPECjvm98 we see a 6.23x expansion from the Jikes RVM
 baseline compiler machine code bytes over bytecode bytes.

Thanks for giving us such an useful number.
It looks reasonable.

 One thing to
 note is that a threaded interpreter would see something like a 2-4x
 expansion over normal bytecodes when it converts from bytecodes to its
 internal form (arrays of function pointers).

Direct threading interpreters like JDK's one work on plain Java
bytecode and they do not need to expand normal bytecode instructions.
Such expansion may have been required if Java bytecode is not linear
and rather a tree or other complicated form.

Then,

 So, a 6x expansion is
 probably only roughly 2x worse than some interpreted systems would
 actually see in practice.

We have to just say the baseline compiler of Jikes RVM generates 6x
larger native code than the original bytecode instructions.


For Java-written JVM, it seems to be natural to have a baseline
compiler instead of an interpreter.

It looks complicated to have an interpreter for a Java-written JVM. We
hope that the architecture of a JVM (e.g. interpreter or baseline
compiler) is independent of the language for implementing a certain
part of JVM. But there seems to be an implication between them.
Any comment?

  Kazuyuki Shudo[EMAIL PROTECTED]   http://www.shudo.net/


Re: Other interesting papers and research

2005-06-06 Thread Robert Lougher
Hi,

On 6/6/05, [EMAIL PROTECTED] [EMAIL PROTECTED] wrote:
 Hi Dave,
 
  From: David P Grove [EMAIL PROTECTED]
 
  One thing to
  note is that a threaded interpreter would see something like a 2-4x
  expansion over normal bytecodes when it converts from bytecodes to its
  internal form (arrays of function pointers).
 
 Direct threading interpreters like JDK's one work on plain Java
 bytecode and they do not need to expand normal bytecode instructions.
 Such expansion may have been required if Java bytecode is not linear
 and rather a tree or other complicated form.
 

According to my understanding, an indirect threaded interpreter uses
the original bytecode stream.  It's indirect because the handler
address must be looked up via the bytecode.  A direct threaded
interpreter removes this step by placing the actual handler addresses
in the rewritten instruction stream.

For what it's worth, JamVM supports both direct and indirect
threading, with static and dynamic stack-caching respectively.  I seem
to recall an average code size increase of ~4x for JamVM's internal
instruction format but I'll need to recheck my figures to be sure. 
Note, this is assuming a 32-bit architecture.  Handler addresses will
be double on a 64-bit machine, and the code increase over bytecodes
therefore larger.

Rob.


Re: Other interesting papers and research

2005-06-06 Thread shudo
Hi Rob,

 From: Robert Lougher [EMAIL PROTECTED]
 Date: Mon, 6 Jun 2005 14:58:45 +0100

   One thing to
   note is that a threaded interpreter would see something like a 2-4x
   expansion over normal bytecodes when it converts from bytecodes to its
   internal form (arrays of function pointers).
 
  Direct threading interpreters like JDK's one work on plain Java
  bytecode and they do not need to expand normal bytecode instructions.
  Such expansion may have been required if Java bytecode is not linear
  and rather a tree or other complicated form.

 According to my understanding, an indirect threaded interpreter uses
 the original bytecode stream.  It's indirect because the handler
 address must be looked up via the bytecode.

Ah, thanks for the indication.
My wording 'direct threading' was not correct.

Threading (interpreting) techniques I referred as implemented in JDKs
should be called 'token threading', neither direct nor indirect threading
because they work directly on bytecode instructions withought any expansion.
Note that the interpreter provides NEXT routines to for all native
code fragments corresponding to VM instructions.
For JVM, this wording like something threading is not very informative
because direct interpretation of portable bytecode is naturally
'token threading'.

Dave's last posting was based on direct threading technique and his saying
was correct about direct threading but my posting was incorrect in advance.

  Kazuyuki Shudo[EMAIL PROTECTED]   http://www.shudo.net/


Re: Other interesting papers and research

2005-06-05 Thread shudo
Hi Steve and all,

| The approach of using C Compiler generated code rather than writing a
| full compiler appeals to me:
| http://www.csc.uvic.ca/~csc586a/papers/ertlgregg04.pdf

 From: Steve Blackburn [EMAIL PROTECTED]
 Date: Tue, 24 May 2005 21:08:05 +1000

 They automatically build themselves
 simple JIT backends (by extracting fragments produced by the ahead of
 time compiler).  This sounds like a great way to achieve portability
 while achiving better performance than a conventional interpreter.

 I guess it's a bit better or just comparable with a good interpreter.

 They say it is a lot better: speedups of up to 1.87 over the fastest
 previous interpreter based technique, and performance comparable to
 simple native code compilers.

Their technique may be reasonable in cases,
but it is better to be aware of:

- On an Athlon processor, speedup (gforth-native over gforth-super) is
  less than those on PowerPC.  It is between 1.06 and 1.49.

- The machine code concatinating technique consumes much memory.
  In my experience, generated machine code is about 10 times larger
  than the original instructions in Java bytecode.

In the paper, the authors have not mentioned memory consumption of the
technique.  We cannot guess how much it is precisely, but it is
possible to be a big drawback.  Yes, we can say the same for the
approach taking a baseline compiler instead of an interpreter (like
Jikes RVM).  Memory consumption of the baseline compiler of Jike RVM
is very interesting.

And the next one does not reduce the value of the technique, but
better to know:

- The compared interpreter (gforth-super) could be improved
  by other techniques including stack caching.
  This means that their machine code concatinating technique may
  benefit from those remaining techniques.


 In 1998, I have written such a JIT compiler concatinate code fragments
 generated by GCC for each JVM instruction.

 Very interesting!

 Unfortunately, the JIT was
 slightly slower than an interpreter in Sun's Classic VM. The
 interpreter was written in x86 assembly language and implements
 dynamic stack caching with 2 registers and 3 states. It performs much
 better than the previous interpreter written in C.
 
 Then I rewrote the JIT.

In reality, my old JIT compiler in C cannot be strictly compared to
the JDK interpreter in assembly language.

I try clarifying the situation.

   inst. executionstack handling
(1)ErtlGregg's gforth-super   direct threading  2  TOS in a register 2
(2)ErtlGregg's gforth-native  concat'd machine code 1  TOS in a register 2
(3)JDK interpreter in Cswitch threading  3  memory3
(4)JDK interpreter in asm  direct threading  2  stack caching 1
(5)Shudo's first JIT   concat'd machine code 1  memory3
(6)Shudo's lightweight JIT concat'd macihne code 1  stack caching 1

Gforth-super (1) is the ErtlGregg's fastest interpreter compared to
the machine code concatinating technique (2).
(3) is the only interpreter which an ancient JDK 1.0.2 had.
(4) is an interpreter of JDK 1.1 written in assembly language.
(5) is my first JIT compiler which concatinates machine code
for each VM instruction.
(6) is shuJIT, the JIT compiler re-written after (5).

1-3 mean the order of expected performance.
Machine code concatination proposed in the ErtlGregg's paper is
marked with 1 because it is the best in those in this table
for VM instruction execution.
(4) and (6) uses 2 machine registers for stack caching and
ErtlGregg's implementations uses one register
to cache the top of stack (TOS).

ErtlGregg's paper says that gforth-native (2) provided speedups up to
1.87 over gforth-super (1).

I wrote in a previous posting that (5) was slower than (4), but (5)
was inferior to (4) in stack handling. It is possible to be the reason
of the slowness and then I cannot say that the machine code
concatinating technique is not necessarily faster than a good
interpreter like (4).


The followings are other facts I experienced:

- Shudo's lightweight JIT (6) generates about 10 times larger native code
  compared with the original Java bytecode instructions.
  This is possible to be a drawback of an approach taking a baseline
  compiler instead of an interpreter.

  Does the ErtlGregg's machine concatinating technique suffer it?
  How about a baseline compiler of Jikes RVM?

- An interpreter of JDK 1.1 written in assembly (4) executed
  a numerical benchmark, Linpack about twice as fast as
  the previous interpreter in C (3).

  A lesson here is that an interpreter is often regarded as just slow
  but a faster one and a slower one are very different in performance.

- Register utilization is still important even for an interpreter.
  Interpreters (1) and (4), and lightweight JIT compilers (2) and (6)
  utilizes one or two machine registers to cache values around TOS.
  It is also important to map VM registers (e.g. PC) onto machine registers.

  

Re: Other interesting papers and research

2005-05-24 Thread Steve Blackburn

[EMAIL PROTECTED] wrote:


They automatically build themselves
simple JIT backends (by extracting fragments produced by the ahead of
time compiler).  This sounds like a great way to achieve portability
while achiving better performance than a conventional interpreter.
   



I guess it's a bit better or just comparable with a good interpreter.
 

They say it is a lot better: speedups of up to 1.87 over the fastest 
previous interpreter based technique, and performance comparable to 
simple native code compilers. The effort required for retargeting our 
implementation from the 386 to the PPC architecture was less than a 
person day.



In 1998, I have written such a JIT compiler concatinate code fragments
generated by GCC for each JVM instruction.


Very interesting!


Unfortunately, the JIT was
slightly slower than an interpreter in Sun's Classic VM. The
interpreter was written in x86 assembly language and implements
dynamic stack caching with 2 registers and 3 states. It performs much
better than the previous interpreter written in C.

Then I rewrote the JIT.
 

It would be interesting to hear your perspective on Ertl  Gregg's 
approach.  Did they do something you had not done?  Do they have any 
particular insight?  You are in an excellent position to make a critical 
assessment of their work.



I am not very sure which is better for us, having a portable and so-so
baseline compiler or a good interpreter which is possibly less
portable than the compiler. There will be a trade off between memory
consumption, portability and so on.
 

Ideally we will have both as components and the  capacity to choose 
either or both depending on the build we are targetting.


Cheers,

--Steve


Re: Other interesting papers and research

2005-05-23 Thread Ariel Sabiguero Yawelak
Other interesting things that can be achieved are some sorts of high 
performance tunning aspects, which are very interesting, and using gcc 
power might be more interesting than redoing it from scratch, either, at 
the begining of current project, or maybe forever.
An adequate bundle of gcc and harmony might produce a JIT/WAT 
java/bytecode compilation. Moreover, the compilation parameters might be 
tuneable by the JVM administrator and choose between compilation 
speed, compilation performance, memory footprint, etc.
Appart from code-reusing, there is also an adequate sort of abstraction 
that is good here. and concentrating on this, we avoid discussing 
machine level details as we all agree that GCC is portable, performant 
and adequate.
Summing up, I support the idea of a java/bytecode to C compiler that can 
be bundled with gcc. As stated we would gain portability and we can use 
all facilities provided by gcc.


Ariel

Archie Cobbs wrote:


[EMAIL PROTECTED] wrote:


The approach of using C Compiler generated code rather than writing a
full compiler appeals to me:
http://www.csc.uvic.ca/~csc586a/papers/ertlgregg04.pdf

I am curious on how well the approach performs compared to existing 
JITs.



I'm admittedly biased, but the approach of using the C compiler has
some good benefits, mainly in portability. This is especially true for
architectures like x86 that have a complicated instruction set, where
optmization is a subtle art. Though JC uses the C compiler as a WAT
instead of a JIT, it is very portable (to any architecture that GCC
targets) as a result. To the extent that portability is a goal, this
might make sense as an approach to take, at least initially.

-Archie

__ 

Archie Cobbs  *CTO, Awarix*  
http://www.awarix.com







RE: Other interesting papers and research

2005-05-23 Thread Renaud BECHADE

>Summing up, I support the idea of a java/bytecode to C compiler that can
>be bundled with gcc. As stated we would gain portability and we can use
>all facilities provided by gcc.

To me it sounds a bit like gcj... In order to use a low level bytecode as an
intermediate representation, LLVM bytecode (which can be emitted by a
modified version of gcc [may be useful for rapid prototyping]) could be a
realization of your ideas, IMHO. Another way I see it would be to use a
simplified version of Java bytecode to represent low-level instructions (as
the Squeak Smalltalk VM works: only static methods and a restricted class
set is supported, this way it is possible to run the VM in itself, and also
to machine-generate C code from the VM sources. In modern days we might
consider generating LLVM bytecode - or simply use gcj! For this kind of
architecture, for instance [1])

Regards,

RB

[1] not sure about the implications in terms of speed, though. In my point
of view it is better, anyway, to have a working system first and then hook
bestial optimizations latter [once the execution model is simple enough or
rather close enough to a reasonable machine abstraction this might be
somewhat easier/less difficult]

PS: I am favorable to use a "java in java" VM as a plugin to such a system.
(or rather, have a "stock" VM with fast uptimes as bootstrap and then a
"very optimizing" plugin, which does not need to tackle with anything but
code generation itself - if this is really possible)

-Original Message-
From: Ariel Sabiguero Yawelak [mailto:[EMAIL PROTECTED]
Sent: Monday, May 23, 2005 5:06 PM
To: harmony-dev@incubator.apache.org
Subject: Re: Other interesting papers and research

Other interesting things that can be achieved are some sorts of high
performance "tunning" aspects, which are very interesting, and using gcc
power might be more interesting than redoing it from scratch, either, at
the begining of current project, or maybe forever.
An adequate "bundle" of gcc and harmony might produce a JIT/WAT
java/bytecode compilation. Moreover, the compilation parameters might be
"tuneable" by the JVM administrator and choose between compilation
speed, compilation performance, memory footprint, etc.
Appart from code-reusing, there is also an adequate sort of abstraction
that is good here. and concentrating on this, we avoid discussing
machine level details as we all agree that GCC is portable, performant
and adequate.
Summing up, I support the idea of a java/bytecode to C compiler that can
be bundled with gcc. As stated we would gain portability and we can use
all facilities provided by gcc.

Ariel

Archie Cobbs wrote:

 [EMAIL PROTECTED] wrote:

 The approach of using C Compiler generated code rather than writing a
 full compiler appeals to me:
 http://www.csc.uvic.ca/~csc586a/papers/ertlgregg04.pdf

 I am curious on how well the approach performs compared to existing
 JITs.


 I'm admittedly biased, but the approach of using the C compiler has
 some good benefits, mainly in portability. This is especially true for
 architectures like x86 that have a complicated instruction set, where
 optmization is a subtle art. Though JC uses the C compiler as a WAT
 instead of a JIT, it is very portable (to any architecture that GCC
 targets) as a result. To the extent that portability is a goal, this
 might make sense as an approach to take, at least initially.

 -Archie

 __


 Archie Cobbs  *CTO, Awarix*
 http://www.awarix.com



Re: Other interesting papers and research

2005-05-23 Thread shudo
From: Steve Blackburn [EMAIL PROTECTED]

  [EMAIL PROTECTED] wrote:
 
  The approach of using C Compiler generated code rather than writing a
  full compiler appeals to me:
  http://www.csc.uvic.ca/~csc586a/papers/ertlgregg04.pdf
 
  I am curious on how well the approach performs compared to existing 
  JITs.

 They automatically build themselves
 simple JIT backends (by extracting fragments produced by the ahead of
 time compiler).  This sounds like a great way to achieve portability
 while achiving better performance than a conventional interpreter.

I guess it's a bit better or just comparable with a good interpreter.

In 1998, I have written such a JIT compiler concatinate code fragments
generated by GCC for each JVM instruction. Unfortunately, the JIT was
slightly slower than an interpreter in Sun's Classic VM. The
interpreter was written in x86 assembly language and implements
dynamic stack caching with 2 registers and 3 states. It performs much
better than the previous interpreter written in C.

Then I rewrote the JIT.

I am not very sure which is better for us, having a portable and so-so
baseline compiler or a good interpreter which is possibly less
portable than the compiler. There will be a trade off between memory
consumption, portability and so on.

  Kazuyuki Shudo[EMAIL PROTECTED]   http://www.shudo.net/


Re: Other interesting papers and research

2005-05-22 Thread Archie Cobbs

[EMAIL PROTECTED] wrote:

The approach of using C Compiler generated code rather than writing a
full compiler appeals to me:
http://www.csc.uvic.ca/~csc586a/papers/ertlgregg04.pdf

I am curious on how well the approach performs compared to existing JITs.


I'm admittedly biased, but the approach of using the C compiler has
some good benefits, mainly in portability. This is especially true for
architectures like x86 that have a complicated instruction set, where
optmization is a subtle art. Though JC uses the C compiler as a WAT
instead of a JIT, it is very portable (to any architecture that GCC
targets) as a result. To the extent that portability is a goal, this
might make sense as an approach to take, at least initially.

-Archie

__
Archie Cobbs  *CTO, Awarix*  http://www.awarix.com


Re: Other interesting papers and research

2005-05-22 Thread Steve Blackburn

Archie Cobbs wrote:


[EMAIL PROTECTED] wrote:


The approach of using C Compiler generated code rather than writing a
full compiler appeals to me:
http://www.csc.uvic.ca/~csc586a/papers/ertlgregg04.pdf

I am curious on how well the approach performs compared to existing 
JITs.


I'm admittedly biased, but the approach of using the C compiler has
some good benefits, mainly in portability.


As far as I can tell, the technical insight in this paper has nothing to 
do with C per se. It has to do with having a portable ahead of time 
compiler (be it C or Java).  The idea of leveraging a portable ahead of 
time compiler is something that all interpreters do.  The insight here 
is to do it far more agressively.  They automatically build themselves 
simple JIT backends (by extracting fragments produced by the ahead of 
time compiler).  This sounds like a great way to achieve portability 
while achiving better performance than a conventional interpreter.


So long as we have a portable java WAT compiler at our disposal (gcj), I 
think we can apply this neat idea independant of whether we're using C, 
C++ or Java (or fortran for that matter).


--Steve