Hi Stefan,
Just my idea about an assembler in Scheme. Sounds interesting. If it's
done properly, it can be very promising to use scheme itself to directly
emit machine instructions. This would also be interesting for meta
compilation in the future (think of aiding GCC).
So you are thinking about an assembler for x86? Perhaps I can help out
on this one. I would like to do this part, as I haven't been able to aid
on other parts besides voicing my ideas (anyways, I am on embedded
development these days.)
The only discussion is the syntax I believe, I mean, should it be AT&T
like, Intel like, or leave this domain and do something new. I would go
for instructions like this (using macros):
(let ((target :x686))
(assemble target
((mov long 100 EAX)
(mov long 200 EBX)
(add long EBX EAX))))
Giving back the native machine code instructions. Perhaps special
constructions can be made to return partially complete instructions
(such as missing labels or calls to guile procedures...)
Sjoerd
On 11/12/2012 10:50 PM, Stefan Israelsson Tampe wrote:
Thanks for your mail Noah,
Yea libjit is quite interesting. But playing around with an assembler
in scheme I do not want to go back to
C or C++ land. The only problem is that we need a GNU scheme assembler
and right now I use sbcl's assembler
ported to scheme. We could perhaps use weinholts assembler as well in
industria if he could sign papers to make it GNU. For the register
allocation part I would really like to play a little in scheme to
explore the idea you saw from my previous mail in this thread. Again I
think it's natural to have this features in scheme and do not want to
mess in C land too much.
Am I wrong?
Cheers
Stefan
On Sat, Nov 10, 2012 at 11:49 PM, Noah Lavine <noah.b.lav...@gmail.com
<mailto:noah.b.lav...@gmail.com>> wrote:
Hello,
I assume "compressed native" is the idea you wrote about in your
last email, where we generate native code which is a sequence of
function calls to VM operations.
I really like that idea. As you said, it uses the instruction
cache better. But it also fixes something I was worried about,
which is that it's a lot of work to port an assembler to a new
architecture, so we might end up not supporting many native
architectures. But it seems much easier to make an assembler that
only knows how to make call instructions and branches. So we could
support compressed native on lots of architectures, and maybe
uncompressed native only on some.
If you want a quick way to do compressed native with reasonable
register allocation, GNU libjit might work. I used it a couple
years ago for a JIT project that we never fully implemented. I
chose it over GNU Lightning specifically because it did register
allocation. It implements a full assembler, not just calls, which
could also be nice later.
Noah
On Sat, Nov 10, 2012 at 5:06 PM, Stefan Israelsson Tampe
<stefan.ita...@gmail.com <mailto:stefan.ita...@gmail.com>> wrote:
I would like to continue the discussion about native code.
Some facts are,
For example, consider this
(define (f x) (let loop ((s 0) (i 0)) (if (eq? i x) s (loop (+
s i) (+ i 1)))))
The timings for (f 100000000) ~ (f 100M) is
1) current vm : 2.93s
2) rtl : 1.67s
3) compressed native : 1.15s
4) uncompressed native : 0.54s
sbcl = compressed nativ + better register allocations (normal
optimization level) : 0.68s
To note is that for this example the call overhead is close to
5ns per iteration and meaning that
if we combined 4 with better register handling the potential
is to get this loop to run at 0.2s which means
that the loop has the potential of running 500M iterations in
one second without sacrifying safety and not
have a extraterestial code analyzer. Also to note is that the
native code for the compressed native is smaller then the
rtl code by some factor and if we could make use of registers
in a better way we would end up with even less overhead.
To note is that compressed native is a very simple mechanism
to gain some speed and also improve on memory
usage in the instruction flow, Also the assembler is very
simplistic and it would not be to much hassle to port a new
instruction format to that environment. Also it's probably
possible to handle the complexity of the code in pure C
for the stubs and by compiling them in a special way make sure
they output a format that can be combined
with the meta information in special registers needed to make
the execution of the compiled scheme effective.
This study also shows that there is a clear benefit to be able
to use the computers registers, and I think this is the way
you would like the system to behave in the end. sbcl does this
rather nicely and we could look at their way of doing it.
So, the main question now to you is how to implement the
register allocations? Basic principles of register allocation
can be gotten out from the internet, I'm assure of, but the
problem is how to handle the interaction with the helper
stubs. That is
something i'm not sure of yet.
A simple solution would be to assume that the native code have
a set of available registers r1,...,ri and then force the
compilation of the stubs to treat the just like the registers
bp, sp, and bx. I'm sure that this is possible to configure in
gcc.
So the task for me right now is to find out more how to do
this, if you have any pointers or ideas, please help out.
Cheers
Stefan
On Sat, Nov 10, 2012 at 3:41 PM, Stefan Israelsson Tampe
<stefan.ita...@gmail.com <mailto:stefan.ita...@gmail.com>> wrote:
Hi all,
After talking with Mark Weaver about his view on native
code, I have been pondering how to best model our needs.
I do have a framework now that translates almost all of
the rtl vm directly to native code and it do shows a speed
increase of say 4x compared to runing a rtl VM. I can also
generate rtl code all the way from guile scheme right now
so It's pretty easy to generate test cases. The problem
that Mark point out to is that we need to take care to not
blow the instructuction cache. This is not seen in these
simple examples but we need larger code bases to test out
what is actually true. What we can note though is that I
expect the size of the code to blow up with a factor of
around 10 compared to the instruction feed in the rtl code.
One interesting fact is that SBCL does fairly well by
basically using the native instruction as the instruction
flow to it's VM. For example if it can deduce that a +
operation works with fixnums it simply compiles that as a
function call to a general + routine e.g. it will do a
long jump to the + routine, do the plus, and longjump back
essentially dispatching general instructions like + * /
etc, directly e.g. sbcl do have a virtual machine, it just
don't to table lookup to do the dispatch, but function
call's in stead. If you count longjumps this means that
the number of jumps for these instructions are double that
of using the original table lookup methods. But for
calling functions and returning functions the number of
longjumps are the same and moving local variables in place
, jumping is really fast.
Anyway, this method of dispatching would mean a fairly
small footprint with respect to the direct assembler.
Another big chunk of code that we can speedup without to
much bloat in the instruction cache is the lookup of
pairs, structs and arrays, the reason is that in many
cases we can deduce at compilation so much that we do not
need to check the type all the time but can safely lookup
the needed infromation.
Now is this method fast? well, looking a the sbcl code for
calculating 1+ 2 + 3 + 4 , (disassembling it) I see that
it do uses the mechanism above, it manages to sum 150M
terms in one second, that's quite a feat for a VM with no
JIT. The same with the rtl VM is 65M.
Now, sbcl's compiler is quite matured and uses native
registers quite well which explains one of the reasons why
the speed. My point is though that we can model
efficiently a VM by call's and using the native
instructions and a instructions flow.
Regards Stefan