Hi Stefan,

Just my idea about an assembler in Scheme. Sounds interesting. If it's done properly, it can be very promising to use scheme itself to directly emit machine instructions. This would also be interesting for meta compilation in the future (think of aiding GCC).

So you are thinking about an assembler for x86? Perhaps I can help out on this one. I would like to do this part, as I haven't been able to aid on other parts besides voicing my ideas (anyways, I am on embedded development these days.)

The only discussion is the syntax I believe, I mean, should it be AT&T like, Intel like, or leave this domain and do something new. I would go for instructions like this (using macros):

(let ((target :x686))
  (assemble target
   ((mov long 100 EAX)
    (mov long 200 EBX)
    (add long EBX EAX))))

Giving back the native machine code instructions. Perhaps special constructions can be made to return partially complete instructions (such as missing labels or calls to guile procedures...)

Sjoerd


On 11/12/2012 10:50 PM, Stefan Israelsson Tampe wrote:
Thanks for your mail Noah,

Yea libjit is quite interesting. But playing around with an assembler in scheme I do not want to go back to C or C++ land. The only problem is that we need a GNU scheme assembler and right now I use sbcl's assembler ported to scheme. We could perhaps use weinholts assembler as well in industria if he could sign papers to make it GNU. For the register allocation part I would really like to play a little in scheme to explore the idea you saw from my previous mail in this thread. Again I think it's natural to have this features in scheme and do not want to mess in C land too much.

Am I wrong?

Cheers
Stefan


On Sat, Nov 10, 2012 at 11:49 PM, Noah Lavine <noah.b.lav...@gmail.com <mailto:noah.b.lav...@gmail.com>> wrote:

    Hello,

    I assume "compressed native" is the idea you wrote about in your
    last email, where we generate native code which is a sequence of
    function calls to VM operations.

    I really like that idea. As you said, it uses the instruction
    cache better. But it also fixes something I was worried about,
    which is that it's a lot of work to port an assembler to a new
    architecture, so we might end up not supporting many native
    architectures. But it seems much easier to make an assembler that
    only knows how to make call instructions and branches. So we could
    support compressed native on lots of architectures, and maybe
    uncompressed native only on some.

    If you want a quick way to do compressed native with reasonable
    register allocation, GNU libjit might work. I used it a couple
    years ago for a JIT project that we never fully implemented. I
    chose it over GNU Lightning specifically because it did register
    allocation. It implements a full assembler, not just calls, which
    could also be nice later.

    Noah



    On Sat, Nov 10, 2012 at 5:06 PM, Stefan Israelsson Tampe
    <stefan.ita...@gmail.com <mailto:stefan.ita...@gmail.com>> wrote:

        I would like to continue the discussion about native code.

        Some facts are,
        For example, consider this
        (define (f x) (let loop ((s 0) (i 0)) (if (eq? i x) s (loop (+
        s i) (+ i 1)))))

        The timings for (f 100000000)  ~ (f 100M) is

        1) current vm                 : 2.93s
        2) rtl                              : 1.67s
        3) compressed native     : 1.15s
        4) uncompressed native : 0.54s

        sbcl = compressed nativ + better register allocations (normal
        optimization level) : 0.68s

        To note is that for this example the call overhead is close to
        5ns per iteration and meaning that
        if we combined 4 with better register handling the potential
        is to get this loop to run at 0.2s which means
        that the loop has the potential of running 500M iterations in
        one second without sacrifying safety and not
        have a extraterestial code analyzer. Also to note is that the
        native code for the compressed native is smaller then the
        rtl code by some factor and if we could make use of registers
        in a better way we would end up with even less overhead.

        To note is that compressed native is a very simple mechanism
        to gain some speed and also improve on memory
        usage in the instruction flow, Also the assembler is very
        simplistic and it would not be to much hassle to port a new
        instruction format to that environment. Also it's probably
        possible to handle the complexity of the code in pure C
        for the stubs and by compiling them in a special way make sure
        they output a format that can be combined
        with the meta information in special registers needed to make
        the execution of the compiled scheme effective.

        This study also shows that there is a clear benefit to be able
        to use the computers registers, and I think this is the way
        you would like the system to behave in the end. sbcl does this
        rather nicely and we could look at their way of doing it.

        So, the main question now to you is how to implement the
        register allocations? Basic principles of register allocation
        can be gotten out from the internet, I'm assure of, but the
        problem is how to handle the interaction with the helper
        stubs. That is
        something i'm not sure of yet.

        A simple solution would be to assume that the native code have
        a set of available registers r1,...,ri and then force the
        compilation of the stubs to treat the just like the registers
        bp, sp, and bx. I'm sure that this is possible to configure in
        gcc.

        So the task for me right now is to find out more how to do
        this, if you have any pointers or ideas, please help out.

        Cheers
        Stefan






        On Sat, Nov 10, 2012 at 3:41 PM, Stefan Israelsson Tampe
        <stefan.ita...@gmail.com <mailto:stefan.ita...@gmail.com>> wrote:

            Hi all,

            After talking with Mark Weaver about his view on native
            code, I have been pondering how to best model our needs.

            I do have a framework now that translates almost all of
            the rtl vm directly to native code and it do shows a speed
            increase of say 4x compared to runing a rtl VM. I can also
            generate rtl code all the way from guile scheme right now
            so It's pretty easy to generate test cases. The problem
            that Mark point out to is that we need to take care to not
            blow the instructuction cache. This is not seen in these
            simple examples but we need larger code bases to test out
            what is actually true. What we can note though is that I
            expect the size of the code to blow up with a factor of
            around 10 compared to the instruction feed in the rtl code.

            One interesting fact is that SBCL does fairly well by
            basically using the native instruction as the instruction
            flow to it's VM. For example if it can deduce that a +
            operation works with fixnums it simply compiles that as a
            function call to a general + routine e.g. it will do a
            long jump to the + routine, do the plus, and longjump back
            essentially dispatching general instructions like + * /
            etc, directly e.g. sbcl do have a virtual machine, it just
            don't to table lookup to do the dispatch, but function
            call's in stead. If you count longjumps this means that
            the number of jumps for these instructions are double that
            of using the original table lookup methods. But for
            calling functions and returning functions the number of
            longjumps are the same and moving local variables in place
            , jumping  is really fast.

            Anyway, this method of dispatching would mean a fairly
            small footprint with respect to the direct assembler.
            Another big chunk of code that we can speedup without to
            much bloat in the instruction cache is the lookup of
            pairs, structs and arrays, the reason is that in many
            cases we can deduce at compilation so much that we do not
            need to check the type all the time but can safely lookup
            the needed infromation.

            Now is this method fast? well, looking a the sbcl code for
            calculating 1+ 2 + 3 + 4 , (disassembling it) I see that
            it do uses the mechanism above, it manages to sum 150M
            terms in one second, that's quite a feat for a VM with no
            JIT. The same with the rtl VM is 65M.

            Now, sbcl's compiler is quite matured and uses native
            registers quite well which explains one of the reasons why
            the speed. My point is though that we can model
            efficiently a VM by call's and using the native
            instructions and a instructions flow.

            Regards Stefan





Reply via email to