Hi Sjoerd,

1. I prefer (mov dst src); this is the order used in the rtl VM code in C, and
I would like to keep a match between the two.

2. I would like to have a prefix for instructions to mark them out, like
    (inst mov dst src)

3. It would be nice to have a default size matching the architecture, e.g.
   (inst mov quad dst src)  is equivalent to (inst mov dst src)

4. I prefer to have an environment like
   (assemble target
      (inst mov b a))
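To make the points above concrete, here is a minimal sketch, in Python purely for illustration (the real thing would be Scheme macros), of an assembler that accepts (mov dst src)-style forms and fills in a default operand size as in point 3. The `assemble` function and register table are hypothetical; the two encodings used (`mov r32, imm32` as 0xB8+rd, `add r/m32, r32` as 0x01 /r) are the actual x86 ones.

```python
# Hypothetical sketch of the proposed (assemble ...) form, written in
# Python for illustration only.  Only two x86 encodings are handled:
# "mov r32, imm32" (0xB8 + reg) and "add r/m32, r32" (0x01 /r).
import struct

REGS = {"eax": 0, "ecx": 1, "edx": 2, "ebx": 3}  # partial register table

def assemble(insts, default_size="long"):
    """Assemble (op, dst, src) or (op, size, dst, src) tuples.

    A missing size defaults to default_size, mirroring point 3 above."""
    out = bytearray()
    for inst in insts:
        if len(inst) == 3:                 # size omitted: use the default
            op, dst, src = inst
            size = default_size
        else:
            op, size, dst, src = inst
        assert size == "long"              # sketch handles 32-bit only
        if op == "mov" and isinstance(src, int):
            out.append(0xB8 + REGS[dst])   # mov r32, imm32
            out += struct.pack("<i", src)  # little-endian immediate
        elif op == "add":
            out.append(0x01)               # add r/m32, r32
            out.append(0xC0 | (REGS[src] << 3) | REGS[dst])  # ModRM byte
        else:
            raise NotImplementedError(op)
    return bytes(out)

code = assemble([("mov", "long", "eax", 100),
                 ("mov", "ebx", 200),      # size defaults to "long"
                 ("add", "eax", "ebx")])
print(code.hex())  # b864000000bbc800000001d8
```

The same shape carries over to a Scheme `assemble` macro: the size token is optional in each `(inst ...)` form and the expander supplies the architecture default.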


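As a side note on the "compressed native" approach discussed further down the thread (native code that is just a sequence of calls to the VM's op handlers), here is a rough Python sketch of the idea, again for illustration only; all the names are made up and the handlers stand in for the C stubs:

```python
# Illustrative contrast between table-dispatch (decode an opcode, look up
# a handler, call it) and "compressed native" (the compiler emits one
# direct call per instruction, so no decode/lookup happens at run time).
def op_add(state): state["s"] += state["i"]   # stand-in VM op handlers
def op_inc(state): state["i"] += 1

TABLE = {0: op_add, 1: op_inc}                # opcode -> handler

def run_vm(bytecode, state, steps):
    """Table-dispatch VM: every instruction pays for a table lookup."""
    for _ in range(steps):
        for opcode in bytecode:
            TABLE[opcode](state)

def compile_compressed(bytecode):
    """'Compressed native': resolve the dispatch once, at compile time."""
    handlers = [TABLE[op] for op in bytecode]
    def run(state, steps):
        for _ in range(steps):
            for h in handlers:                # direct calls, no lookup
                h(state)
    return run

state = {"s": 0, "i": 0}
compile_compressed([0, 1])(state, 5)
print(state)  # {'s': 10, 'i': 5}
```

In the real system the "handlers" would be the C stubs and the straight-line call sequence would be actual machine code, which is why it stays small in the instruction cache compared to fully inlined native code.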

On Thu, Nov 15, 2012 at 11:19 AM, Sjoerd van Leent Privé <
svanle...@gmail.com> wrote:

>  Hi Stefan,
>
> Just my idea about an assembler in Scheme: it sounds interesting. If it's
> done properly, it can be very promising to use Scheme itself to directly
> emit machine instructions. This would also be interesting for meta
> compilation in the future (think of aiding GCC).
>
> So you are thinking about an assembler for x86? Perhaps I can help out on
> this one. I would like to do this part, as I haven't been able to aid on
> other parts besides voicing my ideas (anyways, I am on embedded development
> these days.)
>
> The only open discussion is the syntax, I believe: should it be AT&T-like,
> Intel-like, or should we leave that domain and do something new? I would go
> for instructions like this (using macros):
>
> (let ((target :x686))
>   (assemble target
>    ((mov long 100 EAX)
>     (mov long 200 EBX)
>     (add long EBX EAX))))
>
> Giving back the native machine code instructions. Perhaps special
> constructions can be made to return partially complete instructions (such
> as missing labels or calls to guile procedures...)
>
> Sjoerd
>
>
>
> On 11/12/2012 10:50 PM, Stefan Israelsson Tampe wrote:
>
> Thanks for your mail Noah,
>
> Yea, libjit is quite interesting. But having played around with an assembler
> in Scheme, I do not want to go back to C or C++ land. The only problem is
> that we need a GNU Scheme assembler, and right now I use sbcl's assembler
> ported to Scheme. We could perhaps use Weinholt's assembler from industria
> as well, if he could sign papers to make it GNU. For the register
> allocation part I would really like to play a little in Scheme to explore
> the idea you saw in my previous mail in this thread. Again, I think it's
> natural to have these features in Scheme, and I do not want to mess in C
> land too much.
>
> Am I wrong?
>
> Cheers
> Stefan
>
>
> On Sat, Nov 10, 2012 at 11:49 PM, Noah Lavine <noah.b.lav...@gmail.com>wrote:
>
>> Hello,
>>
>>  I assume "compressed native" is the idea you wrote about in your last
>> email, where we generate native code which is a sequence of function calls
>> to VM operations.
>>
>>  I really like that idea. As you said, it uses the instruction cache
>> better. But it also fixes something I was worried about, which is that it's
>> a lot of work to port an assembler to a new architecture, so we might end
>> up not supporting many native architectures. But it seems much easier to
>> make an assembler that only knows how to make call instructions and
>> branches. So we could support compressed native on lots of architectures,
>> and maybe uncompressed native only on some.
>>
>>  If you want a quick way to do compressed native with reasonable
>> register allocation, GNU libjit might work. I used it a couple years ago
>> for a JIT project that we never fully implemented. I chose it over GNU
>> Lightning specifically because it did register allocation. It implements a
>> full assembler, not just calls, which could also be nice later.
>>
>>  Noah
>>
>>
>>
>> On Sat, Nov 10, 2012 at 5:06 PM, Stefan Israelsson Tampe <
>> stefan.ita...@gmail.com> wrote:
>>
>>> I would like to continue the discussion about native code.
>>>
>>> Some facts first. For example, consider this:
>>> (define (f x) (let loop ((s 0) (i 0)) (if (eq? i x) s (loop (+ s i) (+ i
>>> 1)))))
>>>
>>> The timings for (f 100000000) ~ (f 100M) are:
>>>
>>> 1) current vm            : 2.93s
>>> 2) rtl                   : 1.67s
>>> 3) compressed native     : 1.15s
>>> 4) uncompressed native   : 0.54s
>>>
>>> sbcl = compressed native + better register allocation (normal
>>> optimization level) : 0.68s
>>>
>>> To note: for this example the call overhead is close to 5ns per
>>> iteration, meaning that if we combined 4 with better register handling,
>>> this loop could potentially run in 0.2s. That is, the loop has the
>>> potential of running 500M iterations in one second without sacrificing
>>> safety and without an extraterrestrial code analyzer. Also note that the
>>> native code for the compressed native is smaller than the rtl code by
>>> some factor, and if we could make better use of registers we would end
>>> up with even less overhead.
>>>
>>> Note that compressed native is a very simple mechanism to gain some
>>> speed and also to improve memory usage in the instruction flow. Also,
>>> the assembler is very simplistic, and it would not be too much hassle to
>>> port a new instruction format to that environment. It's probably also
>>> possible to handle the complexity of the code in pure C for the stubs
>>> and, by compiling them in a special way, make sure they output a format
>>> that can be combined with the meta information in the special registers
>>> needed to make execution of the compiled Scheme effective.
>>>
>>> This study also shows that there is a clear benefit in being able to use
>>> the computer's registers, and I think this is the way you would like the
>>> system to behave in the end. sbcl does this rather nicely, and we could
>>> look at their way of doing it.
>>>
>>> So, the main question now to you is how to implement the register
>>> allocation. The basic principles of register allocation can be found on
>>> the internet, I'm sure, but the problem is how to handle the interaction
>>> with the helper stubs. That is something I'm not sure of yet.
>>>
>>> A simple solution would be to assume that the native code has a set of
>>> available registers r1,...,ri and then force the compilation of the
>>> stubs to treat them just like the registers bp, sp, and bx. I'm sure
>>> this is possible to configure in gcc.
>>>
>>> So the task for me right now is to find out more how to do this, if you
>>> have any pointers or ideas, please help out.
>>>
>>> Cheers
>>> Stefan
>>>
>>>
>>>
>>>
>>>
>>>
>>> On Sat, Nov 10, 2012 at 3:41 PM, Stefan Israelsson Tampe <
>>> stefan.ita...@gmail.com> wrote:
>>>
>>>> Hi all,
>>>>
>>>> After talking with Mark Weaver about his view on native code, I have
>>>> been pondering how to best model our needs.
>>>>
>>>> I do have a framework now that translates almost all of the rtl vm
>>>> directly to native code, and it does show a speed increase of about 4x
>>>> compared to running the rtl VM. I can also generate rtl code all the
>>>> way from Guile Scheme right now, so it's pretty easy to generate test
>>>> cases. The problem that Mark points out is that we need to take care
>>>> not to blow the instruction cache. This is not seen in these simple
>>>> examples, but we need larger code bases to test what is actually true.
>>>> What we can note, though, is that I expect the size of the code to blow
>>>> up by a factor of around 10 compared to the instruction feed in the rtl
>>>> code.
>>>>
>>>> One interesting fact is that SBCL does fairly well by basically using
>>>> the native instruction stream as the instruction flow of its VM. For
>>>> example, if it can deduce that a + operation works with fixnums, it
>>>> simply compiles that as a function call to a general + routine, i.e. it
>>>> will do a long jump to the + routine, do the plus, and long jump back,
>>>> essentially dispatching general instructions like +, *, / directly. In
>>>> other words, sbcl does have a virtual machine; it just doesn't do a
>>>> table lookup for the dispatch, but uses function calls instead. If you
>>>> count long jumps, this means that the number of jumps for these
>>>> instructions is double that of the original table-lookup method. But
>>>> for calling and returning from functions the number of long jumps is
>>>> the same, and moving local variables into place and jumping is really
>>>> fast.
>>>>
>>>> Anyway, this method of dispatching would mean a fairly small footprint
>>>> compared to the direct assembler. Another big chunk of code that we can
>>>> speed up without too much bloat in the instruction cache is the lookup
>>>> of pairs, structs and arrays; the reason is that in many cases we can
>>>> deduce so much at compilation time that we do not need to check the
>>>> type all the time but can safely look up the needed information.
>>>>
>>>> Now, is this method fast? Well, looking at the sbcl code for
>>>> calculating 1 + 2 + 3 + 4 (disassembling it), I see that it does use
>>>> the mechanism above; it manages to sum 150M terms in one second, which
>>>> is quite a feat for a VM with no JIT. The same loop with the rtl VM
>>>> does 65M.
>>>>
>>>> Now, sbcl's compiler is quite mature and uses native registers quite
>>>> well, which is one reason for its speed. My point, though, is that we
>>>> can model a VM efficiently with calls, using the native instructions as
>>>> the instruction flow.
>>>>
>>>> Regards Stefan
>>>>
>>>
>>>
>>
>
>
