Hi Sjoerd,

1. I prefer (mov dst src); this is what's used in the RTL VM code in C,
and I would like to keep a match between the two.

2. I would like to have a prefix for instructions to mark them out,
like (inst mov dst src).

3. It would be nice to have a default size given by the architecture,
e.g. (inst mov quad dst src) would be equivalent to (inst mov dst src).

4. I prefer to have an environment like

    (assemble target
      (inst mov b a))

Roughly, the surface syntax I have in mind is sketched below.
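Only a sketch, to fix ideas; `inst', `assemble', `%default-size' and
`emit-mov' here are made-up stand-ins, not existing Guile code:

    ;; Sketch of the surface syntax only.  `emit-mov' and
    ;; `%default-size' are hypothetical stand-ins.
    (define %default-size 'quad)          ; word size of the target

    (define (emit-mov size dst src)
      ;; Stand-in for a real encoder: just record the instruction.
      (list 'mov size dst src))

    (define-syntax inst
      (syntax-rules (mov)
        ;; Explicit size: (inst mov quad dst src)
        ((_ mov size dst src) (emit-mov 'size 'dst 'src))
        ;; Size omitted: fall back to the architecture default, so
        ;; (inst mov dst src) == (inst mov quad dst src) on x86-64.
        ((_ mov dst src)      (emit-mov %default-size 'dst 'src))))

    (define-syntax assemble
      (syntax-rules ()
        ;; Collect the instructions emitted for TARGET.
        ((_ target body ...) (cons target (list body ...)))))

    ;; (assemble 'x86-64 (inst mov b a))
    ;; => (x86-64 (mov quad b a))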
On Thu, Nov 15, 2012 at 11:19 AM, Sjoerd van Leent Privé
<svanle...@gmail.com> wrote:

> Hi Stefan,
>
> Just my idea about an assembler in Scheme. Sounds interesting. If it's
> done properly, it can be very promising to use Scheme itself to
> directly emit machine instructions. This would also be interesting for
> meta compilation in the future (think of aiding GCC).
>
> So you are thinking about an assembler for x86? Perhaps I can help out
> on this one. I would like to do this part, as I haven't been able to
> aid on other parts besides voicing my ideas (anyway, I am on embedded
> development these days).
>
> The only discussion is the syntax, I believe; I mean, should it be
> AT&T like, Intel like, or should we leave this domain and do something
> new? I would go for instructions like this (using macros):
>
>     (let ((target :x686))
>       (assemble target
>         ((mov long 100 EAX)
>          (mov long 200 EBX)
>          (add long EBX EAX))))
>
> giving back the native machine code instructions. Perhaps special
> constructions could be made to return partially complete instructions
> (such as missing labels, or calls to Guile procedures...).
>
> Sjoerd
>
> On 11/12/2012 10:50 PM, Stefan Israelsson Tampe wrote:
>
> Thanks for your mail Noah,
>
> Yeah, libjit is quite interesting. But playing around with an
> assembler in Scheme, I do not want to go back to C or C++ land. The
> only problem is that we need a GNU Scheme assembler, and right now I
> use SBCL's assembler ported to Scheme. We could perhaps use Weinholt's
> assembler from Industria as well, if he could sign papers to make it
> GNU. For the register allocation part I would really like to play a
> little in Scheme to explore the idea you saw in my previous mail in
> this thread. Again, I think it's natural to have these features in
> Scheme, and I do not want to mess around in C land too much.
>
> Am I wrong?
>
> Cheers
> Stefan
>
> On Sat, Nov 10, 2012 at 11:49 PM, Noah Lavine
> <noah.b.lav...@gmail.com> wrote:
>
>> Hello,
>>
>> I assume "compressed native" is the idea you wrote about in your last
>> email, where we generate native code which is a sequence of function
>> calls to VM operations.
>>
>> I really like that idea. As you said, it uses the instruction cache
>> better. But it also fixes something I was worried about, which is
>> that it's a lot of work to port an assembler to a new architecture,
>> so we might end up not supporting many native architectures. But it
>> seems much easier to make an assembler that only knows how to make
>> call instructions and branches. So we could support compressed native
>> on lots of architectures, and maybe uncompressed native only on some.
>>
>> If you want a quick way to do compressed native with reasonable
>> register allocation, GNU libjit might work. I used it a couple of
>> years ago for a JIT project that we never fully implemented. I chose
>> it over GNU Lightning specifically because it did register
>> allocation. It implements a full assembler, not just calls, which
>> could also be nice later.
>>
>> Noah
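To pin down what such a calls-and-branches assembler needs to do: the
compressed-native emitter is roughly the following. A toy sketch only,
with all names made up; the real stubs are C functions and the output
would be actual native call instructions, not lists:

    ;; Toy sketch of "compressed native": the compiled output is a
    ;; flat run of CALLs into per-opcode stubs, so the assembler only
    ;; needs calls and branches.  All names here are hypothetical.
    (define (stub-address opcode)
      ;; In reality: the address of the C stub implementing OPCODE.
      (list '& opcode))

    (define (compile-compressed rtl-insns)
      ;; One native call per RTL instruction; operands are assumed to
      ;; sit in fixed registers, so every call encodes the same way.
      (map (lambda (insn)
             (list 'call (stub-address (car insn)) (cdr insn)))
           rtl-insns))

    ;; The inner loop of the (f x) example below would come out as:
    ;; (compile-compressed '((br-if-eq i x done)
    ;;                       (add s s i)
    ;;                       (add1 i i)
    ;;                       (br loop)))
    ;; => ((call (& br-if-eq) (i x done)) (call (& add) (s s i)) ...)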
>> On Sat, Nov 10, 2012 at 5:06 PM, Stefan Israelsson Tampe
>> <stefan.ita...@gmail.com> wrote:
>>
>>> I would like to continue the discussion about native code.
>>>
>>> Some facts first. Consider, for example,
>>>
>>>     (define (f x)
>>>       (let loop ((s 0) (i 0))
>>>         (if (eq? i x) s (loop (+ s i) (+ i 1)))))
>>>
>>> The timings for (f 100000000) ~ (f 100M) are:
>>>
>>> 1) current VM:           2.93s
>>> 2) rtl:                  1.67s
>>> 3) compressed native:    1.15s
>>> 4) uncompressed native:  0.54s
>>>
>>> SBCL (= compressed native + better register allocation, normal
>>> optimization level): 0.68s
>>>
>>> To note is that for this example the call overhead is close to 5ns
>>> per iteration, meaning that if we combined 4 with better register
>>> handling, the potential is to get this loop to run in 0.2s. That is,
>>> the loop could do 500M iterations per second, without sacrificing
>>> safety and without an extraterrestrial code analyzer. Also note that
>>> the native code for the compressed native is smaller than the RTL
>>> code by some factor, and if we could make better use of registers we
>>> would end up with even less overhead.
>>>
>>> Note that compressed native is a very simple mechanism to gain some
>>> speed and also to improve memory usage in the instruction flow. The
>>> assembler is very simplistic, and it would not be too much hassle to
>>> port a new instruction format to that environment. It is probably
>>> also possible to handle the complexity of the code in pure C for the
>>> stubs and, by compiling them in a special way, make sure they output
>>> a format that can be combined with the meta information in the
>>> special registers needed to make execution of the compiled Scheme
>>> effective.
>>>
>>> This study also shows that there is a clear benefit in being able to
>>> use the computer's registers, and I think this is the way you would
>>> like the system to behave in the end. SBCL does this rather nicely,
>>> and we could look at their way of doing it.
>>>
>>> So, the main question to you now is how to implement the register
>>> allocation. Basic principles of register allocation can be found on
>>> the internet, I'm sure, but the problem is how to handle the
>>> interaction with the helper stubs. That is something I'm not sure of
>>> yet.
>>>
>>> A simple solution would be to assume that the native code has a set
>>> of available registers r1, ..., ri and then force the compilation of
>>> the stubs to treat them just like the registers bp, sp, and bx. I'm
>>> sure that this is possible to configure in GCC.
>>>
>>> So the task for me right now is to find out more about how to do
>>> this. If you have any pointers or ideas, please help out.
>>>
>>> Cheers
>>> Stefan
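On the "how to implement the register allocation" question: even a toy
linear scan gives the flavor. This is only a sketch under simplifying
assumptions; liveness analysis and real spill heuristics are elided,
and the interval/register names are invented:

    ;; Toy linear-scan allocator.  INTERVALS are (var start end);
    ;; REGS are the registers r1..ri that the stubs leave to us.
    (define (linear-scan intervals regs)
      (let loop ((ivs (sort intervals
                            (lambda (a b) (< (cadr a) (cadr b)))))
                 (active '())   ; (end . reg) pairs currently in use
                 (free regs)
                 (alloc '()))
        (if (null? ivs)
            (reverse alloc)
            (let* ((iv (car ivs))
                   (start (cadr iv))
                   (end (caddr iv))
                   ;; Reclaim registers whose intervals ended
                   ;; before START.
                   (dead (filter (lambda (a) (< (car a) start)) active))
                   (live (filter (lambda (a) (>= (car a) start)) active))
                   (avail (append (map cdr dead) free)))
              (if (null? avail)
                  ;; No register left: spill.  (A real allocator would
                  ;; rather spill the furthest-ending interval.)
                  (loop (cdr ivs) live avail
                        (cons (list (car iv) 'spill) alloc))
                  (loop (cdr ivs)
                        (cons (cons end (car avail)) live)
                        (cdr avail)
                        (cons (list (car iv) (car avail)) alloc)))))))

    ;; With two registers, s and i get r1/r2 and t spills:
    ;; (linear-scan '((s 0 9) (i 0 9) (t 2 3)) '(r1 r2))
    ;; => ((s r1) (i r2) (t spill))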
>>> On Sat, Nov 10, 2012 at 3:41 PM, Stefan Israelsson Tampe
>>> <stefan.ita...@gmail.com> wrote:
>>>
>>>> Hi all,
>>>>
>>>> After talking with Mark Weaver about his view on native code, I
>>>> have been pondering how best to model our needs.
>>>>
>>>> I do have a framework now that translates almost all of the RTL VM
>>>> directly to native code, and it does show a speed increase of about
>>>> 4x compared to running the RTL VM. I can also generate RTL code all
>>>> the way from Guile Scheme right now, so it's pretty easy to
>>>> generate test cases. The problem that Mark points out is that we
>>>> need to take care not to blow the instruction cache. This is not
>>>> seen in these simple examples, but we need larger code bases to
>>>> test what is actually true. What we can note, though, is that I
>>>> expect the size of the code to blow up by a factor of around 10
>>>> compared to the instruction feed in the RTL code.
>>>>
>>>> One interesting fact is that SBCL does fairly well by basically
>>>> using the native instructions as the instruction flow of its VM.
>>>> For example, if it can deduce that a + operation works with
>>>> fixnums, it simply compiles that as a function call to a general +
>>>> routine, i.e. it will do a long jump to the + routine, do the plus,
>>>> and long-jump back, essentially dispatching general instructions
>>>> like + * / etc. directly. That is, SBCL does have a virtual
>>>> machine; it just doesn't do a table lookup for the dispatch, but
>>>> uses function calls instead. If you count long jumps, this means
>>>> that the number of jumps for these instructions is double that of
>>>> the table-lookup method. But for calling and returning from
>>>> functions the number of long jumps is the same, and moving local
>>>> variables into place and jumping is really fast.
>>>>
>>>> Anyway, this method of dispatching would mean a fairly small
>>>> footprint compared to the direct assembler. Another big chunk of
>>>> code that we can speed up without too much bloat in the instruction
>>>> cache is the lookup of pairs, structs and arrays; the reason is
>>>> that in many cases we can deduce so much at compilation time that
>>>> we do not need to check the type every time but can safely look up
>>>> the needed information.
>>>>
>>>> Now, is this method fast? Well, looking at the SBCL code for
>>>> calculating 1 + 2 + 3 + 4 (disassembling it), I see that it does
>>>> use the mechanism above, and it manages to sum 150M terms in one
>>>> second. That's quite a feat for a VM with no JIT. The same loop on
>>>> the RTL VM does 65M.
>>>>
>>>> Now, SBCL's compiler is quite mature and uses native registers
>>>> quite well, which explains part of the speed. My point, though, is
>>>> that we can efficiently model a VM with calls, using the native
>>>> instructions as the instruction flow.
>>>>
>>>> Regards
>>>> Stefan
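In Scheme terms, that SBCL-style dispatch amounts to something like the
following. Illustrative only: the real thing is native call
instructions to assembly routines, not closures, and the names here
are invented:

    ;; Call-based dispatch: the "VM instruction" for + is a direct
    ;; call to one shared routine, not a table-indexed dispatch.
    (define (generic-add a b)
      (if (and (exact-integer? a) (exact-integer? b))
          (+ a b)                    ; fixnum fast path
          (numeric-tower-add a b)))  ; general case, kept out of line

    (define (numeric-tower-add a b)
      ;; Stand-in for the full numeric-tower + with type checks.
      (+ a b))

    ;; Compiled code for (+ (* 2 3) 4) is then, in effect, just
    ;;   call generic-mul
    ;;   call generic-add
    ;; i.e. two long jumps per op (there and back) instead of one
    ;; table lookup, but the same count for call and return.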