Re: compile-rtl
On Sat 26 Jan 2013 13:28, Stefan Israelsson Tampe stefan.ita...@gmail.com writes: Andy Wingo wi...@pobox.com writes: Now when we are tail-calling in rtl, we just fill the argument slots with the new function arguments directly and then tail-call by filling in number of arguments and function. This is very smart and just some simple features added would mean that a lot of translation from one memory location to the other is skipped. I really like how the rtl code handle this but there is an expense. It complicates life a little when we overwrite arguments that are used in the calculation of other arguments. I'm working on how to handle this but I just wanted to point out how nice this design choice is. Thanks! In general at the end you have a parallel register move (or parallel assignment), which has a number of standard solutions out there. This is quite a natural first step. But especially for loops there is a similar tail pattern that probably needs to be optimized better w.r.t. register slots when we compile nativly There are two things that can help here: (1) Tight allocation of registers (VM or native). Registers should be allocated not just in terms of scope but in terms of where they are used -- after their last use they are dead and so the register is available for some other purpose. This can allow the loop variables in a loop can be updated in-place, without shuffling. (2) For the native case, native register allocation. I've worked with linear scan before and it seems pretty good, and we could do it directly on the CPS form; Wimmer et al are able to do it on SSA form, so I don't see why not. Andy -- http://wingolog.org/
Re: compile-rtl
Andy Wingo wi...@pobox.com writes: Now when we are tail-calling in rtl, we just fill the argument slots with the new function arguments directly and then tail-call by filling in number of arguments and function. This is very smart and just some simple features added would mean that a lot of translation from one memory location to the other is skipped. I really like how the rtl code handle this but there is an expense. It complicates life a little when we overwrite arguments that are used in the calculation of other arguments. I'm working on how to handle this but I just wanted to point out how nice this design choice is. Thanks! In general at the end you have a parallel register move (or parallel assignment), which has a number of standard solutions out there. This is quite a natural first step. But especially for loops there is a similar tail pattern that probably needs to be optimized better w.r.t. register slots when we compile nativly /Stefan
Re: compile-rtl
On Sun 14 Oct 2012 17:13, Stefan Israelsson Tampe stefan.ita...@gmail.com writes: potential memory leaks. To let it be as it is designed right now, will mean that it is very difficult looking at the scheme code to see what will be protected from gc or not. Explicit clearing is much better than a stack pointer. We can be much smarter about it. Now when we are tail-calling in rtl, we just fill the argument slots with the new function arguments directly and then tail-call by filling in number of arguments and function. This is very smart and just some simple features added would mean that a lot of translation from one memory location to the other is skipped. I really like how the rtl code handle this but there is an expense. It complicates life a little when we overwrite arguments that are used in the calculation of other arguments. I'm working on how to handle this but I just wanted to point out how nice this design choice is. Thanks! In general at the end you have a parallel register move (or parallel assignment), which has a number of standard solutions out there. call's, here currently we send a list of register positions to the call function and the call function itself will copy those to the argument fields. This is the opposite of the tail-call, in a case like (f 1 3) you will create two temporary variables and copy them later on to the argument position. Here I would like to change the semantic so that we fill in the arguments directly just like in the tail-call. We can't do this unless we reintroduce a stack pointer, I don't think; and anyway reopening the hole for the in-progress frame is pretty nasty for the runtime (the backtrace code in particular). Dunno. I'm hesitant to change this. Also, I would like to use the end of the stack region to compute the function frame. Well even if we use the end of the full frame we will save one move per argument which is nice. This is possible I guess, but again makes tooling a bit difficult. 4. Return values. We talked about this before and there is some concerns how to make this fast. I would like to skip the complication by simply put the return values in the argument positions to the function. Also the number of arguments is mediated to the reciever. I would like to skip the mv-return slot Have you been working off of wip-rtl after the commit, remove return_loc, instead put mv returns on the stack ? Does that not solve the issue for you? After that commit, there will be no more MVRA (though it is still there while the VM is in a transitional state). Andy -- http://wingolog.org/
Re: compile-rtl, II
On Mon 15 Oct 2012 16:05, Stefan Israelsson Tampe stefan.ita...@gmail.com writes: (arg1, return-address, old-frame, program, arg2 ...) This way we do not need to copy the program field at a tail call What would you do for 0 arguments? Also another kind of difficulty will arise, many instructions only take a value of 256 register positions Do you have examples? I think the idea I had in mind was to reserve the last two or three of the 8-bit-addressable registers as temporaries, and emit long-mov shuffles to handle those cases. Andy -- http://wingolog.org/
Re: compile-rtl, II
On Mon, Jan 21, 2013 at 6:39 PM, Andy Wingo wi...@pobox.com wrote: On Mon 15 Oct 2012 16:05, Stefan Israelsson Tampe stefan.ita...@gmail.com writes: (arg1, return-address, old-frame, program, arg2 ...) This way we do not need to copy the program field at a tail call What would you do for 0 arguments? I would just fill in the first argument with SCM_UNSPECIFIED or such and send information to the call that 0 argumnets is filled in, then the compiled code would work smartly and use this field for normal rtl registers. Ayway I did not pursue this at date more than adding some flags to the compiler so that we can turn on this feature and try it out later. Also another kind of difficulty will arise, many instructions only take a value of 256 register positions Do you have examples? I think the idea I had in mind was to reserve the last two or three of the 8-bit-addressable registers as temporaries, and emit long-mov shuffles to handle those cases. To be frank it looks like 95% of the code will never need to think about this issue, so your approach seams sane. Andy -- http://wingolog.org/
compile-rtl
Hi, I'm right now trying to port compile-glil to compile-rtl and I would say that what's hindering me is what design choices to take? 1. The main problem is that we do not have a stack pointer the same way as before. Of cause we still need to store temporary values and we will simply have to use predefined local register indices, that's not hard to implement though. The problem is that we will have memory not cleared that's not going to be GC:ed until the frame is changed. This will cause some potential memory leaks. To let it be as it is designed right now, will mean that it is very difficult looking at the scheme code to see what will be protected from gc or not. And I fear that we will need to handle this in some smart way. The solutions are either, keep a stack pointer in a register and updated it for each push and pop as before. Or do what we do after let forms, clear the variable slots - expensive! What I think is that we need to have the equivalent of a stack pointer in order to properly handle gc'ing. We could still have a register approach and gain quite some speed bump (2x) from using direct addressing in stead of using the middle layer of the stack for everything. My vote is introduce a register sp again! 2. Now when we are tail-calling in rtl, we just fill the argument slots with the new function arguments directly and then tail-call by filling in number of arguments and function. This is very smart and just some simple features added would mean that a lot of translation from one memory location to the other is skipped. I really like how the rtl code handle this but there is an expense. It complicates life a little when we overwrite arguments that are used in the calculation of other arguments. I'm working on how to handle this but I just wanted to point out how nice this design choice is. 3. call's, here currently we send a list of register positions to the call function and the call function itself will copy those to the argument fields. This is the opposite of the tail-call, in a case like (f 1 3) you will create two temporary variables and copy them later on to the argument position. Here I would like to change the semantic so that we fill in the arguments directly just like in the tail-call. Also, I would like to use the end of the stack region to compute the function frame. Well even if we use the end of the full frame we will save one move per argument which is nice. 4. Return values. We talked about this before and there is some concerns how to make this fast. I would like to skip the complication by simply put the return values in the argument positions to the function. Also the number of arguments is mediated to the reciever. I would like to skip the mv-return slot and add a rmove function like, (call pos proc) (rmove pos (i ...)) And it is rmove's responsibility to move the arguments to it's return positions, also it's possible for functions calling functions to just clear the function slot and keep the values for later use by simply increasing the stack to contain the return value(s). this means that we can keep the copying to the minimal (which is one of the big costs). Also keeping it this way lead to quite some smooth handling of code that evaluates a function on the return values, one just need to fill in the function and return ip and evaluate. cool. The downside is for native code where we may transfer things back in computer registers. But the overhead of function call's are of such dignity that the copying of one value is not of importance even for natively compiled functions, The upfront cost of handling functions is pretty high. Note that the improvement's that I suggest have two properties, 1. they are composable, 2. they scale well in the number of arguments and return values. My feeling is that speed bumps due to function overhead can be adressed by a) Inlining b) A well designed interface of user defined instructions c) Develop separate function call mechanisms that a compiler can deduce and use. WDYT /Stefan