Re: Register Allocation and Continuations problem definition
I hope that no one misunderstands when I say this. It is not meant to be a criticism, but rather it is my understanding of the problem, put very directly. I may be wrong, and in many ways, I hope am. There is a fundamental flaw in the Parrot architecture. In it's current form, continuations and register allocation cannot coexist. The solution is to do away with the register allocator, or to do away with continuations. This can be done, perhaps by saying that certain functions won't be subject to continuation snapshots or calls. Then register allocation can be done on those routines. Or compiling the whole program with a switch of on or off. When I say do away with the register allocator, I mean that nearly all registers will interfere, so essentially each variable gets a register. The extensible register stack is the best way to implement this strategy. If there are long sections of code with lots of temporary variables, then register allocation might be of some use. But most variables will interfere. Implementing this would be an interesting project. It would be quite different from a standard register allocator. It might even be worth writing a paper on the techniques involved. Turning off full continuations, is a another matter. I don't know what impact that would have on parrot. It would be relatively straightforward to write the allocator with this approach. There are other issues involved, regarding how registers change when called. What is preserved, and how. Matt touched on some of these issues. My main concearn in attempting to implement the register allocator was to find the minimal control flow graph (cfg) which helps to provide the variable interference graph. This is my understanding of the problem. Matt wants this thread to be a problem definition, so these thoughts seem appropriate for this thread. I hope no one is offended by my bluntness, but if this problem is to be solved, it must be understood. Bill Coffman ps: a couple important threads on this issue were http://groups-beta.google.com/group/perl.perl6.internals/browse_frm/thread/3a61fe5e97d17389/0603ff13ca52f0ff?_done=%2Fgroup%2Fperl.perl6.internals%3F_doneTitle=Back+to+topics_doneTitle=Backd#0603ff13ca52f0ff http://groups-beta.google.com/group/perl.perl6.internals/browse_frm/thread/214157bf29879710/4c15aa4a1f56c20a?_done=%2Fgroup%2Fperl.perl6.internals%3F_doneTitle=Back+to+topics_doneTitle=Backd#4c15aa4a1f56c20a On 6/3/05, Matt Fowles [EMAIL PROTECTED] wrote: All~ On 6/3/05, Bill Coffman [EMAIL PROTECTED] wrote: There are several threads in the parrot mailing list that discuss the continuations problem. I hoped to be able to address parrot register allocator again, at some point, but it could be a while before I get to that. In the mean time, search the perl6-internals mailing list for continuations and register allocation. I'm sure you'll find it. I would actually very much like to see this issue moved on. In the hopes of getting everything up to speed quickly I will try and summarize all of the discussions and suggestions that have been made. I am NOT trying to advocate any particular solution (yet), I just want to make sure that everyone is on the same page. I propose that we use this thread to ensure that we are not talking past each other. Discussions about specific solutions should and religion should have a seperate thread. PROBLEM: Continuations can be invoked repeatedly. Whenever a continuation is reinvoked, the registers must be in a defined state or the code will behave incorrectly. BASIC ANALYSIS The actual state of the registers does not matter as long as it is well defined. A) If the registers are defined to be garbage, then every continuation will start by returning them to the state it expects. B) If the registers are defined to be preserved, then every continuation merrily chugs along. Unfortunately these each have issues and as such much debate ensued. SUGGESTED SOLUTIONS 1) Context includes registers + continuations don't have to preserve specifc registers + register allocator can ignore continuations wrt spilling - copying overheard - value based registers (IN) return to old values - pointer based registers (SP) cannot have their pointers moved or require double indirection 2) Return Continuations include registers, non don't + register allocator can remain mostly ignorant + only non return continuations need to worry - promoting a return continuation will force copying as plan (1) 3) register are not restored + simple to explain - register allocator must add many extra interference edges - largely prevents reuse of register 4) variable sized register frames + never have unused registers + no need for a register allocator + could correspond directly to scratchpad - more complicated - no register reuse - large architectural change - more custom allocation (can't pool
Re: register allocation
Bill Coffman [EMAIL PROTECTED] wrote: Hello, * I have the below failed tests. t/library/dumper.t 13 332813 13 100.00% 1-13 These are fixed (assignment collisions in pcc.c) t/op/gc.t 1 256181 5.56% 13 Is very similar to the case below - gc_14 (you might read that one first, it's a bit simpler). The variable Carr2 is assigned to P17. In gc_13.imc:80 the temp C$P0 is also getting this register. Now when the algorithm backtracks over y = choose(arr2) arr2 in P17 is clobbered and we get shift_pmc() not implemented in class 'Closure'. (You can turn on these print statements to see the flow of operations) There is no sign in the source code, that the control flow from calling choose() the second time to calling the fail closure might be a branch target. We basically have a loop (due to the continuation usage), which the register allocator can't be aware of. Inserting a label e.g. lp: y = choose(arr2) and a branch lp after the call to the fail closure hides the problem, because with that loop, Carr2 is preserved. (arr1 is still clobbered, but the algorithm doesn't bactrack there again). t/pmc/sub.t 1 256781 1.28% 78 tb_loop: set P17, P16[K16] ^ Somewhere a SymReg with set 'K' has slipped into the register allocator. Excluding r-set != 'K' in build_reglist():405 fixes that. Now failing too is t/op/gc_14 (and an exception test, which is the same). These 2 are a bit tricky and I don't know, which result is correct - or better there is probably no answer. We have this recursive method: sub b11 method .param int n .local int n1 n1 = n + 1 newsub $P0, .Exception_Handler, catch set_eh $P0 n = self.b11(n1) clear_eh catch: .pcc_begin_return .return n .pcc_end_return .end Cn1 is going into the recursive call, Cn is the return result. Now it depends on the register allocator which register Cn1 and Cn have. When the allocator isn't really clever then it assigns different registers. So Cn is preserved when the exception is fired as the recursion limit is hit. The new allocator assigns both Cn1 and Cn to the same Parrot register. This is technically correct, as the life-span of Cn ends on the line where it's used as a return result of the recursive method call. Therefore we have a piece of code that in the presence of exceptions may behave differently depending on some implementation details. This is AFAIK very the same, as the gcc warning we had until a few days about a possibly clobbered variable used around setjmp (s. src/inter_run.c:55, and the variable offset. (It's a volatile now) This is totally the same variable usage like above's Cn and the result depends on register allocation details. ~Bill leo
Re: register allocation
Bill Coffman wrote: Hello, * I have the below failed tests. I haven't looked into these. Can someone tell me if the tests are broken, or is my allocator broken. I know they don't fail for the current cvs code (as of yesterday). Failed TestStat Wstat Total Fail Failed List of Failed --- t/library/dumper.t 13 332813 13 100.00% 1-13 t/op/gc.t 1 256181 5.56% 13 t/pmc/sub.t 1 256781 1.28% 78 These errors look remarkable the same, when I turn on OPT_SUB that is, when allocate_wanted_regs() is used. And this code did miss registers sets like 'K' (compound keys). Changing imcc/reg_alloc.c:890 to: if (r-color = 0 || r-want_regno == -1 || strchr(ISPN, r-set == 0)) did fix this flaw. leo
Re: register allocation
Interesting. I did a little more research, and checked out a fresh CVS tree. I did nothing more than insert a conflict checker into reg_alloc.c That is, whatever pre-colored registers have been assigned to symbols, I verify that none of them are conflicting. Guess what I got? make[1]: Leaving directory `/usr/home/coffman/dev/parrot/parrot/docs' ./parrot -o runtime/parrot/library/Data/Dumper/Base.pbc runtime/parrot/library/Data/Dumper/Base.imc ** added new here node 1 = defname(S) is colored 5 and neighbor 9 = S5(S) is colored 5 parrot: imcc/reg_alloc.c:200: imc_reg_alloc: Assertion `r-color==-1 || r-color != unit-reglist[y]-color' failed. make: *** [runtime/parrot/library/Data/Dumper/Base.pbc] Aborted The routine you mentioned, allocate_wanted_regs() doesn't assign wanted regs if in conflict, but as far as I can tell, there's no check on the integrity of precolored nodes. Not quite sure what the correct behaviour is for allocator, when it's given conflicts. I saw 'K' compounded key somewhere, but don't know what it is. I know reg_alloc.c pretty well though. Offhand I'd say it's not handled in reg_alloc. Maybe it should be? ~Bill On Fri, 12 Nov 2004 10:26:51 +0100, Leopold Toetsch [EMAIL PROTECTED] wrote: Bill Coffman wrote: Hello, * I have the below failed tests. I haven't looked into these. Can someone tell me if the tests are broken, or is my allocator broken. I know they don't fail for the current cvs code (as of yesterday). Failed TestStat Wstat Total Fail Failed List of Failed --- t/library/dumper.t 13 332813 13 100.00% 1-13 t/op/gc.t 1 256181 5.56% 13 t/pmc/sub.t 1 256781 1.28% 78 These errors look remarkable the same, when I turn on OPT_SUB that is, when allocate_wanted_regs() is used. And this code did miss registers sets like 'K' (compound keys). Changing imcc/reg_alloc.c:890 to: if (r-color = 0 || r-want_regno == -1 || strchr(ISPN, r-set == 0)) did fix this flaw. leo
Re: register allocation
These errors look remarkable the same, when I turn on OPT_SUB that is, when allocate_wanted_regs() is used. And this code did miss registers sets like 'K' (compound keys). Changing imcc/reg_alloc.c:890 to: if (r-color = 0 || r-want_regno == -1 || strchr(ISPN, r-set == 0)) did fix this flaw. leo You're patch looks suspiciously like a typo. strchr(ISPN, r-set == 0)) will always returns true if r-set != 0. This means that the patch effectively disables allocate_wanted_regs(), unless there are certain cases where r-set==0, which I doubt. Try sticking this toward the beginning of imc_reg_alloc(), but after build_interference_graph, like imcc/reg_alloc.c:187 (or so) #ifndef NDEBUG { int x,y; for (x = 0; x unit-n_symbols; x++) { SymReg* r = unit-reglist[x]; for (y = 0; y unit-n_symbols; y++) { if (ig_test(x, y, unit-n_symbols, unit-interference_graph)) { if(r-color!=-1 r-color == unit-reglist[y]-color) fprintf(stderr,node %d = %s(%c) is colored %d and neighbor %d = %s(%c) is colored %d\n, x,r-name,r-set,r-color, y,unit-reglist[y]-name,unit-reglist[y]-set, unit-reglist[y]-color); assert(r-color==-1 || r-color != unit-reglist[y]-color); } } } } #endif And before ... while (todo) {
Re: Register allocation/spilling and register volitility
On Mon, 8 Nov 2004 12:20:19 +0100, Leopold Toetsch [EMAIL PROTECTED] wrote: Jeff Clites [EMAIL PROTECTED] wrote: On Nov 8, 2004, at 1:34 AM, Leopold Toetsch wrote: If frames aren't adjacent, normal argument copying can be done anyway. This would seem to require the same types of runtime checks that you are objecting to below, Not runtime. The register allocator knows the amount of needed non-volatiles and volatile registers. Depending on that a call can place arguments directly into the incoming regs of the caller or not. Then if at runtime copying is needed (because e.g. frames aren't adjacent), the function arguments are copyied. This is fully transparent. I discussed that in item (1) under Further notes. Yep, saw that then ;) It's not needed. I've a better scheme in mind, which addressess efficieny as well as argument passing. And spilling? Well, I'm proposing a variable-sized register frame. With very little additions we could run with more then 32 registers per kind (there are a few bitmasks currently that would need adaptions, but not much). I think this is a great idea. Mostly, I was thinking of the case where there are less than 32 registers needed. The allocator can try to reduce the number of registers that will need to be used. I think that addressing could become an issue is there are too many registers, but I still haven't figured out how the bytecode looks yet. But basically, spilling should not be needed at all, if the register allocator isn't as broken as it now is. Dan got some really evil subroutines, so we have RL test cases. We'll see. In that case, I'll focus on the register renaming, live-range analysis, etc. The memory leak, which might not be actually lost memory, but just wasteful useage of memory, is pretty much related to the spill code. Most of the grotesqueness of the allocator is in the spiller. If this has a high probability of changing, I'll focus on other stuff, like register renaming, which is a lot harder, but with higher longterm payoff. In light of this, I think I'll send in my patch, in case it is helpful. The problem is that some routine outside of imc_reg_alloc() is sending in conflicting register preallocation. My code does lot's of assertions, and finds this, one way or another. I'll put in a variable to turn off color consistency checking, which will probably then pass the tests. Bill JEff leo
Re: Register allocation/spilling and register volitility
Bill Coffman wrote: In that case, I'll focus on the register renaming, live-range analysis, etc. Great. In light of this, I think I'll send in my patch, in case it is helpful. The problem is that some routine outside of imc_reg_alloc() is sending in conflicting register preallocation. Yep. I've already found two flaws within pcc.c. One is fixed (thanks to your allocation conflict check). The second is tricky: we have _global_dumper() ... object.dumper(pmc, name)# args are P5, S5 and _global_dumper() does: .return(object) The function is returning the object in P5, which means that return conventions are indicating a PMC result in P5, and that get's copied into the caller's frame. But the caller doesn't use it. The usage of P5 in the caller (pmc) is clobbered... I don't know, if that is an abuse of prototyped calling conventions (we need a definition WRT call/return argument mismatches). If not, we can never assign possible return values around a function call. Bill leo
Re: Register allocation/spilling and register volitility
Date: Fri, 12 Nov 2004 04:10:20 -0800 From: Bill Coffman [EMAIL PROTECTED] And spilling? Well, I'm proposing a variable-sized register frame. With very little additions we could run with more then 32 registers per kind (there are a few bitmasks currently that would need adaptions, but not much). I think this is a great idea. So do I. Having a fixed number of registes is like having the C keywords register0 through registerN. All registers should be $Pn registers and let the guts deal with how that maps physically. I certainly don't care which $Pn register maps to which Pn register, should a PIR programmer ever care? As far as I can tell the only reason why a PIR programmer would even use Pn registers is for implicit arguments, leading me into my next point: I strongly feel that Leo was right to suggest removing implicit register arguments from opcodes. I responded to that email but somehow it didn't get to the list AFAICT. I can't think of any good reason or use case for implicit arguments. Can one be provided? From my experiences with Parakeet, it's just more weird, implicit stuff that I have to keep in my head that confuses new PIR programmers who look at my code. Since I'm on the box I might as well get it all out: I don't think unprototyped pcc calls should use the first 11 arguments in P registers and the rest in an array, again it just doesn't make any sense to me other than to violate DON (Don't Optimize Now, in tribute to Don Knuth). Just pass an array and be done with it. If the guts want to register-optimize arguments then fine, why does the PIR programmer have to do it by hand? (this is really an specific extension of the argument against implicit register usage) Unprototyped arguments are usually built dynamically at run time anyway or called in from an external language. I wager the performance benefit is miniscule to nothing compared to the contortious PIR code needed to pack and unpack arguments into registers, spill arguments into an array, avoid those 11 P regisers all the time, and the pressure it all puts on the register allocator. For prototyped calls I can (barely) see this as being useful, but not unprototyped. Fixed registers, implicit arguments, and calling contortions are dead chickens (coined, AFAIK, by Jim Fulton). They're the completely useless things you have to wave over the machine to get it to do what you want. The less dead chickens you have to wave over Parrot, the better. -Michel
Re: Register allocation/spilling and register volitility
At 10:51 AM -0800 11/12/04, Michel Pelletier wrote: Date: Fri, 12 Nov 2004 04:10:20 -0800 From: Bill Coffman [EMAIL PROTECTED] And spilling? Well, I'm proposing a variable-sized register frame. With very little additions we could run with more then 32 registers per kind (there are a few bitmasks currently that would need adaptions, but not much). I think this is a great idea. So do I. I don't. It's staying the way it is, as are the implicit operands to some of the ops, and the unprototyped calling conventions. -- Dan --it's like this--- Dan Sugalski even samurai [EMAIL PROTECTED] have teddy bears and even teddy bears get drunk
Re: register allocation
Bill Coffman [EMAIL PROTECTED] wrote: Hello, I have addressed almost all the code issues from the last patch. * I have the below failed tests. Failed TestStat Wstat Total Fail Failed List of Failed --- t/library/dumper.t 13 332813 13 100.00% 1-13 t/op/gc.t 1 256181 5.56% 13 t/pmc/sub.t 1 256781 1.28% 78 These tests should be sane. While the first two .t are fairly complex, t/pmc/sub_78.imc is quite simple. You could compare the produced code of parrot -o- foo.imc. Run it with -t to see, where output differs. * Still have a memory leak that blocks Dan's evil code from compiling. $ valgrind --leak-check=yes --show-reachable=yes ./parrot --leak-test foo.imc (there are some other leaks, so just have a look at increasing leak counts for increased symbol usage) * Run time isn't so great, We'll address that later. Please post patch. leo
Re: Register allocation/spilling and register volitility
Jeff Clites [EMAIL PROTECTED] wrote: OTOH it doesn't really matter, if the context structure is in the frame too. We'd just need to skip that gap. REG_INT(64) or I64 is as valid as I0 or I4, as long as it's assured, that it's exactly addressing the incoming argument area of the called function. A problem with this is that you can't be sure that you can actually have the next frame of registers adjacent to the current frame--they might already be taken. Imagine A calls B, then B creates a continuation and stores it in a global, then returns. Please read the proposal summary by Miroslav Silovic, keyword watermark. If frames aren't adjacent, normal argument copying can be done anyway. Keep the old-scheme registers inside the interpreter structure, *as well as* the new indirect registers. Call the registers in the interpreter I0 to I31, and the indirect registers I32 to whatever. That would need two different addressing modes depending on the register number. That'll lead to considerable code bloat: we'd have all possible permutations for direct/indirect registers. Doing it at runtime only would be a serious slowdown. It's not needed. I've a better scheme in mind, which addressess efficieny as well as argument passing. leo
Re: Register allocation/spilling and register volitility
On Nov 8, 2004, at 1:34 AM, Leopold Toetsch wrote: Jeff Clites [EMAIL PROTECTED] wrote: OTOH it doesn't really matter, if the context structure is in the frame too. We'd just need to skip that gap. REG_INT(64) or I64 is as valid as I0 or I4, as long as it's assured, that it's exactly addressing the incoming argument area of the called function. A problem with this is that you can't be sure that you can actually have the next frame of registers adjacent to the current frame--they might already be taken. Imagine A calls B, then B creates a continuation and stores it in a global, then returns. Please read the proposal summary by Miroslav Silovic, keyword watermark. If frames aren't adjacent, normal argument copying can be done anyway. This would seem to require the same types of runtime checks that you are objecting to below, unless user code is expected to explicitly check whether it's supposed to be assigning to I3 v I(3 + 32x?). And that still seems to require copying, in my case of a function a() which calls a function b() with the exact same arguments. Keep the old-scheme registers inside the interpreter structure, *as well as* the new indirect registers. Call the registers in the interpreter I0 to I31, and the indirect registers I32 to whatever. That would need two different addressing modes depending on the register number. That'll lead to considerable code bloat: we'd have all possible permutations for direct/indirect registers. Doing it at runtime only would be a serious slowdown. I discussed that in item (1) under Further notes. It's not needed. I've a better scheme in mind, which addressess efficieny as well as argument passing. And spilling? JEff
Re: Register allocation/spilling and register volitility
Jeff Clites [EMAIL PROTECTED] wrote: On Nov 8, 2004, at 1:34 AM, Leopold Toetsch wrote: If frames aren't adjacent, normal argument copying can be done anyway. This would seem to require the same types of runtime checks that you are objecting to below, Not runtime. The register allocator knows the amount of needed non-volatiles and volatile registers. Depending on that a call can place arguments directly into the incoming regs of the caller or not. Then if at runtime copying is needed (because e.g. frames aren't adjacent), the function arguments are copyied. This is fully transparent. I discussed that in item (1) under Further notes. Yep, saw that then ;) It's not needed. I've a better scheme in mind, which addressess efficieny as well as argument passing. And spilling? Well, I'm proposing a variable-sized register frame. With very little additions we could run with more then 32 registers per kind (there are a few bitmasks currently that would need adaptions, but not much). But basically, spilling should not be needed at all, if the register allocator isn't as broken as it now is. Dan got some really evil subroutines, so we have RL test cases. We'll see. JEff leo
Re: register allocation questions
Bill Coffman [EMAIL PROTECTED] wrote: Currently, here's how the register allocator is doing. Failed TestStat Wstat Total Fail Failed List of Failed --- t/library/dumper.t5 1280135 38.46% 1-2 5 8 13 4 tests and 51 subtests skipped. Failed 1/123 test scripts, 99.19% okay. 5/1956 subtests failed, 99.74% okay. I recall Leo, or someone, saying that the data dumper routines are not following the calling convention properly. I didn't look too close, but it's probably only the entry points: .sub _dumper _global_dumper() That's missing C.param statements, so there are none. I've learned a lot about how the compiler works at this point, and I'd like to contribute more :) Great. Thanks. Would you like a patch? Should I fix the data dumper routines first? Definitely - dumper.t tests are currently disabled, don't worry. What is all this talk about deferred registers? What should I do next? deferred registers doesn't make bells ring. What do you mean with that? - Send patch with explanation of algorithm. Yes, I think we are kind of doing this. It's best to pass the registers straight through though. Like when a variable will be used as a parameter, give it the appropriate reg num. Sort of outside the immediate scope of register coloring, but as I've learned, one must go a little beyond, to see the input and output for each sub. Well, it's not really outside of register coloring. It's part of parrot's calling conventions. You can think of it as part of the Parrot machine ABI. When you write a compiler for darwin-PPC, you have to pass function arguments in r3, r4, ... and you get a return value in r3. If you don't do that, you'll not be able to make any C library call. In Parrot we have similar calling conventions and the register allocator must be aware of that. E.g. when you have: some_function() # (i, j) = some_function() $I0 = I5 $I1 = I6 you know that I5 and I6 are return results. The live range or the previous usage of I5 and I6 is cut by the function call. Using the return values directly is of course an optimization and not strictly necessary, nethertheless the allocator has to be aware that the function call invalidates previous I5 and I6. But the idea is to have each sub declare how many registers to save/restore. Don't worry about save/restore. That's already changed. imcc doesn't emit any savetop/restoretop or similar opcodes any more. Registers are preserved now by allocating a new register frame for the subroutine. We can also minimize this number to match the physical architecture that parrot is running on (for an arch specific optimization). Yes. I did that some time ago in imcc/jit.c, which produced register mapping for the underlying hardware CPU. Parrot registers 0.. n-1 were given negative numbers and src/jit.c used these directly as mapping for CPU registers. This vastly reduced JIT startup time. Yes, yes, renaming! I want to do register renaming! Go for it please. p31 holds all the spill stuff. It's a pain. Maybe I'll move that around, but if p31 is used, it means that there is no more room for symbols, in at least one of the reg sets. I'd say that with register renaming, spilling will be very rare. But there is of course no need to use P31 for it. If we really have to spill we can optimize that a bit. - Bill Coffman leo
Re: register allocation questions
Leopold Toetsch [EMAIL PROTECTED] wrote: Bill Coffman [EMAIL PROTECTED] wrote: t/library/dumper.t5 1280135 38.46% 1-2 5 8 13 I didn't look too close, but it's probably only the entry points: .sub _dumper _global_dumper() Fixed. leo
Re: register allocation questions
At 9:36 PM +0200 10/27/04, Leopold Toetsch wrote: Dan Sugalski wrote: At 11:09 AM +0200 10/26/04, Leopold Toetsch wrote: So, if you want that really super efficient, you would allocate registers around function calls directly to that wanted register number, which should be in the SymReg's want_regno. While true, in the general case leaving 0-15 as non-preferred registers will probably make things easier. Those registers, especially the PMC ones, are going to see a lot of thrash as function calls are made, and it'll probably be easier to have them as scratch registers. Yep, that's the easy part ;) OTOH when the register allocator is doing register renaming anyway, the most inner loop with a function call should get registers assigned already matching the calling convemtions. With more then one call at that loop level, you have to move around registers anyway. Oh, sure, but keeping your scratch PMCs out of the way makes life a lot easier for the register coloring algorithms. Might not be optimal, but if it makes life simpler to start, optimal can come later. It's distinctly possible, of course, that there'll be very little pressure to actually *use* them for most code, as we've got plenty of registers in general. That's the hope, at least. Yes, 16 regs are plenty and do suffice for all normal[1] code. Assigning to wanted reg numbers for a function is a nice optimization. [1] all except Dan's 6000 lines subroutines :) Did you start creating real subs for your code already? I wish. :( Unfortunately not, outside some simple stuff, and I doubt I will. The language just doesn't lend itself to that sort of thing. We're going to add actual real subroutines to the language after we roll out into production, but that doesn't help now, alas. -- Dan --it's like this--- Dan Sugalski even samurai [EMAIL PROTECTED] have teddy bears and even teddy bears get drunk
Re: register allocation questions
Hi all, Thanks for your continued comments. Btw, I usually read all the parrot list, so don't think I'm not paying attention. Currently, here's how the register allocator is doing. Failed TestStat Wstat Total Fail Failed List of Failed --- t/library/dumper.t5 1280135 38.46% 1-2 5 8 13 4 tests and 51 subtests skipped. Failed 1/123 test scripts, 99.19% okay. 5/1956 subtests failed, 99.74% okay. I recall Leo, or someone, saying that the data dumper routines are not following the calling convention properly. So I've decided not to worry about it too much. It passes the other tests, plus the randomized tests that I created, up to 150 symbols. At that range, it still takes about 20x longer than g++ -O2, for equivalent programs to compile (see gen4.pl). Also, it is currently running about O(n^2) for n symbols, where the old one was running about O(n^3) from my analysis. The spill code is still very expensive, and has a large constant associate. I also have data, which is attached. The difference doesn't show up until a lot of spilling is going on, around 80 symbols or so. I've learned a lot about how the compiler works at this point, and I'd like to contribute more :) Would you like a patch? Should I fix the data dumper routines first? What is all this talk about deferred registers? What should I do next? Well, I'm making some comments on the below stuff. On Thu, 28 Oct 2004 09:07:05 -0400, Dan Sugalski [EMAIL PROTECTED] wrote: At 9:36 PM +0200 10/27/04, Leopold Toetsch wrote: Dan Sugalski wrote: At 11:09 AM +0200 10/26/04, Leopold Toetsch wrote: So, if you want that really super efficient, you would allocate registers around function calls directly to that wanted register number, which should be in the SymReg's want_regno. Yes, I think we are kind of doing this. It's best to pass the registers straight through though. Like when a variable will be used as a parameter, give it the appropriate reg num. Sort of outside the immediate scope of register coloring, but as I've learned, one must go a little beyond, to see the input and output for each sub. While true, in the general case leaving 0-15 as non-preferred registers will probably make things easier. Those registers, especially the PMC ones, are going to see a lot of thrash as function calls are made, and it'll probably be easier to have them as scratch registers. I guess I don't agree. I'd like to pack down the number of registers used to a minimum. Then when a function is called, only those needed registers are copied in/out. Don't think the functionality exists. But the idea is to have each sub declare how many registers to save/restore. This would then save 0-k such registers. Where k is the number of registers used by the sub. Pack 'em down, minimize the number needed. We can also minimize this number to match the physical architecture that parrot is running on (for an arch specific optimization). The imc_reg_alloc function does not have 32 hard coded in there (well a little bit, but can be easily changed). It's pretty dynamic. Yep, that's the easy part ;) OTOH when the register allocator is doing register renaming anyway, the most inner loop with a function call should get registers assigned already matching the calling convemtions. With more then one call at that loop level, you have to move around registers anyway. Yes, yes, renaming! I want to do register renaming! Oh, sure, but keeping your scratch PMCs out of the way makes life a lot easier for the register coloring algorithms. Might not be optimal, but if it makes life simpler to start, optimal can come later. p31 holds all the spill stuff. It's a pain. Maybe I'll move that around, but if p31 is used, it means that there is no more room for symbols, in at least one of the reg sets. [1] all except Dan's 6000 lines subroutines :) Did you start creating real subs for your code already? I wish. :( Unfortunately not, outside some simple stuff, and I doubt I will. The language just doesn't lend itself to that sort of thing. We're going to add actual real subroutines to the language after we roll out into production, but that doesn't help now, alas. Interesting. I'd like to test on something like that. Maybe SPEC99 as well. - Bill Coffman compile.dat Description: Binary data compile.plot Description: Binary data attachment: compile.png
Re: register allocation questions
At 3:08 PM -0700 10/28/04, Bill Coffman wrote: It passes the other tests, plus the randomized tests that I created, up to 150 symbols. At that range, it still takes about 20x longer than g++ -O2, for equivalent programs to compile (see gen4.pl). Still, that's not bad. Also, it is currently running about O(n^2) for n symbols, where the old one was running about O(n^3) from my analysis. The spill code is still very expensive, and has a large constant associate. I also have data, which is attached. The difference doesn't show up until a lot of spilling is going on, around 80 symbols or so. I'm curious to see how it behaves once the spilling gets up into the 1000+ symbol range. Dropping from cubic to quadratic time ought to make a not-insignificant change in the running time, even if that constant's pretty big. :) I've learned a lot about how the compiler works at this point, and I'd like to contribute more :) Would you like a patch? Yes! Oh, yeah, definitely. Well, I'm making some comments on the below stuff. On Thu, 28 Oct 2004 09:07:05 -0400, Dan Sugalski [EMAIL PROTECTED] wrote: At 9:36 PM +0200 10/27/04, Leopold Toetsch wrote: Dan Sugalski wrote: At 11:09 AM +0200 10/26/04, Leopold Toetsch wrote: While true, in the general case leaving 0-15 as non-preferred registers will probably make things easier. Those registers, especially the PMC ones, are going to see a lot of thrash as function calls are made, and it'll probably be easier to have them as scratch registers. I guess I don't agree. I'd like to pack down the number of registers used to a minimum. Then when a function is called, only those needed registers are copied in/out. Don't think the functionality exists. But the idea is to have each sub declare how many registers to save/restore. This would then save 0-k such registers. Where k is the number of registers used by the sub. Pack 'em down, minimize the number needed. We can also minimize this number to match the physical architecture that parrot is running on (for an arch specific optimization). The imc_reg_alloc function does not have 32 hard coded in there (well a little bit, but can be easily changed). It's pretty dynamic. By all means, go for it. I certainly don't want to curb your enthusiasm. It's the right thing to do, ultimately. I didn't want to presume on your time. Happy to have it, of course. :) [1] all except Dan's 6000 lines subroutines :) Did you start creating real subs for your code already? I wish. :( Unfortunately not, outside some simple stuff, and I doubt I will. The language just doesn't lend itself to that sort of thing. We're going to add actual real subroutines to the language after we roll out into production, but that doesn't help now, alas. Interesting. I'd like to test on something like that. Maybe SPEC99 as well. If you've got a patch, I'd be more than happy to give it a whirl, and I can likely get you a copy of the code in question to give a run on. -- Dan --it's like this--- Dan Sugalski even samurai [EMAIL PROTECTED] have teddy bears and even teddy bears get drunk
Re: register allocation
When I cvs up'd, cleared and reConfigure'd I got these stats: Failed Test Stat Wstat Total Fail Failed List of Failed --- t/library/streams.t1 256211 4.76% 11 t/op/gc.t 1 256181 5.56% 13 4 tests and 66 subtests skipped. Failed 2/124 test scripts, 98.39% okay. 2/1957 subtests failed, 99.90% okay. I'm curious to see how it behaves once the spilling gets up into the 1000+ symbol range. Dropping from cubic to quadratic time ought to make a not-insignificant change in the running time, even if that constant's pretty big. :) I think it's a bit more complicated. M=number lines in code, N=number variables. - Time = O(M^2+N^2) - old time = O(M^2 +N^3) Not quite sure of this either. But the N^3 eventually dominates, I think. The data seems to bear this out. There are more fixes I'd like to make as well. I spotted several things that could be fixed. And I think the spill code can be optimized a lot to reduce the big-O time as well. More statistics #vars v. time in seconds: #vars gcc parrot2 200 7.92 89.20 201 11.86 146.31 202 18.11 246.37 203 9.54 107.88 204 11.81 134.60 205 14.75 190.95 206 13.25 161.83 207 10.63 138.83 208 11.02 117.73 209 7.14 88.29 210 15.14 176.69 I am also running gen3.pl with 1000 vars. It's still on gcc. We'll see if parrot doesn't crash my 1Gigabyte, 2.4Ghz workstation tonight. Would you like a patch? Yes! Oh, yeah, definitely. [...] If you've got a patch, I'd be more than happy to give it a whirl, and I can likely get you a copy of the code in question to give a run on. Soon, I'll send one the proper way. The imc_reg_alloc function does not have 32 hard coded in there (well a little bit, but can be easily changed). It's pretty dynamic. By all means, go for it. I certainly don't want to curb your enthusiasm. It's the right thing to do, ultimately. I didn't want to presume on your time. Happy to have it, of course. :) Thanks. I've had a great time doing this. Remembering graph algorithms and compilers. Great fun! I'd also like to contribute to getting Parrot out there, sooner rather than later. So if I can help with that, I'd like to hear suggestions. -Bill
Re: register allocation questions
Thanks Matt, I hope I can help out. The patch I am submitting actually does simplify register coloring a bit. I've been waiting for perl6 with so much anticipation, I just couldn't stand it any more, and I had to participate. -Bill On Thu, 28 Oct 2004 18:17:57 -0400, Matt Fowles [EMAIL PROTECTED] wrote: Bill~ I have to say that I am really impressed by all of the work that you are doing, and if you can make the internals of imcc a little more approachable, you would be doing a great service. Thanks, Matt
Re: register allocation questions
At 11:09 AM +0200 10/26/04, Leopold Toetsch wrote: Bill Coffman [EMAIL PROTECTED] wrote: 1) In the existing parrot code, when a register is assigned, it uses the following code: int c = (color + MAX_COLOR/2) % MAX_COLOR; Thus, it seems to prefer to use register #16 and up, first, before it uses 0-15. This is mistifying to me, since it's not so trivial to code it this way, Well, some time ago, register allocation started at zero. The split of register frames in upper and lower halfs *plus* the premature optimization to save only the upper half of registers made it necessary to allocate from 16 up. But things are a bit more compilcated. For function calls, we are passing arguments from register x5 ..x15 and I0..I4 plus some more have a special meaning. See docs/ppds/pdd03_calling_conventions for all the details. The same convention is used for function returns. So, if you want that really super efficient, you would allocate registers around function calls directly to that wanted register number, which should be in the SymReg's want_regno. While true, in the general case leaving 0-15 as non-preferred registers will probably make things easier. Those registers, especially the PMC ones, are going to see a lot of thrash as function calls are made, and it'll probably be easier to have them as scratch registers. It's distinctly possible, of course, that there'll be very little pressure to actually *use* them for most code, as we've got plenty of registers in general. That's the hope, at least. -- Dan --it's like this--- Dan Sugalski even samurai [EMAIL PROTECTED] have teddy bears and even teddy bears get drunk
Re: register allocation questions
Dan Sugalski wrote: At 11:09 AM +0200 10/26/04, Leopold Toetsch wrote: So, if you want that really super efficient, you would allocate registers around function calls directly to that wanted register number, which should be in the SymReg's want_regno. While true, in the general case leaving 0-15 as non-preferred registers will probably make things easier. Those registers, especially the PMC ones, are going to see a lot of thrash as function calls are made, and it'll probably be easier to have them as scratch registers. Yep, that's the easy part ;) OTOH when the register allocator is doing register renaming anyway, the most inner loop with a function call should get registers assigned already matching the calling convemtions. With more then one call at that loop level, you have to move around registers anyway. It's distinctly possible, of course, that there'll be very little pressure to actually *use* them for most code, as we've got plenty of registers in general. That's the hope, at least. Yes, 16 regs are plenty and do suffice for all normal[1] code. Assigning to wanted reg numbers for a function is a nice optimization. [1] all except Dan's 6000 lines subroutines :) Did you start creating real subs for your code already? leo
Re: register allocation questions
Bill Coffman [EMAIL PROTECTED] wrote: Hello All, I have been hard at work, trying to grok the reg_alloc.c code, and with some success. My code is assigning registers, so that none are conflicting (which I double-verify), and I'm getting to the end of make. Wow. 1) In the existing parrot code, when a register is assigned, it uses the following code: int c = (color + MAX_COLOR/2) % MAX_COLOR; Thus, it seems to prefer to use register #16 and up, first, before it uses 0-15. This is mistifying to me, since it's not so trivial to code it this way, Well, some time ago, register allocation started at zero. The split of register frames in upper and lower halfs *plus* the premature optimization to save only the upper half of registers made it necessary to allocate from 16 up. But things are a bit more compilcated. For function calls, we are passing arguments from register x5 ..x15 and I0..I4 plus some more have a special meaning. See docs/ppds/pdd03_calling_conventions for all the details. The same convention is used for function returns. So, if you want that really super efficient, you would allocate registers around function calls directly to that wanted register number, which should be in the SymReg's want_regno. 2) When function imc_reg_alloc (the main register allocation thingy) is called, some of the variables have already expressed a preference as to which register they want. I understand that this can optimize certain parameter passing, etc. The question that arrises is, what if two conflicting variables want the same color (reg#)? Obviously, they don't get their way. But what are the consequences, and what must be done to fix this. Ah, ok. For each function call registers are moved around to the correct number, if they aren't there. This is done by code in imcc/pcc.c. But the problem still is there that your code really shouldn't assign these registers in a conflicting way. ... Incidentally, reg allocation is done on the following subs: _main __library_dumper_onload _dumper _register_dumper _global_dumper [ ...] Sure. Register allocation code is done per .sub/.end always. So that's fine. But you might start with simpler source code having one or two function calls only. Maybe someone can point me at something to look at to fix this. I'll provide the patch if someone would like to play with it. The main problem probably is to follow register usage in calling conventions. BTW: preserving the upper half of registers will be tossed soon. To simulate this behavior in current CVS, you could run the code with -Oc, which should preserve all allocated registers. I don't know, when the indirect register frame patch whill hit CVS, but it should be in a few days. With that in place, you can allocate registers as you like, with the caveat that rules in PDD03 are used. Thanks to all who made it this far, Bill Coffman leo
Re: register allocation questions
At 6:18 PM -0700 10/25/04, Bill Coffman wrote: Hello All, I have been hard at work, trying to grok the reg_alloc.c code, and with some success. My code is assigning registers, so that none are conflicting (which I double-verify), and I'm getting to the end of make. Woohoo! A non-trivial thing. :) Here's some questions: 1) In the existing parrot code, when a register is assigned, it uses the following code: for (color = 0; color MAX_COLOR; color++) { int c = (color + MAX_COLOR/2) % MAX_COLOR; if (!colors[c]) { reglist[x]-color = c; debug(interpreter, DEBUG_IMC, #[%s] provisionally gets color [%d] (%d free colors, score %d)\n, reglist[x]-name, c, free_colors, reglist[x]-score); break; } } Thus, it seems to prefer to use register #16 and up, first, before it uses 0-15. This is mistifying to me, since it's not so trivial to code it this way, and yet I cannot figure out why this color preference. It seems like maybe there's a reason which I need to understand here. My new code doesn't do this behaviour. Registers 0-15 are the registers used for sub/method/function calls. Leaving them empty makes calls easier -- if the low-registers are in use you need to shuffle the contents somewhere else. [Snippage] Looks like we've got some... incidental behavior. This isn't particularly good. Feel free to rip out anything you want -- this is all internals stuff, much of it's first-generation code, and nobody's got any attachment to it. (I hope :) Incidentally, reg allocation is done on the following subs: _main __library_dumper_onload _dumper _register_dumper _global_dumper __library_data_dumper_onload dumper __library_data_dumper_default_onload dumpWithName dumpCached dumpProperties dumpHash dumpStringEscaped pmcDefault pmcPMCArray pmcIntList pmcStringArray pmcPerlArray pmcPerlHash pmcPerlString pmcPerlInt pmcPerlNum pmcPerlUndef pmcSub pmcNull pmcOrderedHash pmcManagedStruct pmcUnManagedStruct pmcArray __library_data_dumper_base_onload prepare cache createIndent indent newIndent deleteIndent dump simple String which seems like a lot to me, but I guess it's compiling a lot of stuff. That doesn't seem right. Leo may well have an idea of why. On the other hand, I'm fine with ripping a lot of this stuff out and centralizing it. -- Dan --it's like this--- Dan Sugalski even samurai [EMAIL PROTECTED] have teddy bears and even teddy bears get drunk
Re: register allocation
On August 9, 2004 02:34 pm, Dan Sugalski wrote: At 4:12 PM -0400 8/9/04, Melvin Smith wrote: At 11:22 AM 8/9/2004 -0400, Dan Sugalski wrote: At 1:13 PM +0200 8/8/04, Leopold Toetsch wrote: Leopold Toetsch [EMAIL PROTECTED] wrote: You can verify this step by running -v: $ parrot -v inv_mod.imc 21 | grep symb build_reglist: 5783 symbols allocate_non_interfering, now: 2205 symbols It really should help. It did. I'm not sure how long the build ran (the last time I checked it'd racked up over 150 minutes of CPU time on a 2GHz dual-processor system, and may have run up to about 180 CPU minutes) but it finished. Now we need to cut down the runtimes just a touch. :) DSWEPIC Dan Stop Writing Evil Pathological Intermediate Code. Hah! I wish I was writing this. Instead, I'm writing a compiler for an Evil Pathological Language. :( Perl? :) -- Thomas Fjellstrom [EMAIL PROTECTED] http://strangesoft.net
Re: register allocation
At 12:24 AM -0600 8/13/04, Thomas Fjellstrom wrote: On August 9, 2004 02:34 pm, Dan Sugalski wrote: At 4:12 PM -0400 8/9/04, Melvin Smith wrote: At 11:22 AM 8/9/2004 -0400, Dan Sugalski wrote: At 1:13 PM +0200 8/8/04, Leopold Toetsch wrote: Leopold Toetsch [EMAIL PROTECTED] wrote: You can verify this step by running -v: $ parrot -v inv_mod.imc 21 | grep symb build_reglist: 5783 symbols allocate_non_interfering, now: 2205 symbols It really should help. It did. I'm not sure how long the build ran (the last time I checked it'd racked up over 150 minutes of CPU time on a 2GHz dual-processor system, and may have run up to about 180 CPU minutes) but it finished. Now we need to cut down the runtimes just a touch. :) DSWEPIC Dan Stop Writing Evil Pathological Intermediate Code. Hah! I wish I was writing this. Instead, I'm writing a compiler for an Evil Pathological Language. :( Perl? :) You kidding? Perl's got nothing on this... thing. More than all the annoyance, with none of the redeeming features, though it does work. (On the other hand, I'm quite looking forward to being able to let it die in obscurity where it belongs :) -- Dan --it's like this--- Dan Sugalski even samurai [EMAIL PROTECTED] have teddy bears and even teddy bears get drunk
Re: register allocation
Melvin Smith [EMAIL PROTECTED] wrote: Since that is my comment I'll comment. When I wrote the allocator, I oriented it around a basic block, and I figured a basic block would never have more than a few hundred symbols, so the N x N array was the fastest possible data structure for keeping the interference data. The problem is that the matrix isn't built per block but per compilation unit. And all symbols of all 4 kinds are inside one graph albeit they can't interfer. The first thing todo is probably to split the matrix into four pieces. The whole register allocation could be done in four passes too, but that adds runtime overhead. OTOH data structures get smaller. I'm not sure how many symbols we are talking about here, it would help if someone threw out a number before I go and make the wrong statement. Dan posted some numbers: P 26951, S 2099, I 8878 for evil program. The program dies because of memory shortage when setting up the register life information. I'm pretty sure that it would help if imcc had two modes of symbol analysis. Routine wide register allocation ends up with better quality code, but at the cost of compilation time and possible cases that can't be handled with the current design. If imcc sees too many symbols, it should try to break the sub down into basic blocks for interference analysis, but that also adds additional code requirements for when you put the basic blocks together and figure out loads, spills and reuse across basic blocks. Well, I think we have the information to allocate registers per basic block. It would need some rearrangment of data structures, though. Anyway, I already have added a first allocation run that allocates all temporary registers of all blocks in one bunch. This pass doesn't need life range structures. All these aready allocated registers are not part of the next computations and are not in the interference gprah any more. -Melvin leo
Re: register allocation
At 1:13 PM +0200 8/8/04, Leopold Toetsch wrote: Leopold Toetsch [EMAIL PROTECTED] wrote: You can verify this step by running -v: $ parrot -v inv_mod.imc 21 | grep symb build_reglist: 5783 symbols allocate_non_interfering, now: 2205 symbols It really should help. It did. I'm not sure how long the build ran (the last time I checked it'd racked up over 150 minutes of CPU time on a 2GHz dual-processor system, and may have run up to about 180 CPU minutes) but it finished. Now we need to cut down the runtimes just a touch. :) -- Dan --it's like this--- Dan Sugalski even samurai [EMAIL PROTECTED] have teddy bears and even teddy bears get drunk
Re: register allocation
At 11:22 AM 8/9/2004 -0400, Dan Sugalski wrote: At 1:13 PM +0200 8/8/04, Leopold Toetsch wrote: Leopold Toetsch [EMAIL PROTECTED] wrote: You can verify this step by running -v: $ parrot -v inv_mod.imc 21 | grep symb build_reglist: 5783 symbols allocate_non_interfering, now: 2205 symbols It really should help. It did. I'm not sure how long the build ran (the last time I checked it'd racked up over 150 minutes of CPU time on a 2GHz dual-processor system, and may have run up to about 180 CPU minutes) but it finished. Now we need to cut down the runtimes just a touch. :) DSWEPIC Dan Stop Writing Evil Pathological Intermediate Code. -Melvin
Re: register allocation
On Mon, 2004-08-09 at 13:12, Melvin Smith wrote: DSWEPIC Dan Stop Writing Evil Pathological Intermediate Code. Technically, that should be CWSWCTWEPIC, or Compiler Writers Stop Writing Compilers That Write Evil Pathological Intermediate Code. Automating the process of shuddering while reading generated code certainly helped me sleep more restfully. -- c
Re: register allocation
Dan Sugalski [EMAIL PROTECTED] wrote: The code, when I try with -v, gives: debug = 0x0 Reading forms/order.imc using optimization '0' (0) Starting parse... warning:imcc:build_reglist: probably too small HASH_SIZE (29451 symbols) error:imcc:make_life_range: Out of mem calloc-ing 12 bytes And then dies. Ok. Too many symbols still. I've now enabled again ALLOCATE_HACK (which should get renamed, if it works :) Before the life analysis is started, 3 registers of one kind get allocated in one pass, if they are in one basic block only and don't overlap. This should color a lot of the temps immediately. These already colored temps are then not considered in further computations. You can verify this step by running -v: build_reglist: 36 symbols allocate_non_interfering, now: 12 symbols This is from a program that does arithmetics with the typical chain of intermediate temps. leo
Re: register allocation
At 10:54 AM 8/7/2004 -0400, Dan Sugalski wrote: At 12:45 PM +0200 8/7/04, Leopold Toetsch wrote: Sean O'Rourke [EMAIL PROTECTED] wrote: [EMAIL PROTECTED] (Leopold Toetsch) writes: The interference_graph size is n_symbols * n_symbols * sizeof(a_pointer). This might already be too much. 2) There is a note in the source code that the interference graph could be done without the N x N graph array. Any hints welcome (Angel Faus!). Since that is my comment I'll comment. When I wrote the allocator, I oriented it around a basic block, and I figured a basic block would never have more than a few hundred symbols, so the N x N array was the fastest possible data structure for keeping the interference data. I'm not sure how many symbols we are talking about here, it would help if someone threw out a number before I go and make the wrong statement. I'm pretty sure that it would help if imcc had two modes of symbol analysis. Routine wide register allocation ends up with better quality code, but at the cost of compilation time and possible cases that can't be handled with the current design. If imcc sees too many symbols, it should try to break the sub down into basic blocks for interference analysis, but that also adds additional code requirements for when you put the basic blocks together and figure out loads, spills and reuse across basic blocks. There were a lot of things that I had planned for imcc that just never materialized. It is still very primitive when it comes to symbol analysis, even with all the work Leo has done. -Melvin
Re: register allocation
[EMAIL PROTECTED] (Leopold Toetsch) writes: The interference_graph size is n_symbols * n_symbols * sizeof(a_pointer). This might already be too much. 2) There is a note in the source code that the interference graph could be done without the N x N graph array. Any hints welcome (Angel Faus!). It looks like the way things are used in the code, you can use an adjacency list instead of the current adjacency matrix for the graph. If interference is typically sparse, which I think it should be, this is a win. So Dan -- while you're in there with the debugging statements, you might also want to keep a count of how many registers interfere with each other. /s
Re: register allocation
Sean O'Rourke [EMAIL PROTECTED] wrote: [EMAIL PROTECTED] (Leopold Toetsch) writes: The interference_graph size is n_symbols * n_symbols * sizeof(a_pointer). This might already be too much. 2) There is a note in the source code that the interference graph could be done without the N x N graph array. Any hints welcome (Angel Faus!). It looks like the way things are used in the code, you can use an adjacency list instead of the current adjacency matrix for the graph. Yeah. Or a bitmap. Or still better, create the interference graph per basic block. Should be much smaller then. /s leo
Re: register allocation
At 2:23 PM +0200 8/6/04, Leopold Toetsch wrote: Leopold Toetsch wrote: 4.2) Basic blocks, life span of symbols inside these blocks and loop detection. Baisc block generation depends on two items: * labels - a possible branch target * branching opcodes Detection of branching opcodes depends currently in a not too obvious way[1] on directives in ops/*.ops like: goto NEXT() # no branch, just execute next instruction goto ADDRESS(next) # a branch Now all find_global and find_lex opcodes *did* use goto ADDRESS(next). This might be some relict of experiments with exceptions. These opcodes might branch elsewhere, if an exception had occured. But its not a branch. I've change these opcodes to use goto NEXT() and excluded PARROT_JUMP_ENEXT from considering this opcode as a branch. This fix should vastly improve PIR programs that make heavy use of find_lex or find_global opcodes (hello Dan). Hello indeed! Yeah, I make a near-insane use of find_global, as a way to avoid huge numbers of .local declarations. This patch took care of the double-spill error and a few of the out of memory errors for too-complex source. I think I'm down to five forms that don't compile because they're too nasty for parrot, which is better than it's been. (I've been tweaking the compiler some as I go along as well to try and simplify things, so there have been wins there too) -- Dan --it's like this--- Dan Sugalski even samurai [EMAIL PROTECTED] have teddy bears and even teddy bears get drunk
Re: register allocation
At 9:21 AM -0700 8/6/04, Sean O'Rourke wrote: [EMAIL PROTECTED] (Leopold Toetsch) writes: The interference_graph size is n_symbols * n_symbols * sizeof(a_pointer). This might already be too much. 2) There is a note in the source code that the interference graph could be done without the N x N graph array. Any hints welcome (Angel Faus!). It looks like the way things are used in the code, you can use an adjacency list instead of the current adjacency matrix for the graph. If interference is typically sparse, which I think it should be, this is a win. So Dan -- while you're in there with the debugging statements, you might also want to keep a count of how many registers interfere with each other. I'll dig in and see if I can throw that info out. I may also add a counter to see how many trips through the spill loop each program takes. That's where most of the time's being taken, and I'm wondering if there's a cutoff point--if you hit N times around the loop you're there forever, or something. -- Dan --it's like this--- Dan Sugalski even samurai [EMAIL PROTECTED] have teddy bears and even teddy bears get drunk
Re: register allocation
At 12:45 PM +0200 8/7/04, Leopold Toetsch wrote: Sean O'Rourke [EMAIL PROTECTED] wrote: [EMAIL PROTECTED] (Leopold Toetsch) writes: The interference_graph size is n_symbols * n_symbols * sizeof(a_pointer). This might already be too much. 2) There is a note in the source code that the interference graph could be done without the N x N graph array. Any hints welcome (Angel Faus!). It looks like the way things are used in the code, you can use an adjacency list instead of the current adjacency matrix for the graph. Yeah. Or a bitmap. Or still better, create the interference graph per basic block. Should be much smaller then. I'm not 100% sure since I've not thrown in a counter yet, but I think that it's ultimately not a size problem, rather a case of code that can't *ever* be properly colored with the algorithm and implementation we have. (Not that reducing the size is a bad thing, though -- I'm all for that) -- Dan --it's like this--- Dan Sugalski even samurai [EMAIL PROTECTED] have teddy bears and even teddy bears get drunk
[PATCH] Re: register allocation
On Aug-07, Leopold Toetsch wrote: Sean O'Rourke [EMAIL PROTECTED] wrote: [EMAIL PROTECTED] (Leopold Toetsch) writes: The interference_graph size is n_symbols * n_symbols * sizeof(a_pointer). This might already be too much. 2) There is a note in the source code that the interference graph could be done without the N x N graph array. Any hints welcome (Angel Faus!). It looks like the way things are used in the code, you can use an adjacency list instead of the current adjacency matrix for the graph. Yeah. Or a bitmap. An adjacency list would definitely be much smaller, but I'd be concerned that it would slow down searches too much. I think the bitmap might be worth a try just to see how much the size matters. Since this is an n^2 issue, splitting out the four different register types could help -- except that I'd guess that most code with excessive register usage probably uses one type of register much more than the rest. Anyway, I've attached a patch that uses bitmaps instead of SymReg*'s, which should give a factor of 32 size reduction. I've only tested it by doing a 'make test' and verifying that the several dozen test failures are the same before and after (I don't think things are actually that broken; I think the make system is), but for all I know it's not even calling the code. That's what you get when I only have a two hour hacking window and I've never looked at the code before. Or still better, create the interference graph per basic block. Should be much smaller then. Huh? Is register allocation done wholly within basic blocks? I thought the point of the graph was to compute interference across basic blocks. I guess I should go and actually read the code. Index: imcc/reg_alloc.c === RCS file: /cvs/public/parrot/imcc/reg_alloc.c,v retrieving revision 1.14 diff -u -r1.14 reg_alloc.c --- imcc/reg_alloc.c23 Apr 2004 14:09:33 - 1.14 +++ imcc/reg_alloc.c7 Aug 2004 07:11:08 - @@ -41,7 +41,7 @@ static void compute_du_chain(IMC_Unit * unit); static void compute_one_du_chain(SymReg * r, IMC_Unit * unit); static int interferes(IMC_Unit *, SymReg * r0, SymReg * r1); -static int map_colors(int x, SymReg ** graph, int colors[], int typ); +static int map_colors(IMC_Unit *, int x, int * graph, int colors[], int typ); #ifdef DO_SIMPLIFY static int simplify (IMC_Unit *); #endif @@ -58,12 +58,46 @@ /* XXX FIXME: Globals: */ static IMCStack nodeStack; -static SymReg** interference_graph; -/* -static SymReg** reglist; -*/ +static int* interference_graph; static int n_symbols; +static int* ig_get_word(int i, int j, int N, int* graph, int* bit_ofs) +{ +int bit = i * N + j; +*bit_ofs = bit % sizeof(*graph); +return graph[bit / sizeof(*graph)]; +} + +static void ig_set(int i, int j, int N, int* graph) +{ +int bit_ofs; +int* word = ig_get_word(i, j, N, graph, bit_ofs); +*word |= (1 bit_ofs); +} + +static void ig_clear(int i, int j, int N, int* graph) +{ +int bit_ofs; +int* word = ig_get_word(i, j, N, graph, bit_ofs); +*word = ~(1 bit_ofs); +} + +static int ig_test(int i, int j, int N, int* graph) +{ +int bit_ofs; +int* word = ig_get_word(i, j, N, graph, bit_ofs); +return *word (1 bit_ofs); +} + +static int* ig_allocate(int N) +{ +// size is N*N bits, but we want don't want to allocate a partial +// word, so round up to the nearest multiple of sizeof(int). +int need_bits = N * N; +int num_words = (need_bits + sizeof(int) - 1) / sizeof(int); +return (int*) calloc(num_words, sizeof(int)); +} + /* imc_reg_alloc is the main loop of the allocation algorithm. It operates * on a single compilation unit at a time. */ @@ -446,6 +480,12 @@ /* creates the interference graph between the variables. * + * data structure is a 2-d array 'interference_graph' where row/column + * indices represent the same index in the list of all symbols + * (unit-reglist) in the current compilation unit. The value in the + * 2-d array interference_graph[i][j] is the symbol unit-reglist[j] + * itself. + * * two variables interfere when they are alive at the * same time */ @@ -461,7 +501,7 @@ /* Construct a graph N x N where N = number of symbolics. * This piece can be rewritten without the N x N array */ -interference_graph = calloc(n_symbols * n_symbols, sizeof(SymReg*)); +interference_graph = ig_allocate(n_symbols); if (interference_graph == NULL) fatal(1, build_interference_graph,Out of mem\n); unit-interference_graph = interference_graph; @@ -475,8 +515,8 @@ if (!unit-reglist[y]-first_ins) continue; if (interferes(unit, unit-reglist[x], unit-reglist[y])) { -interference_graph[x*n_symbols+y] = unit-reglist[y]; -interference_graph[y*n_symbols+x] = unit-reglist[x]; +ig_set(x, y, n_symbols, interference_graph);
Re: register allocation (was: Spilling problems)
At 9:51 AM +0200 8/6/04, Leopold Toetsch wrote: Dan Sugalski wrote: ... extraordinarily large (60-80k lines) ... ... dying with an out-of-memory error. How many symbols does the sub have? Please set a breakpoint at imcc/reg_alloc.c:468 and inspect: n_symbols (or just create a debug print statement there) The interference_graph size is n_symbols * n_symbols * sizeof(a_pointer). This might already be too much. I don't even get to this spot. (Though at some point we need to do something about some of this code, as I see a fair number of globals, which makes reentrancy an issue. But that's for later) The code, when I try with -v, gives: debug = 0x0 Reading forms/order.imc using optimization '0' (0) Starting parse... warning:imcc:build_reglist: probably too small HASH_SIZE (29451 symbols) error:imcc:make_life_range: Out of mem calloc-ing 12 bytes And then dies. 2) There is a note in the source code that the interference graph could be done without the N x N graph array. Any hints welcome (Angel Faus!). That'd probably be a good thing... :) 3) We don't rename registers (except when it comes to spilling). So when a symbol starts its life somewhere at the beginning of the subroutine and is used down to the end, it interfers with each other symbol in that range. One register is blocked permanently. When there are around 30 such long-ranged symbols, spilling starts, which means that the variable is stored into and fetched from a spill array. This could be avoided, if the symbol is a lexical or a global variable. On each fetch of this variable it could get a fresh register. This currently depends on the source code: foo = global foo # bad $P1234 = global foo # good - if each fetch increments the $P... Right, and this is good info to leave in the archive. I've long-since converted away from named .locals to $Pxxx temps, which made quite a difference -- probably made another 15-20% of my source compilable. Most of the temps are very short-lived, though I put in a caching scheme to keep from fetching things too often. (All my library subs have to be fetched, and I tried to cut down on the number of redundant fetches there) 4) Proper register allocation depends on a correct life analysis of the symbol, which depends basically on 2 things: 4.1) the usage of the register in the opcode. Have a look at ops/*.ops: op add(in PMC, in PMC) The destination (left hand side) of add is a in register, the PMC has to exist beforehand as well as the source operand of course. op find_lex(out PMC, in STR) The destination PMC register is filled with the address of the lexical. It gets a new pointer there, the usage is written as out. This means for the register allocator that the life range of this register starts here. Whatever was in that register is dead. But we have opcodes that silently have such an influence on the registers life range. These are mainly stack pop opcodes. Some of these opcodes are handled manually inside imcc, but not all. This leads to wrong or too long life ranges and to spilling stress. There was a TODO item (posted on the list) to mark up all opcodes in the source to carry along their register usage. This is still badly required. Hrm. Put a notation scheme in place and I'll see about getting that done. 4.2) Basic blocks, life span of symbols inside these blocks and loop detection. A basic block is a stream of opcodes that runs in one step without a branch and it isn't a branch target - no other opcode can jump inside such a basic block. If a symbol is used as in inside of such a block, it had to exist beforehand. Its life range is propagated down to the previous block. This is straight forward for branches but gets nasty with loops and subroutines. The code in cfg.c is cluttered by exceptions that try to follow the control flow of stack-based subroutines (bsr) and CPS subroutines. invoke. This code needs cleaning up and I'm willing to eliminate support for old code that uses bsr and doesn't preserve registers. The main problem with invoke is that it doesn't carry a specific branch destination. But anyway basic block detechtion should be ok. *But* its totally untested, which is bad. Any hints for testing it, patches, test frame work is really welcome. Figure out what you want and I'll come up with more test cases than you'd ever want. -- Dan --it's like this--- Dan Sugalski even samurai [EMAIL PROTECTED] have teddy bears and even teddy bears get drunk
[PATCH] Re: register allocation
At 9:21 AM -0700 8/7/04, Steve Fink wrote: On Aug-07, Leopold Toetsch wrote: Sean O'Rourke [EMAIL PROTECTED] wrote: [EMAIL PROTECTED] (Leopold Toetsch) writes: The interference_graph size is n_symbols * n_symbols * sizeof(a_pointer). This might already be too much. 2) There is a note in the source code that the interference graph could be done without the N x N graph array. Any hints welcome (Angel Faus!). It looks like the way things are used in the code, you can use an adjacency list instead of the current adjacency matrix for the graph. Yeah. Or a bitmap. An adjacency list would definitely be much smaller, but I'd be concerned that it would slow down searches too much. I think the bitmap might be worth a try just to see how much the size matters. Since this is an n^2 issue, splitting out the four different register types could help -- except that I'd guess that most code with excessive register usage probably uses one type of register much more than the rest. There's definitely weighting, but when you get up to the point where size matters the split may still be worth it. My Evil Program here has: P temps: 26951 S temps: 2099 I temps: 8878 Definitely skewed, but going from n^2 (27838^2) to i^2+p^2+s^2 looks like it'd make quite a difference. Anyway, I've attached a patch that uses bitmaps instead of SymReg*'s, which should give a factor of 32 size reduction. I've only tested it by doing a 'make test' and verifying that the several dozen test failures are the same before and after (I don't think things are actually that broken; I think the make system is), but for all I know it's not even calling the code. That's what you get when I only have a two hour hacking window and I've never looked at the code before. I applied this patch, though it didn't change the results on my nasty programs. The size reduction seems worth it, though. -- Dan --it's like this--- Dan Sugalski even samurai [EMAIL PROTECTED] have teddy bears and even teddy bears get drunk
Re: register allocation
Dan Sugalski [EMAIL PROTECTED] wrote: I'll dig in and see if I can throw that info out. I may also add a counter to see how many trips through the spill loop each program takes. *If* it doesn't die, you can run parrot with -v and you'll get the spill count. leo
Re: register allocation
At 5:35 PM +0200 8/7/04, Leopold Toetsch wrote: Dan Sugalski [EMAIL PROTECTED] wrote: I'll dig in and see if I can throw that info out. I may also add a counter to see how many trips through the spill loop each program takes. *If* it doesn't die, you can run parrot with -v and you'll get the spill count. Alas it does, reliably. Looks like the second loop in imc_reg_alloc is where it goes horribly wrong. -- Dan --it's like this--- Dan Sugalski even samurai [EMAIL PROTECTED] have teddy bears and even teddy bears get drunk
Re: register allocation
Leopold Toetsch wrote: 4.2) Basic blocks, life span of symbols inside these blocks and loop detection. Baisc block generation depends on two items: * labels - a possible branch target * branching opcodes Detection of branching opcodes depends currently in a not too obvious way[1] on directives in ops/*.ops like: goto NEXT() # no branch, just execute next instruction goto ADDRESS(next) # a branch Now all find_global and find_lex opcodes *did* use goto ADDRESS(next). This might be some relict of experiments with exceptions. These opcodes might branch elsewhere, if an exception had occured. But its not a branch. I've change these opcodes to use goto NEXT() and excluded PARROT_JUMP_ENEXT from considering this opcode as a branch. This fix should vastly improve PIR programs that make heavy use of find_lex or find_global opcodes (hello Dan). leo [1] see lib/Parrot/OpsFile.pm line 525ff
Re: Register allocation bug in IMCC? Or misunderstanding on my part?
Clinton A. Pierce [EMAIL PROTECTED] wrote: The following code: push_integer() not implemented in class 'PerlHash' This is, as far as I can tell, because the same register is used by IMCC for both the READDATA and RESTOREINFO locals, thus by the time that the sub _data gets around to being run, READDATA has become a PerlHash (imcc -t): PC=0; OP=821 (new_p_ic); ARGS=(P0, 14) PC=3; OP=821 (new_p_ic); ARGS=(P0, 15) Yep. Good analysis. You could tell it a bug. Imcc should follow the code path into the subroutine. But he problem is: the subroutine isn't known, when _main is compiled. So IMCC only sees one usage of the Array and one usage of the Hash amd assigns both to the same register P0. If you want really use such constructs, you can't put them in different compilation units, because they are basically one unit. Much better is to implement some kind of calling conventions. As soon as you pass the READDATA array to the subroutine, it will have its own register and the problem is gone. leo
Re: Register allocation bug in IMCC? Or misunderstanding on my part?
At 06:57 PM 6/1/2003 +0200, Leopold Toetsch wrote: Clinton A. Pierce [EMAIL PROTECTED] wrote: The following code: push_integer() not implemented in class 'PerlHash' This is, as far as I can tell, because the same register is used by IMCC for both the READDATA and RESTOREINFO locals, thus by the time that the sub _data gets around to being run, READDATA has become a PerlHash (imcc -t): PC=0; OP=821 (new_p_ic); ARGS=(P0, 14) PC=3; OP=821 (new_p_ic); ARGS=(P0, 15) Yep. Good analysis. You could tell it a bug. Imcc should follow the code path into the subroutine. But he problem is: the subroutine isn't known, when _main is compiled. So IMCC only sees one usage of the Array and one usage of the Hash amd assigns both to the same register P0. Makes sense. If you want really use such constructs, you can't put them in different compilation units, because they are basically one unit. So a BASIC IMCC program, with all of the builtin functions it needs, plus the actual user program itself as one compilation unit? All for the sake of a few needed globals? Sounds ghastly. Much better is to implement some kind of calling conventions. Except that these represent true global data structures that the entire BASIC machine needs. They're not really things to be passed around from place to place and the thought of having a Px register to hold the state of BASIC and passing it (potentially) everywhere seems kinda silly for a language that isn't really threaded in any incarnation and is ever only going to need one state. I've *got* good, sound calling conventions now but there are things (sofar, only half a dozen*) that exist in an outermost scope that need initialization. I could turn the whole thing upside-down (subs first, main last) except that I'd need a token outer main() to call the initialization routine. And from a quick test IMCC can tell that the execution path jumped down, and optimizes those registers away anyway. Nevermind. Further suggestions, given better info? * Examples of things that are really global in a BASIC machine: current random number seed, cursor column position, console current character attributes, last value picked up from a READ/DATA combo (line position), structure definition table, and a GOSUB call-stack lookback specifically for RETURN x.
Re: Register allocation bug in IMCC? Or misunderstanding on my part?
If you want really use such constructs, you can't put them in different compilation units, because they are basically one unit. So a BASIC IMCC program, with all of the builtin functions it needs, plus the actual user program itself as one compilation unit? All for the sake of a few needed globals? Sounds ghastly. No. Global things shouldn't be in registers in the first place, because you're using up your computation resources for things that have a very long lifetime. Much better is to implement some kind of calling conventions. Except that these represent true global data structures that the entire BASIC machine needs. They're not really things to be passed around from place to place and the thought of having a Px register to hold the state of BASIC and passing it (potentially) everywhere seems kinda silly for a language that isn't really threaded in any incarnation and is ever only going to need one state. If you use them often... then maybe one or two need a real register, but I'd still be weary of doing that. Use find_global and its friends. I've *got* good, sound calling conventions now but there are things (sofar, only half a dozen*) that exist in an outermost scope that need initialization. I could turn the whole thing upside-down (subs first, main last) except that I'd need a token outer main() to call the initialization routine. And from a quick test IMCC can tell that the execution path jumped down, and optimizes those registers away anyway. Nevermind. Further suggestions, given better info? * Examples of things that are really global in a BASIC machine: current random number seed, cursor column position, console current character attributes, last value picked up from a READ/DATA combo (line position), structure definition table, and a GOSUB call-stack lookback specifically for RETURN x. Yeah, most of those seem like they can be stored in the global name table instead of a register. Load them in when you need them. Except the latter, which seems like you should be using the call stack, if I'm not misunderstanding. Luke
Re: Register allocation bug in IMCC? Or misunderstanding on my part?
CAP == Clinton A Pierce [EMAIL PROTECTED] writes: CAP Yes, my friend. Continue to use quotes around the word language CAP when describing this mess. Without RETURN x, just plain old GOSUB/RETURN I could let PIR/PASM handle all by itself with no supervision. Or you handle normal subs like that and generate special code for return x? CAP I could, except that it's impossible to tell where exactly the exit CAP point of a subroutine (for GOSUB) is in a BASIC program. In fact, CAP all GOSUB destinations could use an unconditional branch to all use CAP a common RETURN point. You just return when you stumble across a CAP return. Further analysis of where the subroutines are may be CAP impossible with mechanical (or human) analysis at compile time: CAP 5 rem Valid but seriously spaghettified BASIC example. CAP 7 rem Good luck analyzing this at compile time for entry CAP 9 rem and exit points for the subroutines. CAP 10 print Hello, world! CAP 20 for i = 1 to 3 CAP 30 gosub 70 CAP 40 next CAP 50 end CAP 70 gosub mysub CAP 80 gosub 110 CAP 90 return CAP 100 return 30 CAP 110 rem alternate entry point CAP mysub: CAP print Hello, galaxy! CAP on int(rnd(2)+1) goto 90, 100 okay, i am going to jump in and try to drown here. :) why don't you manage your own basic call stack in an array or PerlArray or something? trying to map that mess of call/return poo onto a proper compiler with register allocation is going to lose us the services of leo while he recuperates at the hospital for homicidal BASIC coders. you will just have to bite the bullet and emulate your own virtual engine over parrot and not try to map BASIC directly to it. so code a VM with BASIC flow control. you can probably do something odd like make each BASIC statement into a parrot sub. then just compile your basic into a simple flow control IL that just calls parrot subs. the subs all have access to your need globals (with the obvious find_global/set_global :) a primary global now is the BASIC VM PC. so your code gen is much easier too. each statement is codegenned and you keep their label or sequence numbers in your flow tree and they map to a parrot sub address (that you generate as you codegen). later to execute you just scan your own flow tree (that is the basic interpreter engine), call each parrot sub (BASIC statement)in turn and let parrot to the real work in each statement. BASIC calls and returns are handled with munging the global BASIC PC and the global stack. feel free to tell me this is total cocky-pop. :) uri -- Uri Guttman -- [EMAIL PROTECTED] http://www.stemsystems.com --Perl Consulting, Stem Development, Systems Architecture, Design and Coding- Search or Offer Perl Jobs http://jobs.perl.org