On Sun, Nov 26, 2006 at 08:30:32PM -0800, Allison Randal wrote: > I had to poke into the guts of HLLCompiler, the new PAST, and the new > POST a fair bit in the process of getting Punie to work with them, so my > comments here are a mixture of user experience and implementation > details. I've grouped my comments into general categories.
Excellent. Just as a general overall response -- PAST-pm is by no means "finished", so many of the items that seem to be missing are simply cases of "I haven't gotten to them yet so they aren't implemented yet." > Available node types: > > - There's no PAST::Stmt node type? I only see PAST::Stmts and PAST::Op. > But statements are composed of multiple ops. So, everything is an op? At present there's no PAST::Stmt node type, but one can be easily added. I thought about putting one in based on Punie's use of PAST::Stmt, but I hadn't quite figured out exactly _why_ it's important so I thought I'd leave it out until I actually needed it somewhere. In many ways ops are already composed of multiple ops, so a statement can be considered just another op. (But I do see why someone would want a PAST::Stmt abstraction -- on the other hand, I didn't see how it changed the resulting POST/PIR output.) > - There's no PAST::Label node type? How do you represent labels in the > HLL source? I just haven't gotten to this part yet. > - Is there no way to indicate what type of variable a PAST::Var is? > Scalar/Array/Hash? (high-level types, not low-level types) Sure, that's what 'vtype' is -- it indicates the type of value that the variable ought to hold. My plan has been to follow the Perl6 concept of "implementation types" and "value types" within PAST. Thus far I've only put in the support for the value types, as the "vtype" attribute (and vtype can be any high-level type the language happens to support). I'm expecting to add an "itype" attribute at some point when we're a bit farther along; I'm still working out the details. > --- > Meaningful naming: (Be kind to your compiler writers.) I totally agree, and I'm not yet wedded to any particular naming scheme. > - In the PAST nodes, I grok 'name' as the operator/function name of a > PAST::Op and as the HLL variable name of a PAST::Var, but making it the > value of a PAST::Val is going to far. It was 'value' in the old PAST, > which makes more sense. You're passing named parameters into 'init', so > I can't see a reason not to use a more meaningful name for the attribute. I don't have a problem with switching it to 'value', I went with 'name' primarily because every PAST::Node has a name and so it just made sense to use it there. But let me make another weak argument in favor of 'name'. If a HLL programmer writes $a = 1.23456789E6; then the rhs becomes a PAST::Val node. How should we represent the value? The parse-to-past translation could evaluate the contents of "1.23456789E1" and store the result in 'value' as (.Float) 12.3456789, but unfortunately when convert that .Float back into a string for use as PIR code it comes out as "12.3457" -- i.e., the code looks like: $P0 = new .Float $P0 = 12.3457 set_global '$a', $P0 I decided that in a number of cases like this, what we really want to retain in PAST::Val is a precise string representation of the value that goes in the resulting output, and not a native representation that may lose precision in translation through POST/PIR. So, what we're really storing is the value's "name" and not its "value". (I did say it was a weak argument.) Anyway, we can switch to 'value' if that's ultimately better; I was just thinking that 'name' might be equally appropriate. > - In PAST nodes, the attribute 'ctype' isn't actually storing a C > language type. Better name? It really stands for "constant type", and is one of 'i', 'n', or 's' depending on whether it can be treated as an int, num, or string when being handled as a constant in PIR. > - The attribute 'vtype' is both variable type in POST::Var and value > type in POST::Val. Handy generalization, but it's not clear from the > name that 'vtype' is either of those things. I think you meant PAST::Var/PAST::Val here, as there isn't a POST::Var or POST::Val. But 'vtype' really stands for "value type" in both cases -- it's the type of value returned by either a PAST::Var or PAST::Val node. > - The values for both 'ctype' and 'vtype' are obscure. Better to > establish a general system for representing types, than to include raw > Parrot types or 1-letter codes in the AST. Ultimately I expect that the types that appear in 'vtype' will be the types defined by the HLL itself. For example, in perl6 one would see 'vtype'=>'Str' to indicate a Perl 6 string constant. Unfortunately it's been difficult to illustrate this in real code because of the HLL classname conflicts that I've been reporting in other contexts. I agree the values and name for 'ctype' are a bit obscure, and will gladly accept any suggestions for improving it. The 'ctype' attribute is really just code optimization in the final output, and it does assume some knowledge of the target. If no ctype is specified, past-pm assumes that the constant value must first be placed into a PMC in order to be useful. With a ctype present, then past-pm can match up the (PIR) opcode contexts in which the value can be directly used as an int/num/string in an operation. It's the difference between # $b + 2 # $b + 2 get_global $P0, '$b' get_global $P0, '$b' new $P2, .Undef new $P1, .Integer add $P2, $P0, 2 assign $P1, 2 new $P2, .Undef add $P2, $P0, $P1 or # say 3, 4, 5 # say 3, 4, 5 "say"(3, 4, 5) new $P1, .Integer assign $P1, 3 new $P2, .Integer assign $P2, 4 new $P3, .Integer assign $P3, 5 "say"($P1, $P2, $P3) > - In PAST nodes, consider the audience when choosing attribute names > like 'ismy' (PAST::Var). Something like 'islexical' or 'isdeclaration' > (I'm not sure which you mean), is friendlier to non-Perl users, and > actually clearer even for Perl users. "Ismy" means "isdeclaration" here, and I can go ahead and change it. > - In PAST nodes again, I'm not clear on what 'pirop' (PAST::Op) > represents. Is it the literal name of a PIR opcode, or a generic > representation of standard low-level operations? I'm more in favor of > the latter. Better still, give compiler-writers a standard format lookup > table they can write to allow the PAST to POST tranformation to select > the right PIR operation from the HLL op name. (See the comments on > boundaries of abstraction.) I think past-pm already has exactly what you want here, but it may not be entirely clear. First, 'pirop' does exactly what you request in 'Better still, ...' -- it provides a way for the compiler writer to identify the right PIR operation from the HLL op name. In particular, in the operator-precedence specification a compiler writer writes: proto infix:+ is pirop('add') { ... } proto infix:- is pirop('sub') { ... } and this provides an easy way for the parse-to-past transformation to associate the correct PIR operation from the HLL op name. Essentially, the transformation looks for a 'pirop' trait on the operator, and if found it puts it in the 'pirop' attribute of the corresponding PAST::Op node. The values of 'pirop' are really generic representations of standard low-level operators. Unfortunately, PIR is not as regular as we might like it to be -- some PIR operations will work only with pmc operands, some will work with a variety of int/num/string/pmc operands, and still others won't work with pmcs at all. So, POST.pir has a lookup table (%pirtable) that takes the generic name given by 'pirop' and does any necessary coercions to get the operands to match. So far this table is incomplete -- I've been adding entries only as I need them. The idea is that a compiler writer can use 'pirop' to specify the mapping of HLL operators into PIR opcodes directly in the grammar files where the HLL operators are being defined. Furthermore, the compiler writer doesn't have to keep track of the low-level details for each PIR opcode; i.e., when specifying 'pirop'=>'concat' the past-pm code generation knows that concat needs string oprands. (However, if a compiler writer needs a specific PIR opcode, then they could specify it with something like 'pirop'=>'concat_p_sc'.) > - In PAST nodes, the 'clone' method is now 'init'.'clone' was a terrible > name, I agree, but 'init' isn't quite right either. Currently Parrot uses '__init' as the method for initializing new objects, thus I think 'init' is at least consistent with Parrot. > - In PAST nodes, the 'add_child' method is now 'push'. I liked > 'add_child' better, but, maybe what we really want is not a method at > all, but a :vtable entry for an array push? Seems likely, since there's > really not any other array-like behavior the syntax-tree nodes need to have. I went with 'push' because I know I'm also going to want an 'unshift' operation, and those seemed more descriptive than the generic 'add_child'. They also correspond well to the Parrot ops. I've also thought about doing 'push' as a :vtable entry, and we can still easily do that, but there are at least two items in favor of keeping a method-based approach: (1) :vtable in subclassed items still has some issues to be addressed (e.g., RT #40626), and (2) when we get a high-level transformation language into TGE, it's very likely that the operations on nodes will be method-based and not opcode-based. > --- > > Clear boundaries between components: (Fuzzy boundaries of abstraction > make it difficult to allow for other implementations of the AST/OST or > customization of the compiler object.) > > - The 'compile' method doesn't belong in the PAST object, it belongs in > HLLCompiler. > ... After a lot of thought and false starts, I ended up taking a different approach to compilation than the "HLLCompiler specifies the complete sequence of transformations". Essentially I've taken the approach that a "compiler" is simply something that transforms a source data structure into a target data structure, and so what we really have is a sequence of "compilers". To this end, I really wanted to call my compiler base class 'Compiler' and not 'HLLCompiler', but unfortunately that classname is already used by Parrot for something else and so 'HLLCompiler' is what I chose until that could be resolved. The 'HLL' probably implies more than I intended to imply. So, the 'Abc' compiler really is just something that converts the 'bc' language into a PAST structure, after doing that it simply hands the result off to the 'PAST-pm' compiler. Similarly, the 'PAST' compiler translates into POST and hands the result off to the POST compiler, and POST simply does its thing and returns a PIR or executable result. > - The 'compile' method also doesn't belong in the main compiler > executable, it belongs in HLLCompiler. > - Merge them into one 'compile' method in HLLCompiler. > - Customization of HLLCompiler should be handled by creating a subclass > of HLLCompiler. (The current 'register' strategy is somewhat fragile.) I don't have any problem with having each language subclass HLLCompiler and override the 'compile' method in each, I'll work on that soon. Of course, the method still ends up one way or another in the main compiler executable, it may simply change the namespace. > - Provide an 'init' method for HLLCompiler that lets the compiler writer > set which modules HLLCompiler will use for each stage of compilation. > This will cover the majority of compilers without requiring each > compiler writer to define their own 'compile' routine. Because of the multi-stage approach I've taken, the compile routines are already fairly short, and to me they're not at all onerous for a compiler writer to create. For each of languages/abc/, languages/APL/, and languages/perl6/ the 'compile' method is less than 30 lines of PIR. (And it will only require a couple of lines of code to abstract the existing call to 'compile' methods of PAST/POST to instead use PAST/POST compilers.) I also think that many compilers may end up with compiler-specific option flags or other items that need to be taken care of, and it seems to me that this is more easily handled by a method definition than a module specification. > - It would be easier to maintain (and create) the list of HLL to PIR > operator associations in something like a YAML file than embedded in the > parser grammar file. [...] Hmm. My feeling was that it was easier to put the operator associations in the parser grammar file, but I can see the value of placing them somewhere else, and I definitely would like to keep Parrot-isms out of the AST as much as possible. OTOH, there are many times when for optimization reasons or other items it's useful to be able to drop some Parrot hints directly into the AST (e.g., the 'ctype' attribute above), and so I think that as long as full program semantics are captured in the AST without any Parrot-specific items, it's okay to have Parrot-specific items available in the AST as compiler hints simply because it's sometimes easier to place them there than elsewhere. > At the very least, the 'pirop' property on parser rules could be handled > by the PAST-to-POST transformation, so the compiler writer doesn't have > to manually pull those values out of the parser grammar's optable when > creating the AST. Agreed -- I'll work on this. > (If the parser grammar module was specified in > HLLCompiler's 'init', then the compiler object would know where to look > for the optable.) I'm thinking this is really a parameter to the AST compiler, along with some useful support routines to make it easy to grab this information from a YAML file or other source. (In fact, it would be easy to supply a utility routine to construct the HLL operator -> pirop translation from the optable, so that compiler writers that wanted to specify the pirop in the parser grammar could easily do so and still not have to manually pull the values across.) > --- > Refactor into smaller units: (Easier to test, easier to maintain, easier > to debug, and easier to create subclasses.) > > - In HLLCompiler, instead of a monolithic 'command_line' method, split > the components into independent methods all called from the > 'command_line' method. Agreed; HLLCompiler was thrown together quickly, and it was done before it was even clear that we would have a standard compiler object. > - In HLLCompiler, split the 'compile' method out into independent > methods for each compilation stage ('compile_ast', 'compile_ost', > 'compile_pir', etc.), all called from 'compile'. Again, I tend to think of this as being all separate compilers, each of which automatically call its default next stage until compiler options tell it to do otherwise. > --- > Useful error detection: (Be kind to your compiler writers.) > > - Returning the source code string from the 'compile' method in > HLLCompiler when no compiler is registered isn't helpful. Throw an > exception or give an error. Give the compiler writer a clue to what's wrong. I just threw in a quick default, and yes, throwing an exception would be a better default. Sorry. > - Provide distinct errors/exceptions for failures at each stage of > compilation to make it easy to figure out which stage is failing. Agreed -- however, exception handling in Parrot still needs implementation and better flushing out (this is what prompted my question about the status of exception handling implementation in last week's #parrotsketch, and my comment that I'm likely to need them fairly soon.) > --- > Side comments: > > - In PGE grammars, what is the "{ ... }" at the end of every proto > declaration supposed to do? It seems like a dead weight. Is that where > the Perl 6 code defining the operator supposed to go? Can we allow > 'proto' declarations to end in a semi-colon when there is no Perl 6 code? PGE grammars are trying to closely follow the Perl 6 syntax, so the "{ ... }" is the Perl6 "yada-yada-yada" body that gets used in function prototypes. And yes, I could see some compilers choosing to specify the code directly in the declaration (which is why 'proto' can also be 'sub' or the other Perl 6 subroutine keywords). But in the end, I didn't allow simple semicolon terminators simply because it wasn't valid Perl 6 syntax, and in many cases I think that having subtle differences isn't ideal as people may get confused about what is allowed where. But I don't have a large objection to modifying the PGE::Grammar compiler to represent empty declarations with semicolons as well as yada-yada-yada blocks. Pm