Hi Folks,

If you aren't a C++ programmer, this may not be interesting. If you are a proselytising functional programming advocate, this may only reinforce your preconceptions.

I am working on getting all of the cross assembler and cross linker functionality into my UCSD Pascal cross compiler package. This is definitely a challenging portion of the project. The sources of the native UCSD Linker say "Abandon All Hope Ye Who Enter Here", and they weren't kidding.

In order to get the cross compiler to emit linkage information for external UNIT variables, I have to jigger the code generation to (a) always produce a 2-byte Big reference (LAO, LDO, STO) and never optimise to SLDO etc, and (b) emit linkage information. The second part is the easiest. It is mangling the code generation for the special case of an unknown global variable offset that is nasty. And then, if that wasn't enough, intrinsic units can have external data segments with completely different opcodes (LAE, LDE, STE).

To explain the next bit, I have to back up to 2006, when I wrote a paper called "Compilers and Factories" for LCA2007. The central theme of that paper was to pass decisions and control of how to manipulate the expression trees to the expression tree objects.

An example: The Pascal grammar has

    statement = expression ':=' expression

This results in unhelpful "Syntax error" errors when, if like me you have been coding in C or C++ for the intervening 30 years, you write

    x = 5;

In order to get a more helpful error message, the trick is to move the error out of the grammar, and into the semantics. (This is what C does, except C is more generous). The grammar is changed to

    statement = expression

Of course, if the expression actually has a non-nothing result, it is an error. In the assignment example, above, the error message the cross compiler gives is

    statement expression is a boolean value, it should be nothing;
    did you mean to use an assignment (written ":=") instead of an
    equality test (written "=")?

This is a much more helpful error message.

But now we can have expressions on both the left and the right hand sides of an assignment.

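A minimal sketch of that semantic check, not the cross compiler's real code. The type_kind enum, the expression_equality and expression_assignment classes, and check_statement() are all hypothetical names, and std::shared_ptr stands in for whatever smart pointer the compiler really uses. The grammar accepts any expression as a statement; the check afterwards complains if the result is not nothing.

```cpp
#include <cassert>
#include <memory>
#include <string>

// Hypothetical, heavily simplified types for illustration only.
enum class type_kind { nothing, boolean, integer };

class expression {
public:
    typedef std::shared_ptr<expression> pointer;
    virtual ~expression() {}
    virtual type_kind get_type() const = 0;
};

// "x = 5" parses as an equality test, whose result is a boolean.
class expression_equality : public expression {
public:
    type_kind get_type() const override { return type_kind::boolean; }
};

// "x := 5" parses as an assignment, whose result is nothing.
class expression_assignment : public expression {
public:
    type_kind get_type() const override { return type_kind::nothing; }
};

// Semantic check for the "statement = expression" production.
// Returns an empty string when the statement is acceptable.
std::string check_statement(const expression::pointer &ep)
{
    assert(ep);
    switch (ep->get_type()) {
    case type_kind::nothing:
        return "";
    case type_kind::boolean:
        return "statement expression is a boolean value, it should be "
               "nothing; did you mean to use an assignment (written "
               "\":=\") instead of an equality test (written \"=\")?";
    default:
        // Assumed fallback wording, not taken from the real compiler.
        return "statement expression should be nothing";
    }
}
```

The parser never has to know that "=" was a mistake; it builds the equality expression as usual and the statement check explains the problem in terms the user can act on.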
How do we know which opcodes (loads or stores) to generate? Well, the cross compiler uses abstract syntax trees, rather than generating code as it is parsed... we have *much* more memory to play with than the UCSD native Pascal compiler ever did.

To produce the assignment, the compiler could grope the left hand expression, and do different code branches for global stores, array index stores, record field stores, etc. But the approach taken is different: it simply asks the left hand side to turn itself into an assignment. The yacc grammar looks like this:

    expression:
        expression ASSIGN expression
            { $$ = $1->assignment_factory($3); }

As you can see, no groping of the left hand side ($1) is required. The default implementation of the virtual assignment_factory method is to say "inappropriate assignment". Expression classes that can be assigned to override it. Thus, a simple variable load object creates a new variable store expression object, an array index load object creates a new array store expression object, etc. The same technique can be used to handle array indexing, "dot" expressions, and function and procedure calls.

Yes, but what does this have to do with variables?

The cross compiler's yacc grammar has a production like this:

    expression:
        NAME
            { $$ = name_expression_factory($1); }

This operates under the assumption that it is probably on the right hand side, and generates load expressions. This name_expression_factory did a chain of {if then else if then else...} tests to decide what to do, all involving nasty C++ down casts, which makes my skin crawl, because too often it's a bug in hiding.

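The assignment_factory idea described above can be sketched roughly as follows. This is an illustration under assumptions, not the compiler's code: the expression_constant, expression_load_local and expression_store_local classes and the describe() method are hypothetical, std::shared_ptr replaces the real pointer type, and throwing an exception is only one possible way for the default to "say" inappropriate assignment.

```cpp
#include <cassert>
#include <memory>
#include <stdexcept>
#include <string>

class expression {
public:
    typedef std::shared_ptr<expression> pointer;
    virtual ~expression() {}

    // Hypothetical helper so the example can show which node was built.
    virtual std::string describe() const = 0;

    // Default: this kind of expression cannot be assigned to.
    // (Assumption: the error is reported by throwing.)
    virtual pointer assignment_factory(const pointer &rhs) const
    {
        (void)rhs;
        throw std::runtime_error("inappropriate assignment");
    }
};

// A constant keeps the default, so "5 := x" is rejected.
class expression_constant : public expression {
    int value;
public:
    explicit expression_constant(int v) : value(v) {}
    std::string describe() const override { return std::to_string(value); }
};

// The store node that a local variable load turns itself into.
class expression_store_local : public expression {
    int offset;
    pointer rhs;
public:
    expression_store_local(int o, const pointer &r) : offset(o), rhs(r)
    {
        assert(r);
    }
    std::string describe() const override
    {
        return "store local " + std::to_string(offset) + " <- " +
            rhs->describe();
    }
};

// A load knows how to become the matching store; nobody has to ask
// it what kind of expression it is.
class expression_load_local : public expression {
    int offset;
public:
    explicit expression_load_local(int o) : offset(o) {}
    std::string describe() const override
    {
        return "load local " + std::to_string(offset);
    }
    pointer assignment_factory(const pointer &rhs) const override
    {
        return std::make_shared<expression_store_local>(offset, rhs);
    }
};
```

With these classes, the grammar action's $1->assignment_factory($3) builds a store node when $1 is a variable load and reports the error when $1 is a constant, with no type tests in the parser.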
The old name_expression_factory looked something like this:

    expression::pointer
    name_expression_factory(symbol::pointer sp)
    {
        symbol_constant::pointer scp =
            boost::dynamic_pointer_cast<symbol_constant>(sp);
        if (scp)
        {
            return scp->get_value();
        }
        symbol_variable::pointer svp =
            boost::dynamic_pointer_cast<symbol_variable>(sp);
        if (svp)
        {
            return
                expression_load_indirect::create(
                    expression_address_local::create(svp->get_offset()));
        }
        // ...etc
        // uglier than this, but you get the idea
    }

Then (four years later) it occurred to me: let the symbol create the name expression.

    expression::pointer
    name_expression_factory(symbol::pointer sp)
    {
        return sp->name_expression_factory();
    }

and move each test case into the derived symbol classes' implementations of name_expression_factory(). No down casts, either.

Yes, but what does that have to do with external linkage, or variables in DATA segments?

Let's take the external DATA segment case first: Further derive the symbol_variable class, so that we have symbol_variable_external

    expression::pointer
    symbol_variable_external::name_expression_factory()
    {
        return
            expression_load_indirect::create(
                expression_address_external::create(segnum, offset));
    }

and the "normal" case of a function's local variables

    expression::pointer
    symbol_variable_local::name_expression_factory()
    {
        return
            expression_load_indirect::create(
                expression_address_local::create(offset));
    }

etc. There is extra machinery setting up the classes (and C++ is hideously verbose in this regard) but once done, there are no more type-flakey and expensive "what is this" tests, it is more readable, and it goes faster.

Yes, but how does that address the "global variable of unknown offset in a (non-intrinsic) unit" case?

Another derived class, of course.

    expression::pointer
    symbol_variable_globref::name_expression_factory()
    {
        // the "globref" name is taken from the kind
        // of linkage information to emit.
        return
            expression_load_indirect::create(
                expression_address_globref::create(name));
        // that "name" is the variable's name, an instance
        // variable of the symbol base class
    }

Now, the LDO=>SLDO optimisations are not done by the expression_address_globref class (oh, um, did I mention that expression objects know how to optimise themselves?) because we can't know if the offset<=16. And the opcode's Big offset is always generated as two bytes, in case offset>=128, avoiding the usual optimising code path for Big offsets. The expression_address_global class, of course, continues to optimise as before, because it *does* know its offset.

So that's all. The lightning bolt was to realise that I wasn't using some techniques I was already using elsewhere in the compiler, and that inconsistency was making for painful thinking about how to solve the problem in an elegant manner. So painful that I worked on something else for a while.

That is one of my favourite aspects of open source, especially on projects I'm doing for myself: you can take the time to do it right.

Regards
Peter Miller <pmil...@opensource.org.au>
/\/\* http://miller.emu.id.au/pmiller/

PGP public key ID: 1024D/D0EDB64D
fingerprint = AD0A C5DF C426 4F03 5D53 2BDB 18D8 A4E2 D0ED B64D
See http://www.keyserver.net or any PGP keyserver for public key.

"It's my crack pipe, and I can put anything in it I want to."
        -- Erik de Castro Lopo
_______________________________________________
coders mailing list
coders@slug.org.au
http://lists.slug.org.au/listinfo/coders