On Wed, May 09, 2012 at 06:39:25PM +0200, Dave Long wrote: > (hey Kragen, how did the bytebeat performances go?)
It went well! Although the Gameboy guy before me, running LSDJ, was a really tough act to follow. I uploaded a recording to <http://canonical.org/~kragen/bytebeat/la-cigale.wav.gz>, and I uploaded the software to <https://github.com/kragen/pytebeat>. > Apropos the bootstrapping thread[0], here's another hex loader: > > 0000000: 31c9 bf00 03ba 8a01 b40a cd21 a18b 013c 1..........!...< > 0000010: 047c 17b8 0001 01c8 bb00 0202 1e8c 0189 .|.............. > 0000020: 07be 8c01 a5a5 9041 ebdb 31c0 a320 0289 .......A..1.. .. > 0000030: cd31 c9be 0003 bf00 02bb 0100 31c9 31d2 .1..........1.1. > 0000040: b800 42cd 21b8 0040 cd21 ac31 d231 c0ac ..B.!..@.!.1.1.. > 0000050: 3c20 740f bb01 0101 cb29 da01 f889 c38b < t......)...... > 0000060: 0701 c231 c0ac 0c20 d410 d503 2c09 c0e0 ...1... ....,... > 0000070: 0401 c2ac 0c20 d410 d503 2c09 01c2 9090 ..... ....,..... > 0000080: b402 cd21 4139 e975 c1c3 5000 ...!A9.u..P. Here's the disassembly, for my benefit and for whoever else is reading this. kragen@VOSTRO9:~/devel$ objdump -m i8086 -b binary --adjust-vma=0x100 -D loader.com loader.com: file format binary Disassembly of section .data: 00000100 <.data>: 100: 31 c9 xor %cx,%cx 102: bf 00 03 mov $0x300,%di 105: ba 8a 01 mov $0x18a,%dx 108: b4 0a mov $0xa,%ah 10a: cd 21 int $0x21 10c: a1 8b 01 mov 0x18b,%ax 10f: 3c 04 cmp $0x4,%al 111: 7c 17 jl 0x12a 113: b8 00 01 mov $0x100,%ax 116: 01 c8 add %cx,%ax 118: bb 00 02 mov $0x200,%bx 11b: 02 1e 8c 01 add 0x18c,%bl 11f: 89 07 mov %ax,(%bx) 121: be 8c 01 mov $0x18c,%si 124: a5 movsw %ds:(%si),%es:(%di) 125: a5 movsw %ds:(%si),%es:(%di) 126: 90 nop 127: 41 inc %cx 128: eb db jmp 0x105 12a: 31 c0 xor %ax,%ax 12c: a3 20 02 mov %ax,0x220 12f: 89 cd mov %cx,%bp 131: 31 c9 xor %cx,%cx 133: be 00 03 mov $0x300,%si 136: bf 00 02 mov $0x200,%di 139: bb 01 00 mov $0x1,%bx 13c: 31 c9 xor %cx,%cx 13e: 31 d2 xor %dx,%dx 140: b8 00 42 mov $0x4200,%ax 143: cd 21 int $0x21 145: b8 00 40 mov $0x4000,%ax 148: cd 21 int $0x21 14a: ac lods %ds:(%si),%al 14b: 31 d2 xor %dx,%dx 14d: 31 c0 xor %ax,%ax 14f: ac lods %ds:(%si),%al 150: 3c 20 cmp $0x20,%al 152: 74 0f je 0x163 154: bb 01 01 mov $0x101,%bx 157: 01 cb add %cx,%bx 159: 29 da sub %bx,%dx 15b: 01 f8 add %di,%ax 15d: 89 c3 mov %ax,%bx 15f: 8b 07 mov (%bx),%ax 161: 01 c2 add %ax,%dx 163: 31 c0 xor %ax,%ax 165: ac lods %ds:(%si),%al 166: 0c 20 or $0x20,%al 168: d4 10 aam $0x10 16a: d5 03 aad $0x3 16c: 2c 09 sub $0x9,%al 16e: c0 e0 04 shl $0x4,%al 171: 01 c2 add %ax,%dx 173: ac lods %ds:(%si),%al 174: 0c 20 or $0x20,%al 176: d4 10 aam $0x10 178: d5 03 aad $0x3 17a: 2c 09 sub $0x9,%al 17c: 01 c2 add %ax,%dx 17e: 90 nop 17f: 90 nop 180: b4 02 mov $0x2,%ah 182: cd 21 int $0x21 184: 41 inc %cx 185: 39 e9 cmp %bp,%cx 187: 75 c1 jne 0x14a 189: c3 ret 18a: 50 push %ax ... > The main advantage of keying in the extra 100 bytes is to gain > branch relocation with (mono-)symbolic labels, facilitating > alternations and repetitions, and hence paving the way for real (or > at least multiple character) symbol tables and automated handling of > other relocations[1]. Its format is fairly restrictive[2], with the > first 4 characters of each line defining an output byte: > 0: defines a label > 1: calculates a pc-relative reference to a label > 2: high nybble value > 3: low nybble value That's pretty impressive for 140 bytes! And probably a much more practical approach than hand-assembling the next stage of everything on paper before entering it into the loader. Octal might be more practical than hex, though. It does sort of assume you already have an interactive text editor, though. (Or, say, a paper tape reader, a knife, and Scotch tape.) But, if you had to write an interactive editor, maybe it would make sense to write a binary editor first, rather than a text editor? That was the approach I was taking in that undebugged 78-byte editor in the tail end of my original post. > As an example, here's a load file for a variant of fr-16: > b8 > 13 > 00 # mov ax,13 > cd > 10 # int 10 > c4 > 2f # les bp,[bx] > l aa # loop: stosb > 11 > f8 # adc ax,di > 15 > 32 > 11 # adc ax,1132 > eb > l00 # jmp loop > (or see the footnotes for the "source"[3]) Very nice example. Did you come across a variant of DOS where AX was nonzero at startup? > In principle, there's one branch too many[4] in this code; it should > be possible to run the second pass entirely in straight-line code. > On the other hand, 15 additional bytes and slight modifications to > the input format suffice to provide absolute relocations without any > additional logic... :) The global assembler label is a remarkably powerful tool. I wonder, though, if it might be a bit too powerful; perhaps the Forth approach (two separate stack-based mechanisms for backward and forward jumps, plus an updatable symbol table for calls to earlier-defined routines, giving each symbol a scope from its own definition until the next definition of the same name) is actually better. The code to handle those constructs in Stoneknifeforth, which also uses single-letter symbols, is as follows. First, the code for the calls to earlier-defined routines (and variables): : Type Four* header 6144 + + ; ( Table of definition Types: 1=code, 2=data ) : Addr Type 1024 + ; ( table of definition Addresses ) ( register a colon definition ) : Colon %flush dup Type 1 xchg ! HERE @ xchg Addr ! ; ( register a variable definition ) : Var %flush dup Type 2 xchg ! HERE @ xchg Addr ! ; ( Stores word on top of stack at HERE, praise be for non-aligned access. ) : Encode %flush HERE @ ! 4 forward ; : Lit 80 . ( push %eax ) 184 . ( load immediate %eax ) Encode ; ( literal) ( Compiles a user-defined subroutine or variable, using Type and Addr ) : &dispatch dup Type @ 1 = [ ( if it’s a subroutine...) Restack 232 . ( CALL instruction; now we have to compute the offset ) HERE @ 4 + ( find address of next instruction ) xchg Addr @ xchg - ( find offset to subroutine ) Encode Restack ; ] ( okay, and otherwise, it’s a variable... ) Addr @ addr Lit ; ("Restack" here switches between the operand stack and the return stack. As an optimization, it has a single-instruction peephole buffer so that Restack Restack is a no-op, and %flush flushes that buffer. So this code could be even simpler.) Then jumps forward: ( JZ: compile a forward conditional jump ) : JZ 133 . 192 . ( test %eax, %eax ) 88 . ( pop %eax ) 116 . HERE @ 0 . ( jz 0 ) ; ( Patch: resolve a forward conditional jump by backpatching ) ( 1 - is needed because the saved address is one byte inside the jnz instruction, but the offset should be calculated from the beginning of the next instruction ) : Patch dup %flush HERE @ xchg - 1 - xchg store ; Then jumps backward: ( jnz: compile a backward conditional jump ) : jnz 133 . 192 . 88 . 117 . H @ - 1 - . ( jnz whatever ) ; The forward and backward jumps are invoked by these cases in the Dispatch routine: dup '[ = [ pop JZ ; ] ( [ compiles a conditional jump, pushing an address for backpatching ) dup '] = [ pop Patch ; ] ( backpatches the address on the compile stack ) dup '{ = [ pop %flush HERE @ ; ] ( merely pushes an address to jump back to ) dup '} = [ pop jnz ; ] ( compiles a conditional jump to that address ) It seems to me like the extra complexity of having those jumps use a stack rather than the symbol table is almost nil (like maybe one line of code), and it means that the symbol table doesn't have to support forward references, and can also handle multiple definitions with the same name. (I really should get around to making SKF run on modern Linuxes again. There's some kind of corner it's cutting in ELF generation that causes the binaries it generates to no longer load, and the ELF loader is not forthcoming with error messages...) Barely relatedly, linking is apparently a progressively bigger bottleneck in builds of big C and even C++ projects, because you can compile each source file on a separate core, but then the linker has to deal with the entire corpus at once. It occurred to me that a mapreduce approach to linking ought to be quite parallelizable: 1. Map: Break your input files up into (virtual address, symbol name, relocation type, code snippet) tuples, where snippets of literal code are either dragged along with the last relocation before them or associated with a null symbol; 2. Sort these by symbol name; 3. Reduce: For each set of tuples for a given symbol name, resolve the "U" unresolved relocations using the defining relocation (or throw an error), producing a new set of (virtual address, code snippet) tuples (augmented, I suppose, by some metadata); 4. Sort these by virtual address; 5. Concatenate them to produce the final program. It seems like that approach ought to provide some kind of speedup for sufficiently large programs. > [1] in fact, if one adds enough evaluation to an assembler it > becomes indistinguishable from a linker ... conversely, a > sufficiently advanced linker is indistinguishable from an assembler. Indeed. And the reason that compilers are called "compilers" is that many of the early compilers had mostly the job of collecting together the appropriate set of library routines, providing the programmer a slightly higher level instruction set for input that was essentially machine code. For some decades, some people tried to promote the more accurate term "translators", but that seems to have failed by 1980. Kragen -- To unsubscribe: http://lists.canonical.org/mailman/listinfo/kragen-discuss
