On Wed, May 09, 2012 at 06:39:25PM +0200, Dave Long wrote:
> (hey Kragen, how did the bytebeat performances go?)

It went well!  Although the Gameboy guy before me, running LSDJ, was a really
tough act to follow.  I uploaded a recording to
<http://canonical.org/~kragen/bytebeat/la-cigale.wav.gz>, and I uploaded the
software to <https://github.com/kragen/pytebeat>.

> Apropos the bootstrapping thread[0], here's another hex loader:
> 
>     0000000: 31c9 bf00 03ba 8a01 b40a cd21 a18b 013c  1..........!...<
>     0000010: 047c 17b8 0001 01c8 bb00 0202 1e8c 0189  .|..............
>     0000020: 07be 8c01 a5a5 9041 ebdb 31c0 a320 0289  .......A..1.. ..
>     0000030: cd31 c9be 0003 bf00 02bb 0100 31c9 31d2  .1..........1.1.
>     0000040: b800 42cd 21b8 0040 cd21 ac31 d231 c0ac  ..B.!..@.!.1.1..
>     0000050: 3c20 740f bb01 0101 cb29 da01 f889 c38b  < t......)......
>     0000060: 0701 c231 c0ac 0c20 d410 d503 2c09 c0e0  ...1... ....,...
>     0000070: 0401 c2ac 0c20 d410 d503 2c09 01c2 9090  ..... ....,.....
>     0000080: b402 cd21 4139 e975 c1c3 5000            ...!A9.u..P.

Here's the disassembly, for my benefit and for whoever else is reading this.

    kragen@VOSTRO9:~/devel$ objdump -m i8086 -b binary --adjust-vma=0x100 -D 
loader.com

    loader.com:     file format binary


    Disassembly of section .data:

    00000100 <.data>:
     100:   31 c9                   xor    %cx,%cx
     102:   bf 00 03                mov    $0x300,%di
     105:   ba 8a 01                mov    $0x18a,%dx
     108:   b4 0a                   mov    $0xa,%ah
     10a:   cd 21                   int    $0x21
     10c:   a1 8b 01                mov    0x18b,%ax
     10f:   3c 04                   cmp    $0x4,%al
     111:   7c 17                   jl     0x12a
     113:   b8 00 01                mov    $0x100,%ax
     116:   01 c8                   add    %cx,%ax
     118:   bb 00 02                mov    $0x200,%bx
     11b:   02 1e 8c 01             add    0x18c,%bl
     11f:   89 07                   mov    %ax,(%bx)
     121:   be 8c 01                mov    $0x18c,%si
     124:   a5                      movsw  %ds:(%si),%es:(%di)
     125:   a5                      movsw  %ds:(%si),%es:(%di)
     126:   90                      nop    
     127:   41                      inc    %cx
     128:   eb db                   jmp    0x105
     12a:   31 c0                   xor    %ax,%ax
     12c:   a3 20 02                mov    %ax,0x220
     12f:   89 cd                   mov    %cx,%bp
     131:   31 c9                   xor    %cx,%cx
     133:   be 00 03                mov    $0x300,%si
     136:   bf 00 02                mov    $0x200,%di
     139:   bb 01 00                mov    $0x1,%bx
     13c:   31 c9                   xor    %cx,%cx
     13e:   31 d2                   xor    %dx,%dx
     140:   b8 00 42                mov    $0x4200,%ax
     143:   cd 21                   int    $0x21
     145:   b8 00 40                mov    $0x4000,%ax
     148:   cd 21                   int    $0x21
     14a:   ac                      lods   %ds:(%si),%al
     14b:   31 d2                   xor    %dx,%dx
     14d:   31 c0                   xor    %ax,%ax
     14f:   ac                      lods   %ds:(%si),%al
     150:   3c 20                   cmp    $0x20,%al
     152:   74 0f                   je     0x163
     154:   bb 01 01                mov    $0x101,%bx
     157:   01 cb                   add    %cx,%bx
     159:   29 da                   sub    %bx,%dx
     15b:   01 f8                   add    %di,%ax
     15d:   89 c3                   mov    %ax,%bx
     15f:   8b 07                   mov    (%bx),%ax
     161:   01 c2                   add    %ax,%dx
     163:   31 c0                   xor    %ax,%ax
     165:   ac                      lods   %ds:(%si),%al
     166:   0c 20                   or     $0x20,%al
     168:   d4 10                   aam    $0x10
     16a:   d5 03                   aad    $0x3
     16c:   2c 09                   sub    $0x9,%al
     16e:   c0 e0 04                shl    $0x4,%al
     171:   01 c2                   add    %ax,%dx
     173:   ac                      lods   %ds:(%si),%al
     174:   0c 20                   or     $0x20,%al
     176:   d4 10                   aam    $0x10
     178:   d5 03                   aad    $0x3
     17a:   2c 09                   sub    $0x9,%al
     17c:   01 c2                   add    %ax,%dx
     17e:   90                      nop    
     17f:   90                      nop    
     180:   b4 02                   mov    $0x2,%ah
     182:   cd 21                   int    $0x21
     184:   41                      inc    %cx
     185:   39 e9                   cmp    %bp,%cx
     187:   75 c1                   jne    0x14a
     189:   c3                      ret    
     18a:   50                      push   %ax
        ...

> The main advantage of keying in the extra 100 bytes is to gain
> branch relocation with (mono-)symbolic labels, facilitating
> alternations and repetitions, and hence paving the way for real (or
> at least multiple character) symbol tables and automated handling of
> other relocations[1].  Its format is fairly restrictive[2], with the
> first 4 characters of each line defining an output byte:
>       0:      defines a label
>       1:      calculates a pc-relative reference to a label
>       2:      high nybble value
>       3:      low nybble value

That's pretty impressive for 140 bytes!  And probably a much more practical
approach than hand-assembling the next stage of everything on paper before
entering it into the loader.  Octal might be more practical than hex, though.

It does sort of assume you already have an interactive text editor, though.
(Or, say, a paper tape reader, a knife, and Scotch tape.)  But, if you had to
write an interactive editor, maybe it would make sense to write a binary editor
first, rather than a text editor?  That was the approach I was taking in that
undebugged 78-byte editor in the tail end of my original post.

> As an example, here's a load file for a variant of fr-16:
>   b8
>   13
>   00    # mov ax,13
>   cd
>   10    # int 10
>   c4
>   2f    # les bp,[bx]
> l aa    # loop: stosb
>   11
>   f8    # adc ax,di
>   15
>   32
>   11    # adc ax,1132
>   eb
>  l00    # jmp loop
> (or see the footnotes for the "source"[3])

Very nice example.  Did you come across a variant of DOS where AX was nonzero
at startup?

> In principle, there's one branch too many[4] in this code; it should
> be possible to run the second pass entirely in straight-line code.
> On the other hand, 15 additional bytes and slight modifications to
> the input format suffice to provide absolute relocations without any
> additional logic...

:)

The global assembler label is a remarkably powerful tool.  I wonder, though, if
it might be a bit too powerful; perhaps the Forth approach (two separate
stack-based mechanisms for backward and forward jumps, plus an updatable symbol
table for calls to earlier-defined routines, giving each symbol a scope from
its own definition until the next definition of the same name) is actually
better.

The code to handle those constructs in Stoneknifeforth, which also uses
single-letter symbols, is as follows.  First, the code for the calls to
earlier-defined routines (and variables):

    : Type Four* header 6144 + + ; ( Table of definition Types: 1=code, 2=data )
    : Addr Type 1024 + ;     ( table of definition Addresses )
    ( register a colon definition )
    : Colon %flush dup Type 1 xchg !  HERE @ xchg Addr ! ;
    ( register a variable definition )
    : Var   %flush dup Type 2 xchg !  HERE @ xchg Addr ! ;
    ( Stores word on top of stack at HERE, praise be for non-aligned access. )
    : Encode %flush HERE @ !  4 forward ;
    : Lit 80 . ( push %eax ) 184 . ( load immediate %eax ) Encode ; ( literal)
    ( Compiles a user-defined subroutine or variable, using Type and Addr )
    : &dispatch dup Type @ 1 = [ ( if it’s a subroutine...)
        Restack 232 . ( CALL instruction; now we have to compute the offset )
        HERE @ 4 + ( find address of next instruction )
        xchg Addr @  xchg - ( find offset to subroutine )
        Encode
        Restack ; ] ( okay, and otherwise, it’s a variable... )
      Addr @ addr Lit ;

("Restack" here switches between the operand stack and the return stack.  As an
optimization, it has a single-instruction peephole buffer so that Restack
Restack is a no-op, and %flush flushes that buffer.  So this code could be even
simpler.)

Then jumps forward:

    ( JZ: compile a forward conditional jump )
    : JZ 133 . 192 . ( test %eax, %eax ) 88 . ( pop %eax ) 
         116 . HERE @ 0 . ( jz 0 ) ;
    ( Patch: resolve a forward conditional jump by backpatching )
    ( 1 - is needed because the saved address is one byte inside the jnz 
      instruction, but the offset should be calculated from the beginning of 
the 
      next instruction )
    : Patch  dup %flush HERE @ xchg - 1 -  xchg store ;  

Then jumps backward:

    ( jnz: compile a backward conditional jump )
    : jnz 133 . 192 . 88 . 117 . H @ - 1 - . ( jnz whatever ) ;

The forward and backward jumps are invoked by these cases in the Dispatch
routine:

      dup '[ = [ pop JZ ; ]                 ( [ compiles a conditional jump, 
                                              pushing an address for 
backpatching )
      dup '] = [ pop Patch ; ]     ( backpatches the address on the compile 
stack )
      dup '{ = [ pop %flush HERE @ ; ] ( merely pushes an address to jump back 
to )
      dup '} = [ pop jnz ; ]        ( compiles a conditional jump to that 
address )

It seems to me like the extra complexity of having those jumps use a stack
rather than the symbol table is almost nil (like maybe one line of code), and
it means that the symbol table doesn't have to support forward references, and
can also handle multiple definitions with the same name.

(I really should get around to making SKF run on modern Linuxes again.  There's
some kind of corner it's cutting in ELF generation that causes the binaries it
generates to no longer load, and the ELF loader is not forthcoming with error
messages...)

Barely relatedly, linking is apparently a progressively bigger bottleneck in
builds of big C and even C++ projects, because you can compile each source file
on a separate core, but then the linker has to deal with the entire corpus at
once.  It occurred to me that a mapreduce approach to linking ought to be quite
parallelizable:

1. Map: Break your input files up into (virtual address, symbol name,
relocation type, code snippet) tuples, where snippets of literal code are
either dragged along with the last relocation before them or associated with a
null symbol;
2. Sort these by symbol name;
3. Reduce: For each set of tuples for a given symbol name, resolve the
"U" unresolved relocations using the defining relocation (or throw an error),
producing a new set of (virtual address, code snippet) tuples (augmented, I
suppose, by some metadata);
4. Sort these by virtual address;
5. Concatenate them to produce the final program.

It seems like that approach ought to provide some kind of speedup for
sufficiently large programs.

> [1] in fact, if one adds enough evaluation to an assembler it
> becomes indistinguishable from a linker ... conversely, a
> sufficiently advanced linker is indistinguishable from an assembler.

Indeed.  And the reason that compilers are called "compilers" is that many of
the early compilers had mostly the job of collecting together the appropriate
set of library routines, providing the programmer a slightly higher level
instruction set for input that was essentially machine code.  For some decades,
some people tried to promote the more accurate term "translators", but that
seems to have failed by 1980.

Kragen
-- 
To unsubscribe: http://lists.canonical.org/mailman/listinfo/kragen-discuss

Reply via email to