Here's a little analysis of the disassembly.

On Mon, May 21, 2012 at 01:11:18AM -0400, Kragen Javier Sitaker wrote:
> On Wed, May 09, 2012 at 06:39:25PM +0200, Dave Long wrote:
> > Apropos the bootstrapping thread[0], here's another hex loader:
> >
> > 0000000: 31c9 bf00 03ba 8a01 b40a cd21 a18b 013c  1..........!...<
> > 0000010: 047c 17b8 0001 01c8 bb00 0202 1e8c 0189  .|..............
> > 0000020: 07be 8c01 a5a5 9041 ebdb 31c0 a320 0289  .......A..1.. ..
> > 0000030: cd31 c9be 0003 bf00 02bb 0100 31c9 31d2  .1..........1.1.
> > 0000040: b800 42cd 21b8 0040 cd21 ac31 d231 c0ac  ..B.!..@.!.1.1..
> > 0000050: 3c20 740f bb01 0101 cb29 da01 f889 c38b  < t......)......
> > 0000060: 0701 c231 c0ac 0c20 d410 d503 2c09 c0e0  ...1... ....,...
> > 0000070: 0401 c2ac 0c20 d410 d503 2c09 01c2 9090  ..... ....,.....
> > 0000080: b402 cd21 4139 e975 c1c3 5000            ...!A9.u..P.
>
> Here's the disassembly, for my benefit and for whoever else is reading this.
>
> kragen@VOSTRO9:~/devel$ objdump -m i8086 -b binary --adjust-vma=0x100 -D loader.com
>
> loader.com:     file format binary
>
>
> Disassembly of section .data:
>
> 00000100 <.data>:
>  100:  31 c9         xor    %cx,%cx
>  102:  bf 00 03      mov    $0x300,%di
>  105:  ba 8a 01      mov    $0x18a,%dx
>  108:  b4 0a         mov    $0xa,%ah
>  10a:  cd 21         int    $0x21
int 21h function 0ah: buffered input from standard input, with the buffer at
%dx, which points just past the end of the program.  Not sure what's up with
%cx and %di here.  Note that 105 here is the jump target of the instruction
at 128, so the initialization of %cx and %di is outside of an input loop.

>  10c:  a1 8b 01      mov    0x18b,%ax

The byte at 0x18b is the buffer's "number of chars actually read" field; it
lands in %al.

>  10f:  3c 04         cmp    $0x4,%al
>  111:  7c 17         jl     0x12a

If fewer than 4 chars were read, exit the loop.

>  113:  b8 00 01      mov    $0x100,%ax
>  116:  01 c8         add    %cx,%ax
>  118:  bb 00 02      mov    $0x200,%bx
>  11b:  02 1e 8c 01   add    0x18c,%bl
>  11f:  89 07         mov    %ax,(%bx)

We're computing a two-byte value here in %ax to store in memory at %bx.  %bx
is going to be 0x200 plus whatever was stored at 0x18c, which was the first
byte of input.  So we're indexing a table at 0x200 with the first byte of
input.

%ax is 0x100 plus %cx.  %cx started out as 0 before entering the loop and
gets incremented each time through the loop, and I guess probably the system
call doesn't clobber it, so it's the line number.  So this stores 0x100 plus
the current line number (or, equivalently, the output offset) in a table
entry indexed by the first byte of input.

It seems a little alarming that we're storing a two-byte line number/byte
offset in a single-byte table entry.  I suppose that's not a problem as long
as your labels are always at least two letters apart... but isn't there an
x86 addressing mode that makes that problem easier?  So you could do
`mov %ax, [0x200+2*bx]` or something, with just the input byte in bx?  As far
as I know, though, the 16-bit addressing modes have no scaling; the scaled
index only exists in the 32-bit addressing modes of the 386 and later, so
here you'd have to double %bx yourself first.  Either way, you'd probably
then want to initialize %di to 0x400 in case somebody wants to use extended
ASCII labels.

>  121:  be 8c 01      mov    $0x18c,%si
>  124:  a5            movsw  %ds:(%si),%es:(%di)
>  125:  a5            movsw  %ds:(%si),%es:(%di)

Now we append the first four bytes of input to the buffer at %di, which was
initialized to 0x300.

>  126:  90            nop
>  127:  41            inc    %cx
>  128:  eb db         jmp    0x105

Okay, so that's the end of the input loop.  From here we have straight-line
code until the output loop.
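To make the input loop easier to follow, here is a rough Python model of it. It is only a sketch of my reading of the code at 0x105-0x128, not a tested emulation: the names `first_pass` and `lookup` are made up for illustration, and the list of strings stands in for what int 21h function 0ah would return line by line.

```python
import struct


def first_pass(lines, base=0x100):
    """Model of the input loop: build the label table and the text buffer."""
    # Stands in for memory 0x200..0x300.  One spare byte is included because
    # a two-byte store at index 0xff spills one byte past the table.
    table = bytearray(0x101)
    text = bytearray()   # stands in for the buffer at 0x300 (%di)
    count = 0            # plays the role of %cx
    for line in lines:
        if len(line) < 4:                    # cmp $0x4,%al ; jl 0x12a
            break
        raw = line.encode("ascii")
        # mov %ax,(%bx): a 16-bit little-endian store at a one-byte stride
        struct.pack_into("<H", table, raw[0], base + count)
        text += raw[:4]                      # the two movsw instructions
        count += 1                           # inc %cx
    return table, bytes(text), count


def lookup(table, label):
    """Read back the 16-bit entry for a one-character label."""
    return struct.unpack_from("<H", table, ord(label))[0]
```

In this model, `first_pass(["b 90", "a 90"])` leaves the low byte of the `b` entry overwritten by the high byte of the `a` entry, which is the overlap worried about above; defining `a` before `b` does not show the problem.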
>  12a:  31 c0         xor    %ax,%ax
>  12c:  a3 20 02      mov    %ax,0x220

Wiping out the definition of the "space" label.

>  12f:  89 cd         mov    %cx,%bp

Okay, so the total output size goes into %bp.

>  131:  31 c9         xor    %cx,%cx
>  133:  be 00 03      mov    $0x300,%si

We're gonna be copying from the stored input program text?

>  136:  bf 00 02      mov    $0x200,%di

...into the symbol table?

>  139:  bb 01 00      mov    $0x1,%bx
>  13c:  31 c9         xor    %cx,%cx

That seems a little redundant.  %cx is already pretty zeroed.

>  13e:  31 d2         xor    %dx,%dx
>  140:  b8 00 42      mov    $0x4200,%ax
>  143:  cd 21         int    $0x21

42h is lseek: set current file position.  00h in %al is from the start of
the file.  1 in %bx is fd 1, stdout.  %cx:%dx = 0 is the offset from the
start of the file.  Not yet sure why this lseek is useful; isn't that where
you normally start writing the output if it's been redirected?

>  145:  b8 00 40      mov    $0x4000,%ax
>  148:  cd 21         int    $0x21

0x40 is write(), which is somewhat unexpected, since we haven't done any
decoding yet.  %cx is the number of bytes to write, which is presumably
still 0.  So this is sort of a mystery, maybe leftover code?  Or maybe I
screwed up the disassembly?
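One possible reading of the lseek plus zero-length write, offered as an assumption rather than anything established in this thread: DOS documents function 40h with %cx = 0 as truncating (or extending) the file at the current position instead of writing nothing. If that applies here, the pair at 0x140-0x148 would rewind stdout and cut it off at offset 0. A rough POSIX analogue in Python, where the temporary file and the names `rewind_and_truncate` and `demo` are purely illustrative:

```python
import os
import tempfile


def rewind_and_truncate(fd):
    """Seek to offset 0, then cut the file off at the current position."""
    os.lseek(fd, 0, os.SEEK_SET)        # like int 21h with ax=0x4200, cx:dx=0
    pos = os.lseek(fd, 0, os.SEEK_CUR)  # where we are now (0)
    os.ftruncate(fd, pos)               # stands in for ah=0x40 with cx=0
    return pos


def demo():
    """Show that stale contents do not survive the rewind-and-truncate."""
    fd, path = tempfile.mkstemp()
    try:
        os.write(fd, b"stale output")
        rewind_and_truncate(fd)
        os.write(fd, b"new")
        os.lseek(fd, 0, os.SEEK_SET)
        return os.read(fd, 100)
    finally:
        os.close(fd)
        os.remove(path)
```

Whether the loader actually needs this (a shell redirection would normally hand it an empty file already) is not something the disassembly alone settles.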
That write call is the end of the straight-line code.  The output loop
starts here; I still haven't really begun to analyze it, so perhaps
tomorrow:

>  14a:  ac            lods   %ds:(%si),%al
>  14b:  31 d2         xor    %dx,%dx
>  14d:  31 c0         xor    %ax,%ax
>  14f:  ac            lods   %ds:(%si),%al
>  150:  3c 20         cmp    $0x20,%al
>  152:  74 0f         je     0x163
>  154:  bb 01 01      mov    $0x101,%bx
>  157:  01 cb         add    %cx,%bx
>  159:  29 da         sub    %bx,%dx
>  15b:  01 f8         add    %di,%ax
>  15d:  89 c3         mov    %ax,%bx
>  15f:  8b 07         mov    (%bx),%ax
>  161:  01 c2         add    %ax,%dx
>  163:  31 c0         xor    %ax,%ax
>  165:  ac            lods   %ds:(%si),%al
>  166:  0c 20         or     $0x20,%al
>  168:  d4 10         aam    $0x10
>  16a:  d5 03         aad    $0x3
>  16c:  2c 09         sub    $0x9,%al
>  16e:  c0 e0 04      shl    $0x4,%al
>  171:  01 c2         add    %ax,%dx
>  173:  ac            lods   %ds:(%si),%al
>  174:  0c 20         or     $0x20,%al
>  176:  d4 10         aam    $0x10
>  178:  d5 03         aad    $0x3
>  17a:  2c 09         sub    $0x9,%al
>  17c:  01 c2         add    %ax,%dx
>  17e:  90            nop
>  17f:  90            nop
>  180:  b4 02         mov    $0x2,%ah
>  182:  cd 21         int    $0x21

2h is "write character (in %dl) to stdout".

>  184:  41            inc    %cx
>  185:  39 e9         cmp    %bp,%cx
>  187:  75 c1         jne    0x14a
>  189:  c3            ret
>  18a:  50            push   %ax
>         ...

Kragen
-- 
To unsubscribe: http://lists.canonical.org/mailman/listinfo/kragen-discuss