Parrot Byte Code Format 2: Electric Boogaloo

Bryan C. Warnock Fri, 12 Oct 2001 01:29:10 -0700

It's still early, and we have a workable model for our immediate needs.
But this was weighing on my mind, so I thought I'd put some feelers out
there while I was pondering it.  I'll file any notes away for later.


Goal #1: Native efficiency for native platform.  As long as the file stays 
within 4 GB, segments and constants are lookups and everything is 
byte-oriented and aligned properly.  Files larger than 4 GB can still be 
handled.  MMappable.

Goal #2: Cross-compilation.  Produce PBC that will be native to a non-native 
platform.

Goal #3: Unintentional portability.  The ability to read native bytecode on 
a non-native platform.

Goal #4: Multiple delivery mechanisms.  From disk or over the net.  Can it 
be read and streamed efficiently?

Goal #5: Upgradable.  Most format changes can take place without requiring 
recompilation of the unit.

HEADER

  Magic Word | Bit Alignment
    U8   = 0x01
    U8   = 0x31
    U8   = 0x55
    U8   = 0xa1
  
  PBC File Info  
    U8   = Parrot Byte Code (PBC) format version
    U8   = enumerated byte alignment [1]
    U8   = Parrot Code Word (PCW) size
    U8   = PCW pad size [2]
    
  Table of Contents
    U32  = Offset of METADATA segment [3][4]
    U32  = Offset of FIXUP segment    [3][4]
    U32  = Offset of CONSTANT segment [3][4]
    U32  = Offset of CODE segment     [3][4]
    U32  = Offset of SOURCE segment   [3][4]
    U32  = Offset of EOF              [3][4]
    
METADATA

    U32  = Size of METADATA segment      [5]
    U8*  = user-defined metadata         [6]
    U8*  = Segment alignment pad         [7]
    
FIXUP

    U32  = Size of FIXUP segment         [5]
    <structure TBD>
    U8*  = Segment alignment pad         [7]
    
CONSTANT

    U32  = Number of integer constants
    U32* = integer constants
    U32  = Number of floating constants  [8]
    TBD* = floating constants            [8]
    U32  = Number of string constants
    U32* = Offset of STRINGS             [9]
    U8** = STRING constants              [10]
    U8*  = Segment alignment pad         [7]
    
CODE

    U32  = Size of CODE segment          [11]
    U32* = PBC
    U8*  = Segment alignment pad         [7]
    
SOURCE

    U32  = Size of SOURCE segment        [5]
    U8*  = Source                        [12]
    U8*  = Segment alignment pad         [7]
    
EOF                                      
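
As a rough sketch of how a reader might decode the HEADER laid out above:
the Python below checks the magic bytes, reads the four file-info bytes, and
then reads the six table-of-contents entries, each stored as PCW pad bytes
followed by one PCW.  The function name, the returned dictionary, and the
byteorder parameter are all assumptions made for illustration; in particular
the post does not define the byte alignment enumeration, so the byte order
is passed in by hand here.

```python
import struct

MAGIC = bytes([0x01, 0x31, 0x55, 0xA1])
TOC_NAMES = ("METADATA", "FIXUP", "CONSTANT", "CODE", "SOURCE", "EOF")


def parse_header(buf, byteorder="little"):
    """Decode the HEADER segment from the start of buf.

    byteorder is an assumption: the enumerated byte alignment field is
    returned untouched because its values are not defined in the post.
    """
    if len(buf) < 8 or buf[:4] != MAGIC:
        raise ValueError("not a PBC file (bad magic or truncated header)")

    # PBC File Info: four single-byte fields.
    version, alignment, pcw_size, pad_size = struct.unpack_from("4B", buf, 4)

    stride = pad_size + pcw_size          # bytes occupied by one padded PCW
    header_size = 8 + len(TOC_NAMES) * stride
    if len(buf) < header_size:
        raise ValueError("truncated table of contents")

    toc = {}
    pos = 8
    for name in TOC_NAMES:
        # The pad bytes come before the PCW, so skip them.
        word = buf[pos + pad_size:pos + stride]
        toc[name] = int.from_bytes(word, byteorder)
        pos += stride

    return {
        "version": version,
        "alignment": alignment,
        "pcw_size": pcw_size,
        "pad_size": pad_size,
        "header_size": header_size,
        "toc": toc,
    }
```

With a 4/0 layout the header comes to 8 + 6 * 4 = 32 bytes; with 4/4 it
comes to 8 + 6 * 8 = 56 bytes.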


U  = unsigned
8  = byte
32 = 4 bytes.  All entries marked as U32 are values stored in
     PCW + PCW pad bytes.
*  = list of values

All sections exist, and are a minimum of one PCW in length.

[1] Are there enough non-standard byte configurations that an
enumerated field wouldn't be sufficient?  

[2] The size in bytes padded before each PCW, most likely because of
potential alignment issues.  How does this affect the PCW itself on
different platforms, since the values themselves are fixed at 32-bit?
Below, each byte in a U32 value is expressed as 0x32, and each byte
within the PCW pad is expressed as 0xFF (to differentiate it from 
the natural padding within a larger word, 0x00, which would be its
actual value).  The examples are normalized to a common endianness.
    
  32 bit PCW at 4 byte alignment (4/0)
  
    0x32323232 0x32323232 0x32323232 0x32323232
    
  32 bit PCW at 8 byte alignment (4/4)
  
    0xFFFFFFFF 0x32323232 0xFFFFFFFF 0x32323232

  64 bit PCW at 4 byte alignment (8/0)
  
    0x00000000 0x32323232 0x00000000 0x32323232

  64 bit PCW at 8 byte alignment (8/0)
        
    0x00000000 0x32323232 0x00000000 0x32323232
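
The layouts above can be reproduced with a small sketch.  The helper name
encode_u32 and its parameters are hypothetical, and big-endian order is
used only so the output matches the normalized examples.  The pad byte
defaults to 0x00 (its actual value per this note); passing 0xFF gives the
display convention used in the examples.

```python
def encode_u32(value, pcw_size, pad_size, byteorder="big", pad_byte=b"\x00"):
    """Return one U32 as it would sit in the file: pad bytes, then the PCW."""
    if not 0 <= value <= 0xFFFFFFFF:
        raise ValueError("value does not fit in a U32")
    # The pad precedes the PCW; a PCW wider than 4 bytes is zero-extended.
    return pad_byte * pad_size + value.to_bytes(pcw_size, byteorder)


# 32 bit PCW at 8 byte alignment (4/4), shown with the 0xFF display pad.
print(encode_u32(0x32323232, 4, 4, pad_byte=b"\xff").hex())
# 64 bit PCW (8/0): the upper four bytes are the word's natural zero padding.
print(encode_u32(0x32323232, 8, 0).hex())
```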
  
[3] The offset is relative to the *end* of the HEADER segment.  The offsets
are in terms of (padded) PCWs.

[4] MAX_U32 is reserved for ERANGE.  In the event of ERANGE, the
segments can only be found by walking the segments serially, computing
the next offset via the size fields.
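
A minimal sketch of how notes [3] and [4] might combine when turning a
table-of-contents entry into a byte position.  The function name and its
arguments are assumptions for illustration; header_size is the length of
the HEADER segment in bytes.

```python
MAX_U32 = 0xFFFFFFFF  # reserved to signal ERANGE


def segment_byte_position(offset, header_size, pcw_size, pad_size):
    """Convert a TOC offset (in padded PCWs) to an absolute byte position.

    Returns None for ERANGE, in which case the caller has to walk the
    segments serially using their size fields.
    """
    if offset == MAX_U32:
        return None
    # Offsets count padded PCWs from the end of the HEADER segment.
    return header_size + offset * (pcw_size + pad_size)
```

For example, an offset of 6 in a 4/0 file with a 32-byte header lands at
byte 32 + 6 * 4 = 56.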

[5] As measured in bytes.  The size does not include the size field
itself, nor the segment padding bytes that occur at the end of the segment.  

[6] Metadata is limited to MAX_U32 bytes.

[7] Segments are aligned to 8 bytes.  The number of segment alignment bytes
is computed, and is not included in the size fields.
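
Since the pad length is computed rather than stored, a reader could derive
it from the byte position where a segment's content ends.  The sketch below
assumes that position is measured from a point that is itself 8-byte
aligned; the function name is hypothetical.

```python
SEGMENT_ALIGNMENT = 8


def segment_pad(end_position):
    """Number of pad bytes needed to reach the next 8-byte boundary."""
    # Python's % always returns a non-negative result for a positive modulus.
    return -end_position % SEGMENT_ALIGNMENT
```

A segment whose content ends at byte 13 would be followed by 3 pad bytes;
one ending at byte 16 needs none.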

[8] Assuming we derive a portable floating point format.  If that format
should be a STRING, floating point constants will reside in the string
constants section, and the floating point section will be removed.

[9] Offsets from the beginning of the first STRING constant.  The offset at
string_index[number of strings] points to the first byte after the last
string.
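
The extra trailing offset means every string's bounds come from two
adjacent table entries, with no scanning.  A minimal sketch, assuming the
offset table has already been decoded into a list of integers and the
STRING data into a bytes object (both names are hypothetical):

```python
def get_string(index, offsets, data):
    """Return string constant number index.

    offsets holds (number of strings + 1) entries, measured from the
    start of the first STRING constant; data is the raw STRING bytes.
    """
    if not 0 <= index < len(offsets) - 1:
        raise IndexError("no such string constant")
    return data[offsets[index]:offsets[index + 1]]


# Three strings ("foo", "", "quux") packed back to back.
data = b"fooquux"
offsets = [0, 3, 3, 7]
print(get_string(2, offsets, data))
```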

[10] The STRING constant data must fit entirely within MAX_U32 - 1 bytes.
(This is an arbitrary limitation, simply to remove the possibility of having 
to linearly scan each STRING to determine where the next one begins.)

[11] As measured in (padded) PCWs.  The size does not include the size field
itself, nor the segment padding bytes that occur at the end of the segment.

[12] Currently, as one STRING.  It could potentially be a list (similar to
STRING constants) of source code lines, or of file offsets of multiple
files within a larger compilation unit.

-- 
Bryan C. Warnock