It's still early yet, and we have a workable model for our immediate needs.
But this was weighing on my mind, so I thought I'd put some feelers out
there while I was pondering it. I'll file any notes away for later.
Goal #1: Native efficiency for native platform. As long as the file stays
within 4 GB, segments and constants are lookups and everything is
byte-oriented and aligned properly. Files larger than 4 GB can still be
handled. MMappable.
Goal #2: Cross-compilation. Produce PBC that will be native to a non-native
platform.
Goal #3: Unintentional portability. The ability to read native bytecode on
a non-native platform.
Goal #4: Multiple delivery mechanisms. From disk or over the net. Can it
be read and streamed efficiently?
Goal #5: Upgradable. Most format changes can take place without requiring
recompilation of the unit.
HEADER
    Magic Word | Bit Alignment
        U8   = 0x01
        U8   = 0x31
        U8   = 0x55
        U8   = 0xa1
    PBC File Info
        U8   = Parrot Byte Code (PBC) format version
        U8   = enumerated byte alignment [1]
        U8   = Parrot Code Word (PCW) size
        U8   = PCW pad size [2]
    Table of Contents
        U32  = Offset of METADATA segment [3][4]
        U32  = Offset of FIXUP segment [3][4]
        U32  = Offset of CONSTANT segment [3][4]
        U32  = Offset of CODE segment [3][4]
        U32  = Offset of SOURCE segment [3][4]
        U32  = Offset of EOF [3][4]
METADATA
    U32  = Size of METADATA segment [5]
    U8*  = user-defined metadata [6]
    U8*  = Segment alignment pad [7]
FIXUP
    U32  = Size of FIXUP segment [5]
    <structure TBD>
    U8*  = Segment alignment pad [7]
CONSTANT
    U32  = Number of integer constants
    U32* = integer constants
    U32  = Number of floating constants [8]
    TBD* = floating constants [8]
    U32  = Number of string constants
    U32* = Offset of STRINGS [9]
    U8** = STRING constants [10]
    U8*  = Segment alignment pad [7]
CODE
    U32  = Size of CODE segment [11]
    U32* = PBC
    U8*  = Segment alignment pad [7]
SOURCE
    U32  = Size of SOURCE segment [5]
    U8*  = Source [12]
    U8*  = Segment alignment pad [7]
EOF
U = unsigned
8 = byte
32 = 4 bytes. All entries marked as U32 are values stored in
PCW + PCW pad bytes.
* = list of values
All sections exist, and are a minimum of one PCW in length.
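To make the header layout concrete, here is a rough Python sketch of a reader for it. It is only an illustration under assumptions that are not part of the proposal: the name parse_header is made up, the byte order is passed in by the caller (the byte alignment enumeration isn't defined yet), and each table-of-contents entry is assumed to occupy PCW pad bytes followed by one PCW.

```python
import struct

MAGIC = bytes([0x01, 0x31, 0x55, 0xA1])
TOC_NAMES = ("METADATA", "FIXUP", "CONSTANT", "CODE", "SOURCE", "EOF")


def parse_header(buf, byteorder="little"):
    """Parse the HEADER segment at the start of buf (hypothetical helper)."""
    if bytes(buf[:4]) != MAGIC:
        raise ValueError("bad magic word")
    # The four PBC File Info fields are plain U8 values and are never padded.
    version, alignment, pcw_size, pcw_pad = struct.unpack_from("4B", buf, 4)
    stride = pcw_pad + pcw_size  # bytes occupied by one padded PCW
    if len(buf) < 8 + len(TOC_NAMES) * stride:
        raise ValueError("truncated header")
    toc = {}
    pos = 8
    for name in TOC_NAMES:
        # Skip the pad bytes, then read the PCW holding the U32 offset.
        toc[name] = int.from_bytes(buf[pos + pcw_pad:pos + stride], byteorder)
        pos += stride
    return {
        "version": version,
        "alignment": alignment,
        "pcw_size": pcw_size,
        "pcw_pad": pcw_pad,
        "toc": toc,
        "header_size": pos,  # byte offset of the first segment
    }
```

With a 4-byte PCW and no pad, the header comes to 8 + 6 * 4 = 32 bytes; with a 4-byte pad it grows to 56 bytes, but the decoded offsets are the same.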
[1] Are there enough non-standard byte configurations that an
enumerated field wouldn't be sufficient?
[2] The number of pad bytes inserted before each PCW, most likely needed
because of potential alignment issues. How does this affect the PCW itself
on different platforms, since the values themselves are fixed at 32-bit?
Below, each byte in a U32 value is expressed as 0x32, and each byte
within the PCW pad is expressed as 0xFF (to differentiate it from
the natural padding within a larger word, 0x00, which would be its
actual value). The examples are normalized to a common endianness.
32 bit PCW at 4 byte alignment (4/0)
    0x32323232 0x32323232 0x32323232 0x32323232
32 bit PCW at 8 byte alignment (4/4)
    0xFFFFFFFF 0x32323232 0xFFFFFFFF 0x32323232
64 bit PCW at 4 byte alignment (8/0)
    0x00000000 0x32323232 0x00000000 0x32323232
64 bit PCW at 8 byte alignment (8/0)
    0x00000000 0x32323232 0x00000000 0x32323232
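The pad/PCW combinations above can be reproduced with a small Python sketch. The function names and the pad_byte parameter are mine, purely for illustration. Big-endian is used only so the hex output reads like the normalized examples.

```python
def pack_u32(value, pcw_size, pcw_pad, byteorder="big", pad_byte=0x00):
    """Store a U32 as pcw_pad pad bytes followed by one PCW of pcw_size bytes."""
    if not 0 <= value <= 0xFFFFFFFF:
        raise ValueError("value does not fit in a U32")
    # A PCW wider than 4 bytes is filled with natural zero padding.
    return bytes([pad_byte]) * pcw_pad + value.to_bytes(pcw_size, byteorder)


def unpack_u32(buf, offset, pcw_size, pcw_pad, byteorder="big"):
    """Read back the U32 stored at offset, skipping the pad bytes."""
    start = offset + pcw_pad
    return int.from_bytes(buf[start:start + pcw_size], byteorder)


# The (4/4) case, with the pad shown as 0xFF as in the examples above.
print(pack_u32(0x32323232, 4, 4, pad_byte=0xFF).hex())  # ffffffff32323232
# The (8/0) case: no pad, natural zero padding inside the wider word.
print(pack_u32(0x32323232, 8, 0).hex())                 # 0000000032323232
```

Note that (4/4) and (8/0) take the same eight bytes per value; only the meaning of the first four differs.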
[3] The offset is relative to the *end* of the HEADER segment. The offsets
are in terms of (padded) PCWs.
[4] MAX_U32 is reserved for ERANGE. In the event of ERANGE, the
segments can only be found by walking the segments serially, computing
the next offset via the size fields.
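Notes [3] and [4] together amount to a small calculation, sketched here in Python. The function name and the use of None to signal ERANGE are my own choices for illustration, and header_size is assumed to be the byte length of the HEADER segment.

```python
MAX_U32 = 0xFFFFFFFF  # reserved: the offset did not fit (ERANGE)


def segment_byte_offset(toc_value, header_size, pcw_size, pcw_pad):
    """Turn a table-of-contents entry into a byte offset within the file.

    Returns None for ERANGE, in which case the caller has to walk the
    segments serially using their size fields instead.
    """
    if toc_value == MAX_U32:
        return None
    # Offsets count padded PCWs, starting from the end of the HEADER.
    return header_size + toc_value * (pcw_size + pcw_pad)
```

For example, with a 32-byte header and an unpadded 4-byte PCW, an entry of 5 lands at byte 32 + 5 * 4 = 52.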
[5] As measured in bytes. The size does not include the size field
itself, nor the segment padding bytes that occur at the end of the segment.
[6] Metadata is limited to MAX_U32 bytes.
[7] Segments are aligned to 8 bytes. The number of segment alignment bytes
is computed, and is not included in the size fields.
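Since the pad length is computed rather than stored, a reader would have to derive it. A minimal Python sketch follows, assuming (my assumptions, not settled above) that a segment starts on an 8-byte boundary, that its size field occupies one padded PCW, and that the size is given in bytes as in [5]. The function names are hypothetical.

```python
SEGMENT_ALIGN = 8


def segment_pad(size_field_bytes, body_bytes):
    """Pad bytes needed after a segment so the next one is 8-byte aligned."""
    return -(size_field_bytes + body_bytes) % SEGMENT_ALIGN


def next_segment_start(start, size_field_bytes, body_bytes):
    """Byte offset of the following segment, given an aligned start."""
    used = size_field_bytes + body_bytes
    return start + used + segment_pad(size_field_bytes, body_bytes)
```

A segment with a 4-byte size field and 10 bytes of data uses 14 bytes, so it gets 2 pad bytes and the next segment begins 16 bytes later.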
[8] Assuming we derive a portable floating point format. If that format
should be a STRING, floating point constants will reside in the string
constants section, and the floating point section will be removed.
[9] Offsets from the beginning of the first STRING constant. The offset at
string_index[number of strings] points to the first byte after the last
string.
[10] The STRING constant data must fit entirely within MAX_U32 - 1 bytes.
(This is an arbitrary limitation, simply to remove the possibility of having
to linearly scan each STRING to determine where the next one begins.)
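The offset table in [9] makes each string lookup a pair of slices, with no scanning. Here is a rough Python sketch of it. The function names are made up, and the offsets list is assumed to already be decoded into plain integers, with one more entry than there are strings.

```python
def build_string_table(strings):
    """Build the offset list (N + 1 entries) and the packed STRING data."""
    offsets = [0]
    for s in strings:
        offsets.append(offsets[-1] + len(s))
    return offsets, b"".join(strings)


def get_string(offsets, data, index):
    """Fetch string number index without scanning the ones before it."""
    if not 0 <= index < len(offsets) - 1:
        raise IndexError("no such string constant")
    # The final offset points one byte past the last string, so
    # offsets[index + 1] is valid even for the last string.
    return data[offsets[index]:offsets[index + 1]]


offsets, data = build_string_table([b"foo", b"", b"quux"])
print(offsets)                       # [0, 3, 3, 7]
print(get_string(offsets, data, 2))  # b'quux'
```

An empty string simply shows up as two equal consecutive offsets.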
[11] As measured in (padded) PCWs. The size does not include the size field
itself, nor the segment padding bytes that occur at the end of the segment.
[12] Currently, as one STRING. It could potentially be a list (similar to
STRING constants) of source code lines, or of file offsets of multiple
files within a larger compilation unit.
--
Bryan C. Warnock