Jean, Attached is an initial document on Derby page formats. I used forrest 0.6, so it should be easy to integrate.
Is it possible to put this up without publishing it - so that others can review and comment on it? BTW, it is not complete .... Regards Dibyendu
<?xml version="1.0"?> <!DOCTYPE document PUBLIC "-//APACHE//DTD Documentation V2.0//EN" "http://forrest.apache.org/dtd/document-v20.dtd"> <document> <header> <title>Derby On Disk Page Format</title> <abstract>This document describes the storage format of Derby disk pages. </abstract> </header> <body> <section id="introduction"> <title> Introduction </title> <p>Derby stores table and index data in Containers, which currently map to files in the <code>seg0</code> directory of the database. Data is stored in pages within the container.</p> <fixme author="Dibyendu Majumdar"> Do all containers map to a single file, or does each container map to a file? </fixme> <p>A page contains a set of records, which can be accessed by "slot", which defines the order of the records on the page, or by "id", which defines the identity of the records on the page. Clients access records by both slot and id, depending on their needs.</p> <p>There are two types of pages - Raw Stored Pages which hold data, and Raw Stored Alloc Pages which hold page allocation information.</p> <p>A Table or a BTree index provides a row-based access mechanism (the row-based access interface is known as a conglomerate). Rows are mapped to records in pages; in the case of a table, a single row can span multiple records in multiple pages.</p> </section> <section id="storedpage"> <title>Data Page Format</title> <p>A data page is broken into five sections. <img src="page-format.png" alt=""/> </p> <section id="formatid"> <title>Format Id </title> <p> The formatId is a 4-byte array that contains the format Id of this page. The possible values are RAW_STORE_STORED_PAGE or RAW_STORE_ALLOC_PAGE.</p> </section> <section id="pageheader"> <title> Page Header </title> <p> The page header is a fixed size, 56 bytes. 
</p> <table> <tr> <th>Size</th> <th>Type</th> <th>Description</th> </tr> <tr> <td>1 byte</td> <td>boolean</td> <td>is page an overflow page</td> </tr> <tr> <td>1 byte</td> <td>byte</td> <td><p>page status is either VALID_PAGE or INVALID_PAGE (a field maintained in base page)</p> <p>page goes through the following transitions: <br/> VALID_PAGE &lt;-&gt; deallocated page -&gt; free page &lt;-&gt; VALID_PAGE</p> <p>deallocated and free pages are both INVALID_PAGE as far as BasePage is concerned. <br/> When a page is deallocated, it transitions from VALID_PAGE to INVALID_PAGE. <br/> When a page is allocated, it transitions from INVALID_PAGE to VALID_PAGE.</p></td> </tr> <tr> <td>8 bytes</td> <td>long</td> <td>pageVersion (a field maintained in base page)</td> </tr> <tr> <td>2 bytes</td> <td>unsigned short</td> <td>number of slots in slot offset table</td> </tr> <tr> <td>4 bytes</td> <td>integer</td> <td>next record identifier</td> </tr> <tr> <td>4 bytes</td> <td>integer</td> <td>generation number of this page (Future Use)</td> </tr> <tr> <td>4 bytes</td> <td>integer</td> <td>previous generation of this page (Future Use)</td> </tr> <tr> <td>8 bytes</td> <td>bipLocation</td> <td>the location of the beforeimage page (Future Use)</td> </tr> <tr> <td>2 bytes</td> <td>unsigned short</td> <td>number of deleted rows on page. 
(new release 2.0)</td> </tr> <tr> <td>2 bytes</td> <td>unsigned short</td> <td>% of the page to keep free for updates</td> </tr> <tr> <td>2 bytes</td> <td>short</td> <td>spare for future use</td> </tr> <tr> <td>4 bytes</td> <td>integer</td> <td>spare for future use (encryption uses this to write random bytes here).</td> </tr> <tr> <td>8 bytes</td> <td>long</td> <td>spare for future use</td> </tr> <tr> <td>8 bytes</td> <td>long</td> <td>spare for future use</td> </tr> </table> <note>Spare space is guaranteed to be written with "0", so any future use of a spare field should either not use "0" as a valid data item or pick 0 as a valid default value, so that on-the-fly upgrade can assume that 0 means the field was never assigned. </note> </section> <section id="records"> <title> Records </title> <p>The records section contains zero or more records. Each record starts with a Record Header.</p> <table> <caption>Record Header</caption> <tr> <th>Type</th> <th>Description</th> </tr> <tr> <td>1 byte</td> <td> <p>Status bits for the record header</p> <table> <tr> <td>RECORD_INITIAL</td> <td>used when record header is first initialized</td> </tr> <tr> <td>RECORD_DELETED</td> <td>used to indicate the record has been deleted</td> </tr> <tr> <td>RECORD_OVERFLOW</td> <td>used to indicate the record has been overflowed; it will point to the overflow page and ID</td> </tr> <tr> <td>RECORD_HAS_FIRST_FIELD</td> <td>used to indicate that firstField is stored. 
When RECORD_OVERFLOW and RECORD_HAS_FIRST_FIELD are both set, part of the record is on the page, and the record header also stores the overflow pointer to the next part of the record.</td> </tr> <tr> <td>RECORD_VALID_MASK</td> <td>A mask of valid bits that can be set currently, such that the following assert can be made: </td> </tr> </table></td> </tr> <tr> <td>compressed int</td> <td>record identifier</td> </tr> <tr> <td>compressed long</td> <td>overflow page only if RECORD_OVERFLOW is set</td> </tr> <tr> <td>compressed int</td> <td>overflow id only if RECORD_OVERFLOW is set</td> </tr> <tr> <td>compressed int</td> <td>first field only if RECORD_HAS_FIRST_FIELD is set - otherwise 0</td> </tr> <tr> <td>compressed int</td> <td>number of fields in this portion - only if RECORD_OVERFLOW is false OR RECORD_HAS_FIRST_FIELD is true - otherwise 0</td> </tr> </table> <note label="Long Rows"> A row is long if all of its columns can't fit on a single page. When storing a long row, the segment of the row which fits on the page is left there, and a pointer column is added at the end of the row. It points to another row in the same container on a different page. That row will contain the next set of columns and a continuation pointer if necessary. The overflow portion will be on an "overflow" page, and that page may have overflow portions of other rows on it (unlike overflow columns). </note> <p>The Record Header is followed by one or more fields. Each field contains a Field Header and optional Field Data.</p> <table> <caption>Stored Field Header Format</caption> <tr> <td>status</td> <td> <p> The status is 1 byte; it indicates the state of the field. A FieldHeader can be in the following states: </p> <table> <tr> <td>NULL</td> <td>if the field is NULL, no field data length is stored</td> </tr> <tr> <td>OVERFLOW</td> <td>indicates the field has been overflowed to another page. Overflow page and overflow ID are stored at the end of the user data. 
Field data length must be a number greater than or equal to 0, indicating the length of the field that is stored on the current page. The format looks like this: <img src="field-header-overflow.png" alt=""/> overflowPage will be written as a compressed long, overflowId will be written as a compressed int</td> </tr> <tr> <td>NONEXISTENT</td> <td>the field no longer exists, e.g. column has been dropped during an alter table</td> </tr> <tr> <td>EXTENSIBLE</td> <td>the field is of user defined data type. The field may be tagged.</td> </tr> <tr> <td>TAGGED</td> <td>the field is TAGGED if and only if it is EXTENSIBLE.</td> </tr> <tr> <td>FIXED</td> <td>the field is FIXED if and only if it is used in the log records for version 1.2 and higher.</td> </tr> </table> </td> </tr> <tr> <td>fieldDataLength</td> <td> The fieldDataLength is only set if the field is not NULL. It is the length of the field that is stored on the current page. The fieldDataLength is a variable length CompressedInt. </td> </tr> <tr> <td>fieldData</td> <td><p> Overflow page and overflow id are stored as field data. If the overflow bit in status is set, the field data is the overflow information. When the overflow bit is not set in status, fieldData is the actual user data for the field. That means the field header consists only of the field status and the field data length. <br/> A non-overflow field: <br/> <img src="field-header-non-overflow.png" alt=""/> <br/> An overflow field: <br/> <img src="field-header-overflow.png" alt=""/> <br/> <strong>overflowPage and overflowID</strong> <br/> The overflowPage is a variable length CompressedLong, overflowID is a variable length CompressedInt. They are only stored when the field state is OVERFLOW, and they are not stored in the field header. Instead, they are stored at the end of the field data. The reason we do that is to save a copy if the field has to overflow. </p> </td> </tr> </table> <note label="Long Columns"> A column is long if it can't fit on a single page. 
A long column is marked as long in the base row, and its field contains a pointer to a chain of other rows in the same container which contain the data of the row. Each of the subsequent rows is on a page to itself. Each subsequent row, except for the last piece, has 2 columns: the first is the next segment of the row and the second is the pointer to the following segment. The last segment only has the data segment. </note> </section> <section id="slottable"> <title>Slot Offset Table</title> <p>The slot offset table is a table of 6 or 12 bytes per record, depending on whether the pageSize is less than or greater than 64K: </p> <table> <caption>Slot Table Record</caption> <tr> <th>Size</th> <th>Content</th> </tr> <tr> <td>2 bytes (unsigned short) or 4 bytes (int)</td> <td>page offset for the record that is assigned to the slot</td> </tr> <tr> <td>2 bytes (unsigned short) or 4 bytes (int)</td> <td>the length of the record on this page.</td> </tr> <tr> <td>2 bytes (unsigned short) or 4 bytes (int)</td> <td>the length of the reserved number of bytes for this record on this page.</td> </tr> </table> <p> First slot is slot 0. The slot table grows backwards. Slots are never left empty. </p> </section> <section id="checksum"> <title>Checksum</title> <p>8 bytes holding a java.util.zip.CRC32 checksum of the entire page's contents, excluding the 8 bytes representing the checksum itself.</p> </section> </section> <section id="allocpage"> <title>Allocation Page</title> <p> An allocation page of the file container extends a normal Stored page, with the exception that a hunk of space may be 'borrowed' by the file container to store the file header.</p> <p> The borrowed space is not visible to the alloc page even though it is present in the page data array. It is accessed directly by the FileContainer. 
Any change made to the borrowed space is not managed or seen by the allocation page.</p> <p> The reason for having this borrowed space is so that the container header does not need to have a page of its own. </p> <p> <strong>Page Format</strong> <br/> An allocation page extends a stored page; the on-disk format differs from a stored page in that N bytes are 'borrowed' by the container and the page header of an allocation page will be slightly bigger than that of a normal stored page. These N bytes are stored between the page header and the record space.</p> <p> The reason these N bytes can't simply be a row is that they need to be statically accessible by the container object, to avoid a chicken and egg problem of the container object needing to instantiate an alloc page object before it can be objectified, and an alloc page object needing to instantiate a container object before it can be objectified. So these N bytes must be stored outside of the normal record interface, yet they must be settable because only the first alloc page has this borrowed space. Other (non-first) alloc pages have N == 0. <br/> <img src="alloc-page.png" alt=""/> </p> </section> </body> <footer> <legal>This is a legal notice, so it is <strong>important</strong> .</legal> </footer> </document>
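The slot offset table and checksum sections of the document above can be illustrated with a small Java sketch. It is not taken from the Derby source: the class and method names are hypothetical, and it assumes that the checksum occupies the last 8 bytes of the page as a big-endian long, that slot 0 sits immediately before the checksum with later slots at decreasing offsets, and that a page size of exactly 64K uses the larger 12-byte slot entries.

```java
import java.nio.ByteBuffer;
import java.util.zip.CRC32;

// Illustrative sketch only; names are hypothetical and not Derby APIs.
class PageLayoutSketch {
    static final int CHECKSUM_SIZE = 8; // size stated in the Checksum section

    // Three 2-byte fields (6 bytes) per slot below 64K, three 4-byte fields (12 bytes) otherwise.
    // Assumption: a page of exactly 64K uses the 4-byte form.
    static int slotEntrySize(int pageSize) {
        return pageSize < 65536 ? 6 : 12;
    }

    // Assumption: slot 0 is placed directly before the checksum, and each later
    // slot is placed at a lower offset ("the slot table grows backwards").
    static int slotEntryOffset(int pageSize, int slot) {
        return pageSize - CHECKSUM_SIZE - (slot + 1) * slotEntrySize(pageSize);
    }

    // CRC32 over the page contents, excluding the trailing checksum bytes.
    static long computeChecksum(byte[] page) {
        CRC32 crc = new CRC32();
        crc.update(page, 0, page.length - CHECKSUM_SIZE);
        return crc.getValue();
    }

    // Assumption: the checksum is stored as a big-endian long in the last 8 bytes.
    static void writeChecksum(byte[] page) {
        ByteBuffer.wrap(page).putLong(page.length - CHECKSUM_SIZE, computeChecksum(page));
    }

    // Recomputes the CRC32 and compares it with the stored value.
    static boolean checksumMatches(byte[] page) {
        long stored = ByteBuffer.wrap(page).getLong(page.length - CHECKSUM_SIZE);
        return stored == computeChecksum(page);
    }

    public static void main(String[] args) {
        byte[] page = new byte[4096];
        page[100] = 42;          // pretend some record data was written
        writeChecksum(page);
        System.out.println("slot 0 entry offset: " + slotEntryOffset(page.length, 0));
        System.out.println("checksum ok: " + checksumMatches(page));
        page[100] = 43;          // corrupt one byte without updating the checksum
        System.out.println("checksum ok after change: " + checksumMatches(page));
    }
}
```

Keeping the checksum outside the range it covers means a reader can verify a page without first zeroing the checksum field. The slot offsets here are only as reliable as the layout assumptions listed above and should be checked against the actual page-format diagram.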
field-header-non-overflow.aart
field-header-overflow.aart
page-format.aart
alloc-page.aart
