Re: Derby architecture/design documents

Dibyendu Majumdar 31 Jan 2005 01:16:06 -0000

Jean,

Attached is an initial document on Derby page formats. I used forrest 0.6,
so it should be easy to integrate.


Is it possible to put this up without publishing it - so that others can
review and comment on it?

BTW, it is not complete ....

Regards

Dibyendu

<?xml version="1.0"?>
<!DOCTYPE document PUBLIC "-//APACHE//DTD Documentation V2.0//EN" "http://forrest.apache.org/dtd/document-v20.dtd";>
<document> 
  <header> 
    <title>Derby On Disk Page Format</title>
    <abstract>This document describes the storage format of Derby disk pages. 
    </abstract>
  </header>
  <body>
    <section id="introduction"> 
      <title> Inroduction </title>
      <p>Derby stores table and index data in Containers, which currently map 
        to files in the 
        <code>seg0</code>
        directory of the database. Data is stored in pages within the container.</p>
      <fixme author="Dibyendu Majumdar"> Do all containers map to a single file, or does each container map 
        to a file? </fixme>
      <p>A page contains a set of records, which can be accessed by "slot", which 
        defines the order of the records on the page, or by "id" which defines 
        the identity of the records on the page. Clients access records by both 
        slot and id, depending on their needs.</p>
      <p>There are two types of pages - Raw Stored Pages which hold data, and 
        Raw Stored Alloc Pages which hold page allocation information.</p>
      <p>A Table or a BTree index provides a row-based access mechanism (row-based 
        access interface is known as conglomerate). Rows are mapped to records 
        in pages, in case of a table, a single row can span multiple records in 
        multiple pages.</p>
    </section>
    <section id="storedpage"> 
      <title>Data Page Format</title>
      <p>A data page is broken into five sections. 
        <img src="page-format.png" alt=""/>
      </p>
      <section id="formatid"> 
        <title>Format Id </title>
        <p> The formatId is a 4 bytes array, it contains the format Id of this 
          page. The possible values are RAW_STORE_STORED_PAGE or RAW_STORE_ALLOC_PAGE.</p>
      </section>
      <section id="pageheader"> 
        <title> Page Header </title>
        <p> The page header is a fixed size, 56 bytes. </p>
          <table> 
            <tr> 
              <th>Size</th>
              <th>Type</th>
              <th>Description</th>
            </tr>
            <tr> 
              <td>1 byte</td>
              <td>boolean</td>
              <td>is page an overflow page</td>
            </tr>
            <tr> 
              <td>1 byte</td>
              <td>byte</td>
              <td><p>page status is either VALID_PAGE or INVALID_PAGE(a field 
                  maintained in base page)</p>
                <p>page goes thru the following transition: 
                  <br/>
                  VALID_PAGE &lt;-&gt; deallocated page -&gt; free page &lt;-&gt; 
                  VALID_PAGE</p>
                <p>deallocated and free page are both INVALID_PAGE as far as BasePage 
                  is concerned. 
                  <br/>
                  When a page is deallocated, it transitioned from VALID_PAGE 
                  to INVALID_PAGE. 
                  <br/>
                  When a page is allocated, it trnasitioned from INVALID_PAGE 
                  to VALID_PAGE.</p></td>
            </tr>
            <tr> 
              <td>8 bytes</td>
              <td>long</td>
              <td>pageVersion (a field maintained in base page)</td>
            </tr>
            <tr> 
              <td>2 bytes</td>
              <td>unsigned short</td>
              <td>number of slots in slot offset table</td>
            </tr>
            <tr> 
              <td>4 bytes</td>
              <td>integer</td>
              <td>next record identifier</td>
            </tr>
            <tr> 
              <td>4 bytes</td>
              <td>integer</td>
              <td>generation number of this page (Future Use)</td>
            </tr>
            <tr> 
              <td>4 bytes</td>
              <td>integer</td>
              <td>previous generation of this page (Future Use)</td>
            </tr>
            <tr> 
              <td>8 bytes</td>
              <td>bipLocation</td>
              <td>the location of the beforeimage page (Future Use)</td>
            </tr>
            <tr> 
              <td>2 bytes</td>
              <td>unsigned short</td>
              <td>number of deleted rows on page. (new release 2.0)</td>
            </tr>
            <tr> 
              <td>2 bytes</td>
              <td>unsigned short</td>
              <td>% of the page to keep free for updates</td>
            </tr>
            <tr> 
              <td>2 bytes</td>
              <td>short</td>
              <td>spare for future use</td>
            </tr>
            <tr> 
              <td>4 bytes</td>
              <td>long</td>
              <td>spare for future use (encryption uses to write random bytes 
                here).</td>
            </tr>
            <tr> 
              <td>8 bytes</td>
              <td>long</td>
              <td>spare for future use</td>
            </tr>
            <tr> 
              <td>8 bytes</td>
              <td>long</td>
              <td>spare for future use</td>
            </tr>
          </table>
          <note>Spare space is guaranteed to be writen with "0", so that future 
            use of field should not either not use "0" as a valid data item or 
            pick 0 as a valid default value so that on the fly upgrade can assume 
            that 0 means field was never assigned. </note>
        
      </section>
      <section id="records"> 
        <title> Records </title>
        
          <p>The records section contains zero or more records. Each record starts 
            with a Record Header</p>
          <table> 
            <caption>Record Header</caption>
            <tr> 
              <th>Type</th>
              <th>Description</th>
            </tr>
            <tr> 
              <td>1 byte</td>
              <td> <p>Status bits for the record header</p>
                <table> 
                  <tr> 
                    <td>RECORD_INITIAL</td>
                    <td>used when record header is first initialized</td>
                  </tr>
                  <tr> 
                    <td>RECORD_DELETED</td>
                    <td>used to indicate the record has been deleted</td>
                  </tr>
                  <tr> 
                    <td>RECORD_OVERFLOW</td>
                    <td>used to indicate the record has been overflowed, it will 
                      point to the overflow page and ID</td>
                  </tr>
                  <tr> 
                    <td>RECORD_HAS_FIRST_FIELD</td>
                    <td>used to indicate that firstField is stored will be stored. 
                      When RECORD_OVERFLOW and RECORD_HAS_FIRST_FIELD both are 
                      set, part of record is on the page, the record header also 
                      stores the overflow point to the next part of the record.</td>
                  </tr>
                  <tr> 
                    <td>RECORD_VALID_MASK</td>
                    <td>A mask of valid bits that can be set currently, such that 
                      the following assert can be made: </td>
                  </tr>
                </table></td>
            </tr>
            <tr> 
              <td>compressed int</td>
              <td>record identifier</td>
            </tr>
            <tr> 
              <td>compressed long</td>
              <td>overflow page only if RECORD_OVERFLOW is set</td>
            </tr>
            <tr> 
              <td>compressed int</td>
              <td>overflow id only if RECORD_OVERFLOW is set</td>
            </tr>
            <tr> 
              <td>compressed int</td>
              <td>first field only if RECORD_HAS_FIRST_FIELD is set - otherwise 
                0</td>
            </tr>
            <tr> 
              <td>compressed int</td>
              <td>number of fields in this portion - only if RECORD_OVERFLOW is 
                false OR RECORD_HAS_FIRST_FIELD is true - otherwise 0</td>
            </tr>
          </table>
          <note label="Long Rows"> A row is long if all of it's columns can't fit on a single page. 
            When storing a long row, the segment of the row which fits on the 
            page is left there, and a pointer column is added at the end of the 
            row. It points to another row in the same container on a different 
            page. That row will contain the next set of columns and a continuation 
            pointer if necessary. The overflow portion will be on an "overflow" 
            page, and that page may have overflow portions of other rows on it 
            (unlike overflow columns). </note>
          <p>The Record Header is followed by one or more fields. Each field contains 
            a Field Header and optional Field Data.</p>
          <table> 
            <caption>Stored Field Header Format</caption>
            <tr> 
              <td>status</td>
              <td> <p> The status is 1 byte, it indicates the state of the field. 
                  A FieldHeader can be in the following states: </p>
                  <table> 
                    <tr> 
                      <td>NULL</td>
                      <td>if the field is NULL, no field data length is stored</td>
                    </tr>
                    <tr> 
                      <td>OVERFLOW</td>
                      <td>indicates the field has been overflowed to another page. 
                        overflow page and overflow ID is stored at the end of 
                        the user data. field data length must be a number greater 
                        or equal to 0, indicating the length of the field that 
                        is stored on the current page. The format looks like this: 
                        <img src="field-header-overflow.png" alt=""/>
                        overflowPage will be written as compressed long, overflowId 
                        will be written as compressed Int</td>
                    </tr>
                    <tr> 
                      <td>NONEXISTENT</td>
                      <td>the field no longer exists, e.g. column has been dropped 
                        during an alter table</td>
                    </tr>
                    <tr> 
                      <td>EXTENSIBLE</td>
                      <td>the field is of user defined data type. The field may 
                        be tagged.</td>
                    </tr>
                    <tr> 
                      <td>TAGGED</td>
                      <td>the field is TAGGED if and only if it is EXTENSIBLE.</td>
                    </tr>
                    <tr> 
                      <td>FIXED</td>
                      <td>the field is FIXED if and only if it is used in the 
                        log records for version 1.2 and higher.</td>
                    </tr>
                  </table>
                </td>
            </tr>
            <tr> 
              <td>fieldDataLength</td>
              <td> The fieldDataLength is only set if the field is not NULL. It 
                is the length of the field that is stored on the current page. 
                The fieldDataLength is a variable length CompressedInt. </td>
            </tr>
            <tr> 
              <td>fieldData</td>
              <td><p> Overflow page and overflow id are stored as field data. If 
                the overflow bit in status is set, the field data is the overflow 
                information. When the overflow bit is not set in status, then, 
                fieldData is the actually user data for the field. That means, 
                field header consists only field status, and field data length. 
                <br/>
                A non-overflow field: 
                <br/> <img src="field-header-non-overflow.png" alt=""/> <br/>
                An overflow field: 
                <br/> <img src="field-header-overflow.png" alt=""/> <br/> <strong>overflowPage 
                  and overflowID</strong> <br/>
                The overflowPage is a variable length CompressedLong, overflowID 
                is a variable Length CompressedInt. They are only stored when 
                the field state is OVERFLOW. And they are not stored in the field 
                header. Instead, they are stored at the end of the field data. 
                The reason we do that is to save a copy if the field has to overflow. </p>
              </td>
            </tr>
          </table>
          <note label="Long Columns"> A column is long if it can't fit on a single page. A long column 
            is marked as long in the base row, and it's field contains a pointer 
            to a chain of other rows in the same container with contain the data 
            of the row. Each of the subsequent rows is on a page to itself. Each 
            subsquent row, except for the last piece has 2 columns, the first 
            is the next segment of the row and the second is the pointer to the 
            the following segment. The last segment only has the data segment. 
          </note>
        
      </section>
      <section id="slottable"> 
        <title>Slot Offset Table</title>
        <p>The slot offset table is a table of 6 or 12 bytes per record, depending 
          on the pageSize being less or greater than 64K: </p>
          <table> 
            <caption>Slot Table Record</caption>
            <tr> 
              <th>Size</th>
              <th>Content</th>
            </tr>
            <tr> 
              <td>2 bytes (unsigned short) or 4 bytes (int)</td>
              <td>page offset for the record that is assigned to the slot</td>
            </tr>
            <tr> 
              <td>2 bytes (unsigned short) or 4 bytes (int)</td>
              <td>the length of the record on this page.</td>
            </tr>
            <tr> 
              <td>2 bytes (unsigned short) or 4 bytes (int)</td>
              <td>the length of the reserved number of bytes for this record on 
                this page.</td>
            </tr>
          </table>
		  <p>
          First slot is slot 0. The slot table grows backwards. Slots are never 
          left empty. </p>
      </section>
      <section id="checksum"> 
        <title>Checksum</title>
        <p>8 bytes of a java.util.zip.CRC32 checksum of the entire's page contents 
          without the 8 bytes representing the checksum.</p>
      </section>
	</section>
    <section id="allocpage"> 
      <title>Allocation Page</title>
      <p> An allocation page of the file container extends a normal Stored page, 
        with the exception that a hunk of space may be 'borrowed' by the file 
        container to store the file header.</p>
      <p> The borrowed space is not visible to the alloc page even though it is 
        present in the page data array. It is accessed directly by the FileContainer. 
        Any change made to the borrowed space is not managed or seen by the allocation 
        page.</p>
      <p> The reason for having this borrowed space is so that the container header 
        does not need to have a page of its own. </p>
      <p> 
        <strong>Page Format</strong>
        <br/>
        An allocation page extends a stored page, the on disk format is different 
        from a stored page in that N bytes are 'borrowed' by the container and 
        the page header of an allocation page will be slightly bigger than a normal 
        stored page. This N bytes are stored between the page header and the record 
        space.</p>
      <p> The reason why this N bytes can't simply be a row is because it needs 
        to be statically accessible by the container object to avoid a chicken 
        and egg problem of the container object needing to instantiate an alloc 
        page object before it can be objectified, and an alloc page object needing 
        to instantiate a container object before it can be objectified. So this 
        N bytes must be stored outside of the normal record interface yet it must 
        be settable because only the first alloc page has this borrowed space. 
        Other (non-first) alloc page have N == 0. 
        <br/>
        <img src="alloc-page.png" alt=""/>
      </p>
    </section>
  </body>
  <footer> 
    <legal>This is a legal notice, so it is 
      <strong>important</strong>
      .</legal>
  </footer>
</document>

field-header-non-overflow.aart
Description: Binary data

field-header-overflow.aart
Description: Binary data

page-format.aart
Description: Binary data

alloc-page.aart
Description: Binary data

Re: Derby architecture/design documents

Reply via email to