Re: Derby architecture/design documents

Dibyendu Majumdar 2 Feb 2005 00:52:54 -0000

From: "Jean T. Anderson" <[EMAIL PROTECTED]>
> Dibyendu Majumdar's doc on "Derby On Disk Page Format" is now live at 
> http://incubator.apache.org/derby/papers/pageformats.html.


Jean,

Attached is an updated version and an additional aart file.

All,

I am going to work on the following areas in the next few days:

a) Logging/recovery
b) BTree notes 

I will post stuff as I progress.

Regards

Dibyendu

<?xml version="1.0"?>
<!DOCTYPE document PUBLIC "-//APACHE//DTD Documentation V2.0//EN" "http://forrest.apache.org/dtd/document-v20.dtd";>
<document>
  <header>
    <title>Derby On Disk Page Format</title>
    <abstract>This document describes the storage format of Derby disk pages. 

    This is a work-in-progress derived from Javadoc comments and 

    from explanations Mike Matrigali posted to the Derby lists. 

    Please post questions, comments, and corrections to 

    [EMAIL PROTECTED]

    </abstract>
  </header>
  <body>
    <section id="introduction">
      <title> Introduction </title>
      <p>Derby stores table and index data in Containers, which currently map 

        to files in the <code>seg0</code>

        directory of the database. In the current Derby implementation there is a a 1 to 1 mapping of

        containers to files. Two containers never map to a single file and 1

	  container never maps to multiple containers.</p>
      <p>

       Data is stored in pages within the container.</p>
      <p>A page contains a set of records, which can be accessed by "slot", which 

        defines the order of the records on the page, or by "id" which defines 

        the identity of the records on the page. Clients access records by both 

        slot and id, depending on their needs.</p>
      <p>A Table or a BTree index provides a row-based access mechanism (row-based 

        access interface is known as conglomerate). Rows are mapped to records 

        in data pages; in case of a table, a single row can span multiple records in 

        multiple pages.</p>
      <p>A container can have three types of pages:</p>
      <ul>
        <li>Header Page - which is just a specialized version of the Alloc Page.</li>
        <li>Data Pages which hold data, and</li>
        <li>Alloc Pages which hold page allocation information. An Alloc page is a specialized verion of the Data page.</li>
      </ul>
      <p>The container can be visualised as:<br/><img alt="" src="container-format.png"/></p>
      <p>

Header Page is currently always page 0 of the container.  It

contains information that raw store needs to maintain about the

container once per container, and is currently implemented as an Alloc

Page which "borrows" space from the alloc page for it's information.

The original decision was that the designers did not want to waste a whole page for

header information, so a part of the page was used and the first allocation

map was put on the second half of it. See <code>AllocPage.java</code> for info about layout and

borrowing.

</p>
      <p>
        <a href="#allocpage"> Allocation Page</a> - After page 0, all subsequent Allocation pages only

have allocation bit maps.

</p>
    </section>
    <section id="storedpage">
      <title>Data Page Format</title>
      <p>A data page is broken into five sections. 

        <img alt="" src="page-format.png"/></p>
      <section id="formatid">
        <title>Format Id </title>
        <p> The formatId is a 4 bytes array, it contains the format Id of this 

          page. The possible values are RAW_STORE_STORED_PAGE or RAW_STORE_ALLOC_PAGE.</p>
      </section>
      <section id="pageheader">
        <title> Page Header </title>
        <p> The page header is a fixed size, 56 bytes. </p>
        <table>
          <tr>
            <th>Size</th>
            <th>Type</th>
            <th>Description</th>
          </tr>
          <tr>
            <td>1 byte</td>
            <td>boolean</td>
            <td>is page an overflow page</td>
          </tr>
          <tr>
            <td>1 byte</td>
            <td>byte</td>
            <td>
              <p>page status is either VALID_PAGE or INVALID_PAGE(a field 

                  maintained in base page)</p>
              <p>page goes thru the following transition: 

                  <br/>

                  VALID_PAGE &lt;-&gt; deallocated page -&gt; free page &lt;-&gt; 

                  VALID_PAGE</p>
              <p>deallocated and free page are both INVALID_PAGE as far as BasePage 

                  is concerned. 

                  <br/>

                  When a page is deallocated, it transitioned from VALID_PAGE 

                  to INVALID_PAGE. 

                  <br/>

                  When a page is allocated, it trnasitioned from INVALID_PAGE 

                  to VALID_PAGE.</p>
            </td>
          </tr>
          <tr>
            <td>8 bytes</td>
            <td>long</td>
            <td>pageVersion (a field maintained in base page)</td>
          </tr>
          <tr>
            <td>2 bytes</td>
            <td>unsigned short</td>
            <td>number of slots in slot offset table</td>
          </tr>
          <tr>
            <td>4 bytes</td>
            <td>integer</td>
            <td>next record identifier</td>
          </tr>
          <tr>
            <td>4 bytes</td>
            <td>integer</td>
            <td>generation number of this page (Future Use)</td>
          </tr>
          <tr>
            <td>4 bytes</td>
            <td>integer</td>
            <td>previous generation of this page (Future Use)</td>
          </tr>
          <tr>
            <td>8 bytes</td>
            <td>bipLocation</td>
            <td>the location of the beforeimage page (Future Use)</td>
          </tr>
          <tr>
            <td>2 bytes</td>
            <td>unsigned short</td>
            <td>number of deleted rows on page. (new release 2.0)</td>
          </tr>
          <tr>
            <td>2 bytes</td>
            <td>unsigned short</td>
            <td>% of the page to keep free for updates</td>
          </tr>
          <tr>
            <td>2 bytes</td>
            <td>short</td>
            <td>spare for future use</td>
          </tr>
          <tr>
            <td>4 bytes</td>
            <td>long</td>
            <td>spare for future use (encryption uses to write random bytes 

                here).</td>
          </tr>
          <tr>
            <td>8 bytes</td>
            <td>long</td>
            <td>spare for future use</td>
          </tr>
          <tr>
            <td>8 bytes</td>
            <td>long</td>
            <td>spare for future use</td>
          </tr>
        </table>
        <note>Spare space is guaranteed to be writen with "0", so that future 

            use of field should not either not use "0" as a valid data item or 

            pick 0 as a valid default value so that on the fly upgrade can assume 

            that 0 means field was never assigned. </note>
      </section>
      <section id="records">
        <title> Records </title>
        <p>The records section contains zero or more records. Each record starts 

            with a Record Header</p>
        <table>
          <caption>Record Header</caption>
          <tr>
            <th>Type</th>
            <th>Description</th>
          </tr>
          <tr>
            <td>1 byte</td>
            <td>
              <p>Status bits for the record header</p>
              <table>
                <tr>
                  <td>RECORD_INITIAL</td>
                  <td>used when record header is first initialized</td>
                </tr>
                <tr>
                  <td>RECORD_DELETED</td>
                  <td>used to indicate the record has been deleted</td>
                </tr>
                <tr>
                  <td>RECORD_OVERFLOW</td>
                  <td>used to indicate the record has been overflowed, it will 

                      point to the overflow page and ID</td>
                </tr>
                <tr>
                  <td>RECORD_HAS_FIRST_FIELD</td>
                  <td>used to indicate that firstField is stored will be stored. 

                      When RECORD_OVERFLOW and RECORD_HAS_FIRST_FIELD both are 

                      set, part of record is on the page, the record header also 

                      stores the overflow point to the next part of the record.</td>
                </tr>
                <tr>
                  <td>RECORD_VALID_MASK</td>
                  <td>A mask of valid bits that can be set currently, such that 

                      the following assert can be made: </td>
                </tr>
              </table>
            </td>
          </tr>
          <tr>
            <td>compressed int</td>
            <td>record identifier</td>
          </tr>
          <tr>
            <td>compressed long</td>
            <td>overflow page only if RECORD_OVERFLOW is set</td>
          </tr>
          <tr>
            <td>compressed int</td>
            <td>overflow id only if RECORD_OVERFLOW is set</td>
          </tr>
          <tr>
            <td>compressed int</td>
            <td>first field only if RECORD_HAS_FIRST_FIELD is set - otherwise 

                0</td>
          </tr>
          <tr>
            <td>compressed int</td>
            <td>number of fields in this portion - only if RECORD_OVERFLOW is 

                false OR RECORD_HAS_FIRST_FIELD is true - otherwise 0</td>
          </tr>
        </table>
        <note label="Long Rows"> A row is long if all of it's columns can't fit on a single page. 

            When storing a long row, the segment of the row which fits on the 

            page is left there, and a pointer column is added at the end of the 

            row. It points to another row in the same container on a different 

            page. That row will contain the next set of columns and a continuation 

            pointer if necessary. The overflow portion will be on an "overflow" 

            page, and that page may have overflow portions of other rows on it 

            (unlike overflow columns). </note>
        <p>The Record Header is followed by one or more fields. Each field contains 

            a Field Header and optional Field Data.</p>
        <table>
          <caption>Stored Field Header Format</caption>
          <tr>
            <td>status</td>
            <td>
              <p> The status is 1 byte, it indicates the state of the field. 

                  A FieldHeader can be in the following states: </p>
              <table>
                <tr>
                  <td>NULL</td>
                  <td>if the field is NULL, no field data length is stored</td>
                </tr>
                <tr>
                  <td>OVERFLOW</td>
                  <td>indicates the field has been overflowed to another page. 

                        overflow page and overflow ID is stored at the end of 

                        the user data. field data length must be a number greater 

                        or equal to 0, indicating the length of the field that 

                        is stored on the current page. The format looks like this: 

                        <img alt="" src="field-header-overflow.png"/>

                        overflowPage will be written as compressed long, overflowId 

                        will be written as compressed Int</td>
                </tr>
                <tr>
                  <td>NONEXISTENT</td>
                  <td>the field no longer exists, e.g. column has been dropped 

                        during an alter table</td>
                </tr>
                <tr>
                  <td>EXTENSIBLE</td>
                  <td>the field is of user defined data type. The field may 

                        be tagged.</td>
                </tr>
                <tr>
                  <td>TAGGED</td>
                  <td>the field is TAGGED if and only if it is EXTENSIBLE.</td>
                </tr>
                <tr>
                  <td>FIXED</td>
                  <td>the field is FIXED if and only if it is used in the 

                        log records for version 1.2 and higher.</td>
                </tr>
              </table>
            </td>
          </tr>
          <tr>
            <td>fieldDataLength</td>
            <td> The fieldDataLength is only set if the field is not NULL. It 

                is the length of the field that is stored on the current page. 

                The fieldDataLength is a variable length CompressedInt. </td>
          </tr>
          <tr>
            <td>fieldData</td>
            <td>
              <p> Overflow page and overflow id are stored as field data. If 

                the overflow bit in status is set, the field data is the overflow 

                information. When the overflow bit is not set in status, then, 

                fieldData is the actually user data for the field. That means, 

                field header consists only field status, and field data length. 

                <br/>

                A non-overflow field: 

                <br/><img alt="" src="field-header-non-overflow.png"/><br/>

                An overflow field: 

                <br/><img alt="" src="field-header-overflow.png"/><br/><strong>overflowPage 

                  and overflowID</strong><br/>

                The overflowPage is a variable length CompressedLong, overflowID 

                is a variable Length CompressedInt. They are only stored when 

                the field state is OVERFLOW. And they are not stored in the field 

                header. Instead, they are stored at the end of the field data. 

                The reason we do that is to save a copy if the field has to overflow. </p>
            </td>
          </tr>
        </table>
        <note label="Long Columns"> A column is long if it can't fit on a single page. A long column 

            is marked as long in the base row, and it's field contains a pointer 

            to a chain of other rows in the same container with contain the data 

            of the row. Each of the subsequent rows is on a page to itself. Each 

            subsquent row, except for the last piece has 2 columns, the first 

            is the next segment of the row and the second is the pointer to the 

            the following segment. The last segment only has the data segment. 

          </note>
      </section>
      <section id="slottable">
        <title>Slot Offset Table</title>
        <p>The slot offset table is a table of 6 or 12 bytes per record, depending 

          on the pageSize being less or greater than 64K: </p>
        <table>
          <caption>Slot Table Record</caption>
          <tr>
            <th>Size</th>
            <th>Content</th>
          </tr>
          <tr>
            <td>2 bytes (unsigned short) or 4 bytes (int)</td>
            <td>page offset for the record that is assigned to the slot</td>
          </tr>
          <tr>
            <td>2 bytes (unsigned short) or 4 bytes (int)</td>
            <td>the length of the record on this page.</td>
          </tr>
          <tr>
            <td>2 bytes (unsigned short) or 4 bytes (int)</td>
            <td>the length of the reserved number of bytes for this record on 

                this page.</td>
          </tr>
        </table>
        <p>

          First slot is slot 0. The slot table grows backwards. Slots are never 

          left empty. </p>
      </section>
      <section id="checksum">
        <title>Checksum</title>
        <p>8 bytes of a java.util.zip.CRC32 checksum of the entire's page contents 

          without the 8 bytes representing the checksum.</p>
      </section>
    </section>
    <section id="allocpage">
      <title>Allocation Page</title>
      <p> An allocation page of the file container extends a normal Stored page, 

        with the exception that a hunk of space may be 'borrowed' by the file 

        container to store the file header.</p>
      <p> The borrowed space is not visible to the alloc page even though it is 

        present in the page data array. It is accessed directly by the FileContainer. 

        Any change made to the borrowed space is not managed or seen by the allocation 

        page.</p>
      <p> The reason for having this borrowed space is so that the container header 

        does not need to have a page of its own. </p>
      <p>
        <strong>Page Format</strong>
        <br/>

        An allocation page extends a stored page, the on disk format is different 

        from a stored page in that N bytes are 'borrowed' by the container and 

        the page header of an allocation page will be slightly bigger than a normal 

        stored page. This N bytes are stored between the page header and the record 

        space.</p>
      <p> The reason why this N bytes can't simply be a row is because it needs 

        to be statically accessible by the container object to avoid a chicken 

        and egg problem of the container object needing to instantiate an alloc 

        page object before it can be objectified, and an alloc page object needing 

        to instantiate a container object before it can be objectified. So this 

        N bytes must be stored outside of the normal record interface yet it must 

        be settable because only the first alloc page has this borrowed space. 

        Other (non-first) alloc page have N == 0. 

        <br/><img alt="" src="alloc-page.png"/></p>
      <p>

	N is a byte that indicates the size of the borrowed space.  Once an alloc

	page is initialized, the value of N cannot change.

	</p>
      <p>

	The maximum space that can be borrowed by the container is 256 bytes.

      </p>
      <p>

	The allocation pages are of the same page size as any other pages in the

	container. The first allocation page of the FileContainer starts at the

	first physical byte of the container.  Subsequent allocation pages are

	chained via the nextAllocPageOffset.  Each allocation page is expected to

	manage at least 1000 user pages (for 1K page size) so this chaining may not

	be a severe performance hit.  The logical -> physical mapping of an

	allocation page is stored in the previous allocation page.  The container

	object will need to maintain this mapping.</p>
      <p>

	The following fields are stored in the page header:

      </p>
      <table>
        <caption>

                Format of Alloc Page

            </caption>
        <tr>
          <th>

                Type

             </th>
          <th>

                Description

            </th>
        </tr>
        <tr>
          <td>

                int

            </td>
          <td>

                FormatId

            </td>
        </tr>
        <tr>
          <td>StoredPageHeader</td>
          <td>see <a href="#storedpage">Stored Page Header</a></td>
        </tr>
        <tr>
          <td>long</td>
          <td>nextAllocPageNumber - the next allocation page's number</td>
        </tr>
        <tr>
          <td>long</td>
          <td>nextAllocPageOffset - the file offset of the next allocation page</td>
        </tr>
        <tr>
          <td>long</td>
          <td>reserved1 - reserved for future usage</td>
        </tr>
        <tr>
          <td>long</td>
          <td>reserved2 - reserved for future usage</td>
        </tr>
        <tr>
          <td>long</td>
          <td>reserved3 - reserved for future usage</td>
        </tr>
        <tr>
          <td>long</td>
          <td>reserved4 - reserved for future usage</td>
        </tr>
        <tr>
          <td>byte</td>
          <td>N - the size of the borrowed container info</td>
        </tr>
        <tr>
          <td>byte[N]</td>
          <td>containerInfo - the content of the borrowed container info</td>
        </tr>
        <tr>
          <td>AllocExtent</td>
          <td>The one and only extent on this alloc page.</td>
        </tr>
      </table>
      <p>

	The allocation page contains allocation extent rows.  In this first cut

	implementation, there is only 1 allocation extent row per allocation page.

	</p>
      <p>

	The allocation extent row is an externalizable object and is directly

	written on to the page by the alloc page.  In other words, it will not be

	converted in to a storeableRow.  This is to cut down overhead, enhance

	performance and gives more control of the size and layout of the allocation

	extent row to the alloc page.

	</p>
      <section>
        <title>

	Alloc Page detailed implementation notes</title>
        <p>

	Create Container - an embryonic allocation page is formatted on disk by a

	special static function to avoid instantiating a full AllocPage object.

	This embryonic allocation has enough information that it can find the

	file header and not much else.  Then the allocation page is properly

	initialized by creating the first extent.

      </p>
        <p>

	Open Container - A static AllocPage method will be used to read off the

	container information directly from disk.  Even if

	the first alloc page (page 0) is already in the page cache, it will not be

	used because cleaning the alloc page will introduce a deadlock if the

	container is not in the container cache.  Long term, the first alloc page

	should probably live in the container cache rather than in the page cache.

      </p>
        <p>

	Get Page - The first alloc page (page 0) will be read into the page cache.

	Continue to follow the alloc page chain until the alloc page that manages

	the specified page is found.  From the alloc page, the physical offset of

	the specified page is located.

      </p>
        <p>

	Cleaning alloc page - the alloc page is written out the same way any page

	is written out.  The container object will provide a call back to the alloc

	page to write the current version of the container object back into the

	borrowed space before the alloc page itself is written out.

	</p>
        <p>

	Cleaning the container object - get the the first alloc page, dirty it and

	clean it (which will cause it to call the container object to write itself

	out into the borrowed space).  The versioning of the container is

	independent of the versioning of the alloc page.  The container version is

	stored inside the borrowed space and is opaque to the alloc page.

	</p>
        <p>For the fields in an allocation extent row.</p>
      </section>
    </section>
    <section>
      <title>Allocation Extent</title>
      <p>

	An allocation extent row manages the page status of page in the extent.

	AllocExtent is externalizable and is written to the AllocPage directly,

	without being converted to a row first.

	</p>
      <table>
        <caption>Format of Allocation Extent</caption>
        <tr>
          <th>Type</th>
          <th>Description</th>
        </tr>
        <tr>
          <td>long</td>
          <td>extentOffset - the begin physical byte offset of the first page of this extent</td>
        </tr>
        <tr>
          <td>long</td>
          <td>extentStart - the first logical page mananged by this extent.</td>
        </tr>
        <tr>
          <td>long</td>
          <td>extentEnd - the last page this extent can ever hope to manage.</td>
        </tr>
        <tr>
          <td>int</td>
          <td>extentLength - the number of pages allocated in this extent</td>
        </tr>
        <tr>
          <td>int</td>
          <td>
            <p>extentStatus - status bits for the whole extent.

				<br/>HAS_DEALLOCATED - most likely, this extent has a deallocated 

                        page somewhere. If !HAD_DEALLOCATED, the extent has no deallocated page.

				<br/>HAS_FREE - most likely, this extent has a free page somewhere.

						If !HAS_FREE, there is no free page in the extent.

				<br/>ALL_FREE - most likely, this extent only has free pages, good 

                        candidate for shrinking the file.

						If !ALL_FREE, the extent is not all free.

				<br/>HAS_UNFILLED_PAGES - most likely, this extent has unfilled pages.

						if !HAS_UNFILLED_PAGES, all pages are filled.

				<br/>KEEP_UNFILLED_PAGES - this extent keeps track of unfilled pages

						(post v1.3).  If not set, this extent has no notion of

						unfilled page and has no unFilledPage bitmap.

				<br/>NO_DEALLOC_PAGE_MAP - this extents do not have a dealloc and a

						free page bit maps.  Prior to 2.0, there are 2 bit

						maps, a deallocate page bit map and a free page bit

						map.  Cloudscape 2.0 and later merged the dealloc page

						bit map into the free page bit map.

				<br/>RETIRED - this extent contains only 'retired' pages, never use 

                        any page from this extent.  The pages don't actually 

                        exist, i.e., it maps to nothing (physicalOffset is 

                        garbage).  The purpose of this extent is to blot out a 

                        range of logical page numbers that no longer exists 

                        for this container.  Use this to reuse a physical page

                        when a logical page has exhausted all recordId or for

                        logical pages that has been shrunk out.

               </p>
          </td>
        </tr>
        <tr>
          <td>int</td>
          <td>preAllocLength - the number of pages that have been preallocated</td>
        </tr>
        <tr>
          <td>int</td>
          <td>reserved1</td>
        </tr>
        <tr>
          <td>long</td>
          <td>reserved2 - reserved for future use</td>
        </tr>
        <tr>
          <td>long</td>
          <td>reserved3 - reserved for future use</td>
        </tr>
        <tr>
          <td>FreePages(bit)</td>
          <td>Bitmap of free pages. Bit[i] is ON if page is free for immediate (re)use.</td>
        </tr>
        <tr>
          <td>unFilledPages(bit)</td>
          <td>Bitmap of pages that have free space. Bit[i] is ON if page is likely to be &lt; 1/2 full.</td>
        </tr>
      </table>
      <p>

		org.apache.derby.iapi.services.io.FormatableBitSet is used to store the bit map.  

            FormatableBitSet is an externalizable class.

         </p>
      <p>

	A page can have the following logical state:

	<br/>Free - a page that is free to be used

	<br/>Valid - a page that is currently in use

      </p>
      <p>

	There is another type of transitional pages which pages that have been

	allocated on disk but has not yet been used.  These pages are Free.

	</p>
      <p>

	Bit[K] freePages

		Bit[i] is ON iff page i maybe free for reuse.  User must get the

		dealloc page lock on the free page to make sure the transaction.

	</p>
      <p>

	K is the size of the bit array, it must be >= length.

      </p>
    </section>
  </body>
  <footer>
    <legal></legal>
  </footer>
</document>

container-format.aart
Description: Binary data

Re: Derby architecture/design documents

Reply via email to