Re: Derby architecture/design documents

Mike Matrigali 1 Feb 2005 19:21:19 -0000

answers/comments embedded below


Dibyendu Majumdar wrote:

> Jean,
> 
> Attached is an initial document on Derby page formats. I used forrest 0.6,
> so it should be easy to integrate.
> 
> Is it possible to put this up without publishing it - so that others can
> review and comment on it?
> 
> BTW, it is not complete ....
> 
> Regards
> 
> Dibyendu
> 
> 
> ------------------------------------------------------------------------
> 
> <?xml version="1.0"?>
> <!DOCTYPE document PUBLIC "-//APACHE//DTD Documentation V2.0//EN" 
> "http://forrest.apache.org/dtd/document-v20.dtd";>
> <document> 
>   <header> 
>     <title>Derby On Disk Page Format</title>
>     <abstract>This document describes the storage format of Derby disk pages. 
>     </abstract>
>   </header>
>   <body>
>     <section id="introduction"> 
>       <title> Inroduction </title>
>       <p>Derby stores table and index data in Containers, which currently map 
>         to files in the 
>         <code>seg0</code>
>         directory of the database. Data is stored in pages within the 
> container.</p>
>       <fixme author="Dibyendu Majumdar"> Do all containers map to a single 
> file, or does each container map 
>         to a file? </fixme>
In the current derby implementation there is a a 1 to 1 mapping of
containers to files.  2 containers never map to a single file and 1
container never maps to multiple containers.
>       <p>A page contains a set of records, which can be accessed by "slot", 
> which 
>         defines the order of the records on the page, or by "id" which 
> defines 
>         the identity of the records on the page. Clients access records by 
> both 
>         slot and id, depending on their needs.</p>
>       <p>There are two types of pages - Raw Stored Pages which hold data, and 
>         Raw Stored Alloc Pages which hold page allocation information.</p>
I usually describe the system as having 3 types of pages, though the
Header Page is actually just a specialized version of the Alloc Page.
It looks like you include this later - up to you on how best to describe
the distinction.  I know I always draw the following kind of picture

--------
| header |
----------
| data    |
-----------
| data    |
-----------
| ...     |
-----------
| alloc    |
-----------
| data |

Header Page - This is currently always page 0 of the container.  It
contains information that raw store needs to maintain about the
container once per container, it is currently implemented as an Alloc
Page which "borrows" space from the alloc page for it's information.
The original decision was that we did not want to waste a whole page for
hdr information, so we used part of it and then put the 1st allocation
map on the 2nd half of it.  See AllocPage.java for info about layout and
borrowing.

Allocation Page - After page 0, all subsequent Allocation pages only
have allocation bit maps.

Data page - as you describe
>       <p>A Table or a BTree index provides a row-based access mechanism 
> (row-based 
>         access interface is known as conglomerate). Rows are mapped to 
> records 
>         in pages, in case of a table, a single row can span multiple records 
> in 
>         multiple pages.</p>
>     </section>
>     <section id="storedpage"> 
>       <title>Data Page Format</title>
>       <p>A data page is broken into five sections. 
>         <img src="page-format.png" alt=""/>
>       </p>
>       <section id="formatid"> 
>         <title>Format Id </title>
>         <p> The formatId is a 4 bytes array, it contains the format Id of 
> this 
>           page. The possible values are RAW_STORE_STORED_PAGE or 
> RAW_STORE_ALLOC_PAGE.</p>
>       </section>
>       <section id="pageheader"> 
>         <title> Page Header </title>
>         <p> The page header is a fixed size, 56 bytes. </p>
>           <table> 
>             <tr> 
>               <th>Size</th>
>               <th>Type</th>
>               <th>Description</th>
>             </tr>
>             <tr> 
>               <td>1 byte</td>
>               <td>boolean</td>
>               <td>is page an overflow page</td>
>             </tr>
>             <tr> 
>               <td>1 byte</td>
>               <td>byte</td>
>               <td><p>page status is either VALID_PAGE or INVALID_PAGE(a field 
>                   maintained in base page)</p>
>                 <p>page goes thru the following transition: 
>                   <br/>
>                   VALID_PAGE &lt;-&gt; deallocated page -&gt; free page 
> &lt;-&gt; 
>                   VALID_PAGE</p>
>                 <p>deallocated and free page are both INVALID_PAGE as far as 
> BasePage 
>                   is concerned. 
>                   <br/>
>                   When a page is deallocated, it transitioned from VALID_PAGE 
>                   to INVALID_PAGE. 
>                   <br/>
>                   When a page is allocated, it trnasitioned from INVALID_PAGE 
>                   to VALID_PAGE.</p></td>
>             </tr>
>             <tr> 
>               <td>8 bytes</td>
>               <td>long</td>
>               <td>pageVersion (a field maintained in base page)</td>
>             </tr>
>             <tr> 
>               <td>2 bytes</td>
>               <td>unsigned short</td>
>               <td>number of slots in slot offset table</td>
>             </tr>
>             <tr> 
>               <td>4 bytes</td>
>               <td>integer</td>
>               <td>next record identifier</td>
>             </tr>
>             <tr> 
>               <td>4 bytes</td>
>               <td>integer</td>
>               <td>generation number of this page (Future Use)</td>
>             </tr>
>             <tr> 
>               <td>4 bytes</td>
>               <td>integer</td>
>               <td>previous generation of this page (Future Use)</td>
>             </tr>
>             <tr> 
>               <td>8 bytes</td>
>               <td>bipLocation</td>
>               <td>the location of the beforeimage page (Future Use)</td>
>             </tr>
>             <tr> 
>               <td>2 bytes</td>
>               <td>unsigned short</td>
>               <td>number of deleted rows on page. (new release 2.0)</td>
>             </tr>
>             <tr> 
>               <td>2 bytes</td>
>               <td>unsigned short</td>
>               <td>% of the page to keep free for updates</td>
>             </tr>
>             <tr> 
>               <td>2 bytes</td>
>               <td>short</td>
>               <td>spare for future use</td>
>             </tr>
>             <tr> 
>               <td>4 bytes</td>
>               <td>long</td>
>               <td>spare for future use (encryption uses to write random bytes 
>                 here).</td>
>             </tr>
>             <tr> 
>               <td>8 bytes</td>
>               <td>long</td>
>               <td>spare for future use</td>
>             </tr>
>             <tr> 
>               <td>8 bytes</td>
>               <td>long</td>
>               <td>spare for future use</td>
>             </tr>
>           </table>
>           <note>Spare space is guaranteed to be writen with "0", so that 
> future 
>             use of field should not either not use "0" as a valid data item 
> or 
>             pick 0 as a valid default value so that on the fly upgrade can 
> assume 
>             that 0 means field was never assigned. </note>
>         
>       </section>
>       <section id="records"> 
>         <title> Records </title>
>         
>           <p>The records section contains zero or more records. Each record 
> starts 
>             with a Record Header</p>
>           <table> 
>             <caption>Record Header</caption>
>             <tr> 
>               <th>Type</th>
>               <th>Description</th>
>             </tr>
>             <tr> 
>               <td>1 byte</td>
>               <td> <p>Status bits for the record header</p>
>                 <table> 
>                   <tr> 
>                     <td>RECORD_INITIAL</td>
>                     <td>used when record header is first initialized</td>
>                   </tr>
>                   <tr> 
>                     <td>RECORD_DELETED</td>
>                     <td>used to indicate the record has been deleted</td>
>                   </tr>
>                   <tr> 
>                     <td>RECORD_OVERFLOW</td>
>                     <td>used to indicate the record has been overflowed, it 
> will 
>                       point to the overflow page and ID</td>
>                   </tr>
>                   <tr> 
>                     <td>RECORD_HAS_FIRST_FIELD</td>
>                     <td>used to indicate that firstField is stored will be 
> stored. 
>                       When RECORD_OVERFLOW and RECORD_HAS_FIRST_FIELD both 
> are 
>                       set, part of record is on the page, the record header 
> also 
>                       stores the overflow point to the next part of the 
> record.</td>
>                   </tr>
>                   <tr> 
>                     <td>RECORD_VALID_MASK</td>
>                     <td>A mask of valid bits that can be set currently, such 
> that 
>                       the following assert can be made: </td>
>                   </tr>
>                 </table></td>
>             </tr>
>             <tr> 
>               <td>compressed int</td>
>               <td>record identifier</td>
>             </tr>
>             <tr> 
>               <td>compressed long</td>
>               <td>overflow page only if RECORD_OVERFLOW is set</td>
>             </tr>
>             <tr> 
>               <td>compressed int</td>
>               <td>overflow id only if RECORD_OVERFLOW is set</td>
>             </tr>
>             <tr> 
>               <td>compressed int</td>
>               <td>first field only if RECORD_HAS_FIRST_FIELD is set - 
> otherwise 
>                 0</td>
>             </tr>
>             <tr> 
>               <td>compressed int</td>
>               <td>number of fields in this portion - only if RECORD_OVERFLOW 
> is 
>                 false OR RECORD_HAS_FIRST_FIELD is true - otherwise 0</td>
>             </tr>
>           </table>
>           <note label="Long Rows"> A row is long if all of it's columns can't 
> fit on a single page. 
>             When storing a long row, the segment of the row which fits on the 
>             page is left there, and a pointer column is added at the end of 
> the 
>             row. It points to another row in the same container on a 
> different 
>             page. That row will contain the next set of columns and a 
> continuation 
>             pointer if necessary. The overflow portion will be on an 
> "overflow" 
>             page, and that page may have overflow portions of other rows on 
> it 
>             (unlike overflow columns). </note>
>           <p>The Record Header is followed by one or more fields. Each field 
> contains 
>             a Field Header and optional Field Data.</p>
>           <table> 
>             <caption>Stored Field Header Format</caption>
>             <tr> 
>               <td>status</td>
>               <td> <p> The status is 1 byte, it indicates the state of the 
> field. 
>                   A FieldHeader can be in the following states: </p>
>                   <table> 
>                     <tr> 
>                       <td>NULL</td>
>                       <td>if the field is NULL, no field data length is 
> stored</td>
>                     </tr>
>                     <tr> 
>                       <td>OVERFLOW</td>
>                       <td>indicates the field has been overflowed to another 
> page. 
>                         overflow page and overflow ID is stored at the end of 
>                         the user data. field data length must be a number 
> greater 
>                         or equal to 0, indicating the length of the field 
> that 
>                         is stored on the current page. The format looks like 
> this: 
>                         <img src="field-header-overflow.png" alt=""/>
>                         overflowPage will be written as compressed long, 
> overflowId 
>                         will be written as compressed Int</td>
>                     </tr>
>                     <tr> 
>                       <td>NONEXISTENT</td>
>                       <td>the field no longer exists, e.g. column has been 
> dropped 
>                         during an alter table</td>
>                     </tr>
>                     <tr> 
>                       <td>EXTENSIBLE</td>
>                       <td>the field is of user defined data type. The field 
> may 
>                         be tagged.</td>
>                     </tr>
>                     <tr> 
>                       <td>TAGGED</td>
>                       <td>the field is TAGGED if and only if it is 
> EXTENSIBLE.</td>
>                     </tr>
>                     <tr> 
>                       <td>FIXED</td>
>                       <td>the field is FIXED if and only if it is used in the 
>                         log records for version 1.2 and higher.</td>
>                     </tr>
>                   </table>
>                 </td>
>             </tr>
>             <tr> 
>               <td>fieldDataLength</td>
>               <td> The fieldDataLength is only set if the field is not NULL. 
> It 
>                 is the length of the field that is stored on the current 
> page. 
>                 The fieldDataLength is a variable length CompressedInt. </td>
>             </tr>
>             <tr> 
>               <td>fieldData</td>
>               <td><p> Overflow page and overflow id are stored as field data. 
> If 
>                 the overflow bit in status is set, the field data is the 
> overflow 
>                 information. When the overflow bit is not set in status, 
> then, 
>                 fieldData is the actually user data for the field. That 
> means, 
>                 field header consists only field status, and field data 
> length. 
>                 <br/>
>                 A non-overflow field: 
>                 <br/> <img src="field-header-non-overflow.png" alt=""/> <br/>
>                 An overflow field: 
>                 <br/> <img src="field-header-overflow.png" alt=""/> <br/> 
> <strong>overflowPage 
>                   and overflowID</strong> <br/>
>                 The overflowPage is a variable length CompressedLong, 
> overflowID 
>                 is a variable Length CompressedInt. They are only stored when 
>                 the field state is OVERFLOW. And they are not stored in the 
> field 
>                 header. Instead, they are stored at the end of the field 
> data. 
>                 The reason we do that is to save a copy if the field has to 
> overflow. </p>
>               </td>
>             </tr>
>           </table>
>           <note label="Long Columns"> A column is long if it can't fit on a 
> single page. A long column 
>             is marked as long in the base row, and it's field contains a 
> pointer 
>             to a chain of other rows in the same container with contain the 
> data 
>             of the row. Each of the subsequent rows is on a page to itself. 
> Each 
>             subsquent row, except for the last piece has 2 columns, the first 
>             is the next segment of the row and the second is the pointer to 
> the 
>             the following segment. The last segment only has the data 
> segment. 
>           </note>
>         
>       </section>
>       <section id="slottable"> 
>         <title>Slot Offset Table</title>
>         <p>The slot offset table is a table of 6 or 12 bytes per record, 
> depending 
>           on the pageSize being less or greater than 64K: </p>
>           <table> 
>             <caption>Slot Table Record</caption>
>             <tr> 
>               <th>Size</th>
>               <th>Content</th>
>             </tr>
>             <tr> 
>               <td>2 bytes (unsigned short) or 4 bytes (int)</td>
>               <td>page offset for the record that is assigned to the slot</td>
>             </tr>
>             <tr> 
>               <td>2 bytes (unsigned short) or 4 bytes (int)</td>
>               <td>the length of the record on this page.</td>
>             </tr>
>             <tr> 
>               <td>2 bytes (unsigned short) or 4 bytes (int)</td>
>               <td>the length of the reserved number of bytes for this record 
> on 
>                 this page.</td>
>             </tr>
>           </table>
>                 <p>
>           First slot is slot 0. The slot table grows backwards. Slots are 
> never 
>           left empty. </p>
>       </section>
>       <section id="checksum"> 
>         <title>Checksum</title>
>         <p>8 bytes of a java.util.zip.CRC32 checksum of the entire's page 
> contents 
>           without the 8 bytes representing the checksum.</p>
>       </section>
>       </section>
>     <section id="allocpage"> 
>       <title>Allocation Page</title>
>       <p> An allocation page of the file container extends a normal Stored 
> page, 
>         with the exception that a hunk of space may be 'borrowed' by the file 
>         container to store the file header.</p>
>       <p> The borrowed space is not visible to the alloc page even though it 
> is 
>         present in the page data array. It is accessed directly by the 
> FileContainer. 
>         Any change made to the borrowed space is not managed or seen by the 
> allocation 
>         page.</p>
>       <p> The reason for having this borrowed space is so that the container 
> header 
>         does not need to have a page of its own. </p>
>       <p> 
>         <strong>Page Format</strong>
>         <br/>
>         An allocation page extends a stored page, the on disk format is 
> different 
>         from a stored page in that N bytes are 'borrowed' by the container 
> and 
>         the page header of an allocation page will be slightly bigger than a 
> normal 
>         stored page. This N bytes are stored between the page header and the 
> record 
>         space.</p>
>       <p> The reason why this N bytes can't simply be a row is because it 
> needs 
>         to be statically accessible by the container object to avoid a 
> chicken 
>         and egg problem of the container object needing to instantiate an 
> alloc 
>         page object before it can be objectified, and an alloc page object 
> needing 
>         to instantiate a container object before it can be objectified. So 
> this 
>         N bytes must be stored outside of the normal record interface yet it 
> must 
>         be settable because only the first alloc page has this borrowed 
> space. 
>         Other (non-first) alloc page have N == 0. 
>         <br/>
>         <img src="alloc-page.png" alt=""/>
>       </p>
>     </section>
>   </body>
>   <footer> 
>     <legal>This is a legal notice, so it is 
>       <strong>important</strong>
>       .</legal>
>   </footer>
> </document>

Re: Derby architecture/design documents

Reply via email to