gaodayue opened a new issue #2892: [Proposal] Unified and Extensible Page Format for segment_v2
URL: https://github.com/apache/incubator-doris/issues/2892
 
 
   ## Motivation
   There are 4 kinds of pages in segment_v2
   1. **data page**: for column data, terms and bitmaps of bitmap index, bloom 
filters of bf index
   2. **index page**: for ordinal and value index of IndexedColumn
   3. **short key page**: only for short key index
   4. **simple page**: for dictionary page, ordinal index page, and zonemap 
index page
   
   The format of each page type is described below
   
   ```
   DataPageHeader := FirstRowId(vint), NumRows(vint)
   NullMap := NullMapSize(vint), Byte^NullMapSize
   IndexEntry := KeySize(vint32), Byte^KeySize, PagePointer
   
   DataPage := DataPageHeader, [NullMap,] EncodedValue, [UncompressedSize(4),] Checksum(4)
   - NullMap is present only when the column is nullable
   - UncompressedSize is present only when CompressionTypePB != NO_COMPRESSION
   - EncodedValue is the value data encoded by PageBuilder according to EncodingTypePB
   
   IndexPage := IndexEntry^NumEntry, IndexPageFooterPB, FooterSize(4), [UncompressedSize(4),] Checksum(4)
   
   ShortKeyPage := EncodedKeys, EncodedOffsets, ShortKeyFooterPB, FooterSize(4), Checksum(4)
   
   SimplePage := EncodedValue, [UncompressedSize(4),] Checksum(4)
   ```
   
   We observed the following problems with the current design
   
   * For data page
     * We can't add new fields to DataPageHeader as needed in the future
     * It's SIMD-unfriendly: SIMD instructions usually have alignment requirements, but it's hard to make EncodedValue start at an aligned address because it follows a variable-length header and an optional NullMap. We should move EncodedValue to the beginning of the page so that we can use SIMD to speed up value decoding in the future
     * NullMap is always written for a nullable column, even when the page doesn't contain any NULLs
   * For simple page (like dictionary), it's difficult to extend the format without introducing a new page type
   * All these page types are implemented independently, so little code is shared between them
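
To make the alignment problem concrete, here is a minimal sketch that computes the size of the old `DataPageHeader` for a few pages. It assumes that `vint` is the common LEB128-style varint (7 payload bits per byte); the issue does not spell out the encoding, and `encode_varint` and `old_header_size` are hypothetical helper names.

```python
def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a LEB128-style varint (assumed vint format)."""
    if value < 0:
        raise ValueError("this sketch only covers non-negative values")
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def old_header_size(first_row_id: int, num_rows: int) -> int:
    """Size in bytes of DataPageHeader := FirstRowId(vint), NumRows(vint)."""
    return len(encode_varint(first_row_id)) + len(encode_varint(num_rows))


# The header length differs from page to page, so the offset of whatever
# follows the header is not fixed.
for first_row_id, num_rows in [(0, 100), (1000, 1024), (5_000_000, 65536)]:
    print(first_row_id, num_rows, "-> header bytes:",
          old_header_size(first_row_id, num_rows))
```

With these inputs the header takes 2, 4, and 7 bytes respectively, and a nullable column adds a variable-size NullMap on top, so the start of EncodedValue lands at a different offset in each page.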
   
   To solve these problems, a new unified and extensible page format is proposed. The goals are to
   
   * make it easy to extend an existing page type
   * make it easy to add a new page type through code reuse
   * make it possible to use SIMD to speed up page decoding
   
   ## Design of the New Unified Page Format
   
   First of all, all pages share the same general layout as below.
   ```
   General Page Layout := PageContent, PageFooterPB, FooterSize(4), Checksum(4)
   PageContent:= DataPageContent | IndexPageContent | DictPageContent | 
ShortKeyPageContent
   ```
   
   Page metadata is recorded in a footer instead of a header. Protobuf is used to encode the page footer so that existing page types can be extended in the future. Only PageContent is compressed, so page metadata can be read without decompressing the page content. This is useful when writing tools that dump page statistics for segment_v2 files.
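
The general layout can be read back to front. The sketch below is illustrative only: it assumes that the 4-byte fields are little-endian, that the checksum covers every byte before it, and it uses `zlib.crc32` as a stand-in because the issue does not name the checksum algorithm. The footer is treated as opaque bytes rather than a parsed `PageFooterPB`, and `build_page` and `split_page` are hypothetical names.

```python
import struct
import zlib
from typing import Tuple


def build_page(content: bytes, footer: bytes) -> bytes:
    """Assemble PageContent, PageFooterPB, FooterSize(4), Checksum(4)."""
    body = content + footer + struct.pack("<I", len(footer))
    checksum = zlib.crc32(body) & 0xFFFFFFFF  # stand-in checksum (assumption)
    return body + struct.pack("<I", checksum)


def split_page(page: bytes) -> Tuple[bytes, bytes]:
    """Return (content, footer_bytes) after verifying the trailing checksum."""
    if len(page) < 8:
        raise ValueError("page too short to hold FooterSize and Checksum")
    (checksum,) = struct.unpack_from("<I", page, len(page) - 4)
    if zlib.crc32(page[:-4]) & 0xFFFFFFFF != checksum:
        raise ValueError("checksum mismatch")
    (footer_size,) = struct.unpack_from("<I", page, len(page) - 8)
    footer_end = len(page) - 8
    footer_start = footer_end - footer_size
    if footer_start < 0:
        raise ValueError("footer size exceeds page size")
    # The content is returned untouched: the footer can be inspected
    # without decompressing it.
    return page[:footer_start], page[footer_start:footer_end]
```

Because the footer is located from the end of the page, a statistics tool built along these lines would only need to parse the footer bytes and could skip the (possibly compressed) content entirely.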
   
   PageFooterPB is defined as
   ```protobuf
   enum PageTypePB {
       DATA_PAGE = 0;
       INDEX_PAGE = 1;
       DICTIONARY_PAGE = 2;
       SHORT_KEY_PAGE = 3;
   }
   
   message PageFooterPB {
       // required: indicates which of the *_footer fields is set
       optional PageTypePB type = 1;
       // required: page body size before compression (excluding footer and crc).
       // the page body is stored uncompressed when this equals the on-disk page body size
       optional uint32 uncompressed_size = 2;
       // present only when type == DATA_PAGE
       optional DataPageFooterPB data_page_footer = 8;
       // present only when type == INDEX_PAGE
       optional IndexPageFooterPB index_page_footer = 9;
       // present only when type == DICTIONARY_PAGE
       optional DictPageFooterPB dict_page_footer = 10;
       // present only when type == SHORT_KEY_PAGE
       optional ShortKeyFooterPB short_key_page_footer = 11;
   }
   ```
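
The `uncompressed_size` rule in the comment above can be sketched as follows. This is an assumption-laden illustration: `FooterStub` is a hypothetical stand-in for the parsed footer, `zlib` stands in for whatever codec the column actually uses, and the "store raw unless compression strictly shrinks the body" policy is chosen here so that equal sizes always mean "uncompressed".

```python
import zlib
from dataclasses import dataclass
from typing import Tuple


@dataclass
class FooterStub:
    """Hypothetical stand-in for PageFooterPB; only one field is modelled."""
    uncompressed_size: int


def write_body(content: bytes) -> Tuple[bytes, FooterStub]:
    """Compress the content only when that strictly reduces its size."""
    compressed = zlib.compress(content)
    body = compressed if len(compressed) < len(content) else content
    return body, FooterStub(uncompressed_size=len(content))


def read_body(body: bytes, footer: FooterStub) -> bytes:
    """Equal sizes mean the body was stored uncompressed."""
    if footer.uncompressed_size == len(body):
        return body
    content = zlib.decompress(body)
    if len(content) != footer.uncompressed_size:
        raise ValueError("decompressed size does not match the footer")
    return content
```

A reader written this way needs no separate "is compressed" flag: comparing the stored body length with `uncompressed_size` is enough.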
   
   Each page type has an entry in PageTypePB and a custom footer defined as a field in PageFooterPB. This makes adding a new page type in the future straightforward.
   
   Now we describe the page content and custom footer of each page type.
   
   First, **DataPageContent** is composed of encoded values and nullmap.
   
   ```
   DataPageContent := EncodedValue, Byte^nullmap_size
   - nullmap_size is recorded in DataPageFooterPB
   ```
   ```protobuf
   message DataPageFooterPB {
       // required: ordinal of the first value
       optional uint64 first_ordinal = 1;
       // required: number of values, including NULLs
       optional uint64 num_values = 2;
       // required: size of nullmap, 0 if the page doesn't contain NULL
       optional uint32 nullmap_size = 3;
       // only for array column, largest array item ordinal + 1,
       // used to calculate the length of last array in this page
       optional uint64 next_array_item_ordinal = 4;
   }
   ```
   
   Note that `first_ordinal` and `num_values` are changed from uint32 to uint64 because a page that stores array items may, in theory, hold more than 2^32 elements. Also note that an array column needs an additional `next_array_item_ordinal` field to calculate the length of the last array in that page.
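
As a rough illustration of how a reader might use these footer fields, the sketch below splits `DataPageContent` with `nullmap_size` and derives array lengths with `next_array_item_ordinal`. It assumes that an array page stores, for each array, the ordinal of its first item; the issue does not state this, and both function names are hypothetical.

```python
from typing import List, Tuple


def split_data_page_content(content: bytes, nullmap_size: int) -> Tuple[bytes, bytes]:
    """DataPageContent := EncodedValue, Byte^nullmap_size."""
    if nullmap_size < 0 or nullmap_size > len(content):
        raise ValueError("nullmap_size is out of range for this content")
    split = len(content) - nullmap_size
    # EncodedValue starts at offset 0; the nullmap (possibly empty) trails it.
    return content[:split], content[split:]


def array_lengths(first_item_ordinals: List[int],
                  next_array_item_ordinal: int) -> List[int]:
    """Length of each array, given the ordinal of its first item (assumed layout)."""
    bounds = first_item_ordinals + [next_array_item_ordinal]
    return [bounds[i + 1] - bounds[i] for i in range(len(first_item_ordinals))]
```

For example, `array_lengths([10, 13, 13], 20)` gives `[3, 0, 7]`: without `next_array_item_ordinal` the length of the last array could not be computed from this page alone.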
   
   Second, **IndexPageContent** is composed of index entries.
   ```
   IndexEntry := KeySize(vint32), Byte^KeySize, PagePointer
   IndexPageContent := IndexEntry^num_entries
   ```
   ```protobuf
   // same as before
   message IndexPageFooterPB {
     // required: number of index entries in this page
     optional int32 num_entries = 1;
   
     enum Type {
       UNKNOWN_INDEX_PAGE_TYPE = 0;
       LEAF = 1;
       INTERNAL = 2;
     };
     // required: type of the index page
     optional Type type = 2;
   }
   ```
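
A minimal sketch of walking `IndexPageContent` is shown below. The issue does not define how `PagePointer` is serialized, so this sketch assumes it is an `(offset, size)` pair of varints, and it assumes `vint32` is a LEB128-style varint; all helper names are hypothetical.

```python
from typing import List, Tuple


def encode_varint(value: int) -> bytes:
    """LEB128-style varint for non-negative integers (assumed vint format)."""
    out = bytearray()
    while value >= 0x80:
        out.append((value & 0x7F) | 0x80)
        value >>= 7
    out.append(value)
    return bytes(out)


def decode_varint(buf: bytes, pos: int) -> Tuple[int, int]:
    """Return (value, position just past the varint)."""
    result, shift = 0, 0
    while True:
        if pos >= len(buf):
            raise ValueError("truncated varint")
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def encode_entry(key: bytes, offset: int, size: int) -> bytes:
    """IndexEntry := KeySize(vint32), Byte^KeySize, PagePointer (assumed offset, size)."""
    return encode_varint(len(key)) + key + encode_varint(offset) + encode_varint(size)


def decode_entries(content: bytes, num_entries: int) -> List[Tuple[bytes, int, int]]:
    """Decode IndexEntry^num_entries; num_entries comes from IndexPageFooterPB."""
    entries, pos = [], 0
    for _ in range(num_entries):
        key_size, pos = decode_varint(content, pos)
        key = content[pos:pos + key_size]
        if len(key) != key_size:
            raise ValueError("truncated key")
        pos += key_size
        offset, pos = decode_varint(content, pos)
        size, pos = decode_varint(content, pos)
        entries.append((key, offset, size))
    if pos != len(content):
        raise ValueError("trailing bytes after the last index entry")
    return entries
```

Since the entries are variable-length, `num_entries` from the footer is what tells the reader when to stop.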
   
   Third, **DictPageContent** contains encoded values: the first value has codeword 0, the second value has codeword 1, and so on. The encoding type is recorded in the footer.
   
   ```protobuf
   message DictPageFooterPB {
       // required: encoding for dictionary
       optional EncodingTypePB encoding = 1;
   }
   ```
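
The codeword rule (position in the dictionary page equals the codeword) can be illustrated with plain Python lists. This is only a sketch of the mapping; it ignores the on-disk `EncodingTypePB` encoding, and the function names are hypothetical.

```python
from typing import Dict, List, Tuple


def dict_encode(values: List[str]) -> Tuple[List[str], List[int]]:
    """Build a dictionary in first-seen order and replace each value by its codeword."""
    dictionary: List[str] = []
    codeword_of: Dict[str, int] = {}
    codes: List[int] = []
    for value in values:
        if value not in codeword_of:
            codeword_of[value] = len(dictionary)  # next free codeword
            dictionary.append(value)
        codes.append(codeword_of[value])
    return dictionary, codes


def dict_decode(dictionary: List[str], codes: List[int]) -> List[str]:
    """Look each codeword up by its position in the dictionary."""
    for code in codes:
        if not 0 <= code < len(dictionary):
            raise ValueError("codeword out of range")
    return [dictionary[code] for code in codes]
```

For the input `["cn", "us", "cn", "de", "us"]` this yields the dictionary `["cn", "us", "de"]` and the codes `[0, 1, 0, 2, 1]`.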
   
   Finally, **ShortKeyPageContent** and ShortKeyFooterPB are the same as before.
   ```protobuf
   // same as before
   message ShortKeyFooterPB {
       // Number of index items in this index
       optional uint32 num_items = 1;
       // Total bytes occupied by the index keys
       optional uint32 key_bytes = 2;
       // Total bytes occupied by the key offsets
       optional uint32 offset_bytes = 3;
       // Id of the segment this index belongs to
       optional uint32 segment_id = 4;
       // Number of rows in each block
       optional uint32 num_rows_per_block = 5;
       // Number of rows in this segment
       optional uint32 num_segment_rows = 6;
       // Total bytes of this segment
       optional uint32 segment_bytes = 7;
   }
   ```
   
   ### Alternative to SimplePage
   
   Previously we used SimplePage for the dictionary, the ordinal index of normal columns, and the zonemap index. The new design has no SimplePage because it is not extensible. This section describes what replaces it.
   
   The dictionary is stored in DictionaryPage, as described above.
   
   For the ordinal index, we previously had two formats: a normal column's ordinal index used SimplePage, while an IndexedColumn's ordinal index used IndexPage. Both can be unified into IndexPage because it supports B-Tree style multi-level indexing.
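
To show what multi-level lookup over index pages might look like for an ordinal index, here is a small in-memory sketch. It assumes each entry's key is the first ordinal covered by the page it points to, and it models child pointers as Python objects instead of `PagePointer`s; `IndexPageStub` and `find_data_page` are hypothetical names.

```python
import bisect
from dataclasses import dataclass
from typing import List


@dataclass
class IndexPageStub:
    """In-memory stand-in for an IndexPage (LEAF or INTERNAL)."""
    is_leaf: bool
    keys: List[int]        # first ordinal covered by each child (assumption)
    children: List[object]  # IndexPageStub for INTERNAL, data page id for LEAF


def find_data_page(root: IndexPageStub, ordinal: int) -> object:
    """Descend from the root to the data page that contains `ordinal`."""
    page = root
    while True:
        # Last entry whose first ordinal is <= the target ordinal.
        slot = bisect.bisect_right(page.keys, ordinal) - 1
        if slot < 0:
            raise KeyError("ordinal precedes the first indexed page")
        child = page.children[slot]
        if page.is_leaf:
            return child
        page = child
```

A single-level index is the special case where the root is itself a LEAF page, so the same lookup would cover both normal columns and IndexedColumns.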
   
   The zonemap index is stored in one IndexedColumn with an ordinal index. When all zonemaps fit in one page, they are essentially stored in a single data page, with each value being a serialized ZoneMapPB message.
   
   ## Important notice
   The new design is incompatible with the old one. Users need to reload data for V2 tables.
