This is an automated email from the ASF dual-hosted git repository.
jiafengzheng pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/incubator-doris-website.git
The following commit(s) were added to refs/heads/master by this push:
new 39fe56efe8f remove
39fe56efe8f is described below
commit 39fe56efe8f8961190e6f1568bce5d63469ceef2
Author: jiafeng.zhang <[email protected]>
AuthorDate: Fri Jun 10 17:08:16 2022 +0800
remove
remove
---
.../doris-storage-reader-compaction.md | 228 ----------------
.../DorisInternals/doris-storage-struct-design.md | 303 ---------------------
.../DorisInternals/doris-storage-writer-delete.md | 150 ----------
3 files changed, 681 deletions(-)
diff --git a/blogs/en/DorisInternals/doris-storage-reader-compaction.md
b/blogs/en/DorisInternals/doris-storage-reader-compaction.md
deleted file mode 100644
index 7672ad38eb8..00000000000
--- a/blogs/en/DorisInternals/doris-storage-reader-compaction.md
+++ /dev/null
@@ -1,228 +0,0 @@
----
-{
- "title": "Apache Doris storage layer design three reading process,
Compaction process analysis",
- "description": "This article introduces in detail the internal
implementation process of the Doris system during the data writing process, as
well as the implementation process of Doris's conditional deletion of data and
batch deletion by key.",
- "date": "2022-05-20",
- "metaTitle": "Apache Doris storage layer design three reading process,
Compaction process analysis",
- "isArticle": true,
- "language": "en",
- "author": "ApacheDoris",
- "layout": "Article",
- "sidebar": false,
- "categories": "DorisInternals",
-}
----
-
-<!--
-Licensed to the Apache Software Foundation (ASF) under one
-or more contributor license agreements. See the NOTICE file
-distributed with this work for additional information
-regarding copyright ownership. The ASF licenses this file
-to you under the Apache License, Version 2.0 (the
-"License"); you may not use this file except in compliance
-with the License. You may obtain a copy of the License at
-
- http://www.apache.org/licenses/LICENSE-2.0
-
-Unless required by applicable law or agreed to in writing,
-software distributed under the License is distributed on an
-"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
-KIND, either express or implied. See the License for the
-specific language governing permissions and limitations
-under the License.
--->
-
-# Apache Doris Storage Layer Design, Part 3: Read Process and Compaction Process Analysis
-
-## 1 Overall introduction
-
-Doris is an interactive SQL data warehouse based on the MPP architecture, mainly used for near-real-time reporting and multi-dimensional analysis. Doris's efficient import and query performance is inseparable from the sophisticated design of its storage structure.
-
-This article analyzes the implementation principle of the storage layer by reading the code of the Doris BE module, and explains the core technology behind Doris's efficient writing and query capabilities. It covers Doris's columnar storage design, index design, data read/write process, Compaction process, and other functions.
-
-This article describes in detail the read process and the Compaction process of the Doris storage layer.
-
-## 2 Read process
-
-### 2.1 Overall reading process
-
-The read process is the reverse of the write process, but it is relatively complicated, mainly because of the large number of read optimizations. The entire read process is divided into two stages: the init process, and the process of obtaining next_block data blocks. The specific process is shown in the following figure:
-
-
-
-The hierarchical relationship is as follows:
-
-OlapScanner encapsulates the overall read operation of a tablet data;
-
-Reader processes the read parameters, and provides differentiated processing
for reading according to three different models;
-
-CollectIterator contains multiple RowsetReaders in the tablet. These
RowsetReaders have version order. CollectIterator merges these RowsetReaders
into a unified Iterator function and provides a merged comparator;
-
-RowsetReader is responsible for reading a Rowset;
-
-RowwiseIterator provides an Iterator function for unified access to all
Segments in a Rowset. The merge strategy here can use Merge or Union according
to the data sorting;
-
-SegmentIterator corresponds to reading the data of one Segment. Based on the query conditions and the indexes, it computes the row numbers to read, seeks to the corresponding pages, and reads the data. After condition filtering, a bitmap records the accessible rows; BitmapRangeIterator is a separate iterator that walks this bitmap by ranges;
-
-ColumnIterator provides an iterator for uniform access to a column's related
data and indexes. ColumnReader, each IndexReader, etc. correspond to the
reading of specific data and index information.
-
-### 2.2 The main process of reading the Init stage
-
-The execution flow of the initialization phase is as follows:
-
-
-
-#### 2.2.1 OlapScanner query parameter construction
-
-Find the RowsetReader that needs to be read according to the version specified
by the query (depending on the rowset_graph version path map of version
management to obtain the shortest path of the query version range);
-
-1. Set the query information, including \_tablet, read type
reader_type=READER_QUERY, whether to aggregate, \_version (from 0 to the
specified version);
-
-2. Set query condition information, including filter field and is_nulls field;
-
-3. Set the return column information;
-
-4. Set the key_ranges range of the query (the range array of keys, which can
be filtered by short key index);
-
-5. Initialize the Reader object.
-
-#### 2.2.2 Reader's Init process
-
-1. Initialize the conditions query condition object;
-
-2. Initialize the bloomFilter column set (eq, in conditions, columns with
bloomFilter added);
-
-3. Initialize delete_handler. It includes all the deletion information
existing in the tablet, including the version and the corresponding deletion
condition array;
-
-4. Initialize the columns that are passed to the lower layer to be read and
returned, including the return value and the columns in the condition object;
-
-5. Initialize the RowCursor row cursor objects corresponding to the start key and end key of key_ranges;
-
-6. Set up RowsetReader and CollectIterator for the constructed information.
The Rowset object is initialized, and the RowsetReader is added to the
CollectIterator;
-
-7. Call CollectIterator to get the current row (actually the first row here),
start the reading process here, and read it for the first time.
-
-#### 2.2.3 Init process of RowsetReader
-
-Build a SegmentIterator and filter out delete conditions in delete_handler
that are smaller than the current Rowset version;
-
-Build a RowwiseIterator (an aggregate iterator for SegmentIterator), and add
the SegmentIterator to be read to the RowwiseIterator. When all segments are in
overall order, the sequential reading method of union iterator is adopted,
otherwise, the merge iterator method of merged reading is adopted.
-
-#### 2.2.4 Init process of SegmentIterator
-
-1. Initialize the ReadableBlock, the object used to read the current Segment file and actually perform the file reads;
-2. Initialize \_row_bitmap to store the row number filtered by the index,
using the bitmap structure;
-
-3. Build a ColumnIterator, where only the columns need to be read;
-
-If the Column has a BitmapIndex index, initialize the BitmapIndexIterator of
each Column;
-
-Filter data by the SortkeyIndex index. When the query has key_ranges, obtain the row number range of the hit data through key_ranges: for the upper and lower keys of each key_range, find the corresponding row numbers upper_rowid and lower_rowid through the SortkeyIndex index of the Segment, and merge the resulting RowRanges into row_bitmap;
-
-Filter data conditionally by various indexes. Conditions include query
conditions and delete conditions to filter information.
-
-- According to the query conditions, use the bitmap index to filter on the columns in the conditions that have a bitmap index, intersecting the row number list of the matching data with row_bitmap. Because this filtering is exact, the handled conditions are removed from the Condition object.
-- Use the BloomFilter index to filter data according to the equivalent (eq,
in, is) conditions in the query conditions. Here, it will be judged whether the
current condition can hit the Page, and the row number range of this Page will
be intersected with the row_bitmap.
-- Use ZoneMapIndex to filter data according to query conditions and deletion
conditions, and find the pages that meet the conditions by intersecting the
index of each Page in ZoneMap. The range of row numbers matched by the
ZoneMapIndex index is intersected with row_bitmap.
-
-Use row_bitmap to construct a BitmapRangeIterator iterator for subsequent reading of data.
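The range-walking idea behind BitmapRangeIterator can be sketched as follows. This is a minimal illustrative Python sketch, not Doris code; a sorted set of row numbers stands in for the roaring bitmap:

```python
def next_ranges(row_ids):
    """Yield half-open (range_from, range_to) spans of consecutive row
    numbers, mimicking range-wise iteration over the row bitmap."""
    row_ids = sorted(row_ids)
    if not row_ids:
        return
    start = prev = row_ids[0]
    for r in row_ids[1:]:
        if r == prev + 1:
            prev = r          # extend the current contiguous run
        else:
            yield (start, prev + 1)
            start = prev = r  # begin a new run
    yield (start, prev + 1)

# e.g. rows {1,2,3,7,8,10} are read as three contiguous ranges
ranges = list(next_ranges({1, 2, 3, 7, 8, 10}))
```

Reading by contiguous ranges lets the segment reader fetch runs of rows in one pass instead of row by row.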
-
-### 2.3 The main process of reading the next stage
-
-The execution flow of the next stage is as follows:
-
-
-
-#### 2.3.1 Reader reads next_row_with_aggregation
-
-The Reader prefetches one row and records it as the current row. When next is called to return a result, the current row is returned, and then the next row is prefetched as the new current row.
-
-The Reader's reads divide into three cases according to the table's data model:
-
-Under \_dup_key_next_row reading (duplicate key model), the current row is returned, and then CollectIterator's next is read directly as the new current row;
-
-Under \_agg_key_next_row reading (aggregate key model), after CollectIterator reads next, it is determined whether the next row has the same key as the current row: if so, the values are aggregated and the next row is read in a loop; if not, the accumulated aggregation result is returned and the current row is updated;
-
-Under \_unique_key_next_row reading (unique key model), the logic is the same as for the aggregate key model, with one difference: because the delete operation is supported, it checks whether the aggregated current row is marked as deleted; deleted rows are discarded, and reading continues until a non-deleted row is found and returned.
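The aggregate-key path above can be sketched in a few lines of Python. This is an illustrative sketch only (summation stands in for the column's real aggregation function, and rows are simplified to (key, value) pairs):

```python
def agg_key_rows(rows):
    """Aggregate runs of consecutive rows that share the same key,
    mirroring the _agg_key_next_row loop described above."""
    it = iter(rows)
    cur = next(it, None)
    while cur is not None:
        key, val = cur
        nxt = next(it, None)
        while nxt is not None and nxt[0] == key:
            val += nxt[1]          # aggregate the next row into the current one
            nxt = next(it, None)
        yield (key, val)           # key changed: emit the accumulated result
        cur = nxt
```

The unique-key model follows the same loop but would additionally drop rows whose aggregated result carries a delete mark.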
-
-#### 2.3.2 CollectIterator reads next
-
-CollectIterator uses the heap data structure to maintain the set of
RowsetReaders to be read. The comparison rules are as follows: According to the
order of the keys of the current row of each RowsetReader, when the keys are
the same, compare the version of the Rowset.
-
-CollectIterator pops the RowsetReader at the top of the heap;
-
-Read the next new row for the RowsetReader just popped out as the current row
of the RowsetReader and put it into the heap for comparison. During the reading
process, the nextBlock of RowsetReader is called to read by RowBlock. (If the
currently fetched block is a partially deleted page, the current row is also
filtered according to the deletion condition.)
-
-Get the current row of the RowsetReader at the top of the queue and return it
as the current row.
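The heap-based merge across RowsetReaders can be sketched as a k-way merge. This is an illustrative Python sketch, not Doris code; each reader is reduced to a stream of keys, and the reader's index stands in for the Rowset version used as the tie-breaker:

```python
import heapq

def collect_merge(readers):
    """k-way merge of per-RowsetReader row streams, ordered by
    (current row key, version), as CollectIterator's heap does."""
    heap = []
    for version, rows in enumerate(readers):
        it = iter(rows)
        row = next(it, None)
        if row is not None:
            heapq.heappush(heap, (row, version, it))
    while heap:
        row, version, it = heapq.heappop(heap)   # reader at the top of the heap
        yield row
        nxt = next(it, None)                     # advance that reader and re-push
        if nxt is not None:
            heapq.heappush(heap, (nxt, version, it))
```

Each pop returns the globally next row; the popped reader advances one row and re-enters the heap, exactly the cycle described above.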
-
-#### 2.3.3 RowsetReader reads next
-
-RowsetReader directly reads next_batch of RowwiseIterator;
-
-RowwiseIterator integrates the SegmentIterators. When the Segments in the Rowset are ordered overall, it returns rows directly in Union mode; when they are out of order, it returns them in Merge mode. RowwiseIterator returns the row data of the SegmentIterator at the top of the merge, calling each SegmentIterator's next_batch to get the data.
-
-#### 2.3.4 SegmentIterator reads next_batch
-
-According to the BitmapRangeIterator constructed in the init phase, use next_range to take out a row number range range_from, range_to to be read each time;
-
-First read the data of the condition column from range_from to range_to row.
The process is as follows:
-
-Call seek_to_ordinal on each conditional column's ColumnIterator; the read position current_rowid of each column is positioned to the SegmentIterator's cur_rowid, aligning to the corresponding data page by binary search over the ordinal_index.
-
-Read the data of the condition columns and filter once more by the conditions (this time an exact filter).
-
-Then read the data of the non-condition columns, put them into the RowBlock, and return the RowBlock.
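The two-phase pattern above (read condition columns, filter exactly, then materialize the remaining columns only for surviving rows) can be sketched as follows. This is an illustrative Python sketch with in-memory column lists, not Doris's actual column readers:

```python
def read_batch(columns, cond_cols, pred, row_from, row_to):
    """Two-phase read of one row range.

    columns:   dict of column name -> list of values (stand-in for pages)
    cond_cols: names of the condition columns read in phase 1
    pred:      exact filter applied to the condition-column values
    """
    rows = range(row_from, row_to)
    # Phase 1: read only the condition columns and apply the exact filter.
    survivors = [r for r in rows
                 if pred({c: columns[c][r] for c in cond_cols})]
    # Phase 2: materialize all columns, but only for surviving rows.
    return [{c: columns[c][r] for c in columns} for r in survivors]
```

Deferring the non-condition columns to phase 2 is what makes the delayed-materialization optimization pay off when the filter is selective.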
-
-## 3 Compaction process
-
-### 3.1 Overall Introduction of Compaction
-
-Doris improves the read performance of incrementally written Rowset files through Compaction. In a Rowset's version information, two fields, first and second, represent the merged version range of the Rowset. For an unmerged cumulative rowset, first equals second. During Compaction, adjacent Rowsets are merged to generate a new Rowset, and the first and second of the version information are also merged into a larger version range. On the other hand, [...]
-
-
-
-As shown in the figure above, there are two types of Compaction tasks, base
compaction and cumulative compaction. The cumulative_point is the key to
dividing the two strategies.
-
-It can be understood in this way that the right side of cumulative_point is
the incremental Rowset that has never been merged, and the first and second
versions of each Rowset are equal; the left side of cumulative_point is the
merged Rowset, and the first version is not equal to the second version. The
base compaction and cumulative compaction task processes are basically the
same, and the difference is only in the logic of selecting the InputRowset to
be merged.
-
-### 3.2 Detailed process of Compaction
-
-The overall process of Compaction merger is shown in the following figure:
-
-
-
-#### 3.2.1 Calculate cumulative_point
-
-Select the set of InputRowsets that need to be merged for compaction:
-
-Base compaction selection conditions:
-
-1. When there are more than 5 non-cumulative rowsets, merge all non-cumulative
rowsets;
-
-2. When the ratio of the base rowset (whose version first is 0) to the other non-cumulative rowsets is less than 10:3, merge all non-cumulative rowsets;
-
-3. In other cases, no merge is performed.
-
-Selection criteria for cumulative compaction:
-
-1. When the number of segments in the selected Rowset set is greater than or equal to 5 and less than or equal to 1000 (configurable), merge;
-2. When the number of output Rowsets is less than 5, but the delete condition version is greater than the second version of a Rowset, merge (so that deleted Rowsets are merged in quickly);
-3. When both the accumulated base compaction and cumulative compaction times are greater than 1 day, merge;
-4. In other cases, no merge is performed.
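The segment-count trigger for cumulative compaction can be sketched as follows. This is an illustrative Python sketch under simplified assumptions (only the segment-count rule is modeled; the delete-version and time-based triggers are omitted), not Doris's actual selection code:

```python
def pick_cumulative_input(rowsets, cumulative_point,
                          min_segments=5, max_segments=1000):
    """Collect unmerged rowsets at or after cumulative_point until the
    segment budget is reached; return them only if the minimum segment
    threshold is met, otherwise select nothing."""
    picked, segments = [], 0
    for rs in rowsets:                       # rowsets ordered by version
        if rs["first"] < cumulative_point:
            continue                         # left of the point: already merged
        if segments + rs["segments"] > max_segments:
            break                            # respect the upper bound
        picked.append(rs)
        segments += rs["segments"]
    return picked if segments >= min_segments else []
```

Rowsets left of cumulative_point (first != second) would instead be candidates for base compaction.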
-
-#### 3.2.2 Execute compaction
-
-Compaction execution can basically be understood as a read process plus a write process. A Reader is opened on the InputRowsets to be merged, records are read through next_row_with_aggregation and written to the output RowsetWriter, producing a new OutputRowset. The version of this Rowset is the full range of the InputRowsets' versions.
-
-#### 3.2.3 Update cumulative_point
-
-Update cumulative_point and pass the OutputRowset produced by cumulative
compaction to the subsequent base compaction process.
-
-After Compaction, for the aggregate key model and the unique key model, data with the same key that was scattered across different Rowsets is merged, achieving a pre-computation effect. At the same time, the number of Rowset files is reduced, and query efficiency is improved.
-
-## 4 Summary
-
-This article introduces the read-related process of the underlying storage
layer of the Doris system in detail.
-
-The reading process depends on the complete columnar storage implementation. For OLAP wide-table scenarios (reading a large number of rows and a small number of columns), it can quickly scan and filter based on various indexes (including short key, bloom filter, zone map, bitmap, etc.), skipping a large amount of data scanning, and applies optimizations such as delayed materialization, which suits data analysis in various scenarios; the Compaction execution process is also optimiz [...]
diff --git a/blogs/en/DorisInternals/doris-storage-struct-design.md
b/blogs/en/DorisInternals/doris-storage-struct-design.md
deleted file mode 100644
index 7c3e501c9fa..00000000000
--- a/blogs/en/DorisInternals/doris-storage-struct-design.md
+++ /dev/null
@@ -1,303 +0,0 @@
----
-{
- "title": "Analysis of storage structure design one of Apache Doris storage
layer design",
- "description": "This article mainly analyzes the implementation principle of
the storage layer of the Doris BE module by reading the code of the Doris BE
module, and expounds and decrypts the core technology behind the efficient
writing and query capabilities of Doris. It includes Doris column storage
design, index design, data read and write process, Compaction process, version
management of Tablet and Rowset, data backup and other functions.",
- "date": "2022-05-20",
- "metaTitle": "Analysis of storage structure design one of Apache Doris
storage layer design",
- "isArticle": true,
- "language": "en",
- "author": "ApacheDoris",
- "layout": "Article",
- "sidebar": false,
- "categories": "DorisInternals",
-}
----
-
-<!--
-Licensed to the Apache Software Foundation (ASF) under one
-or more contributor license agreements. See the NOTICE file
-distributed with this work for additional information
-regarding copyright ownership. The ASF licenses this file
-to you under the Apache License, Version 2.0 (the
-"License"); you may not use this file except in compliance
-with the License. You may obtain a copy of the License at
-
- http://www.apache.org/licenses/LICENSE-2.0
-
-Unless required by applicable law or agreed to in writing,
-software distributed under the License is distributed on an
-"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
-KIND, either express or implied. See the License for the
-specific language governing permissions and limitations
-under the License.
--->
-
-# Apache Doris Storage Layer Design, Part 1: Storage Structure Design Analysis
-
-## 1. Overall introduction
-
-Doris is an interactive SQL data warehouse based on the MPP architecture, mainly used for near-real-time reporting and multi-dimensional analysis. Doris's efficient import and query performance is inseparable from the sophisticated design of its storage structure.
-
-This article analyzes the implementation principle of the storage layer by reading the code of the Doris BE module, and explains the core technology behind Doris's efficient writing and query capabilities. It covers Doris's columnar storage design, index design, data read/write process, Compaction process, version management of Tablets and Rowsets, data backup, and other functions.
-
-This article introduces the storage layer structure of the Segment V2 version,
including rich functions such as ordered storage, sparse index, prefix index,
bitmap index, BloomFilter, etc., which can provide extremely fast query
capabilities for various complex scenarios.
-
-## 2 Design goals
-
-- Bulk import, few updates
-- Predominantly read requests
-- Wide-table scenarios: many rows read, few columns
-- Non-transactional scenarios
-- Good scalability
-
-## 3 Storage file format
-
-### 3.1 Storage directory structure
-
-The storage layer manages stored data under one or more paths configured via storage_root_path. The next layer of the storage directory is organized by buckets, and specific tablets are stored in the bucket directories, in subdirectories named by tablet_id.
-
-Segment files are stored in the tablet_id directory and managed according to SchemaHash. There can be multiple Segment files, generally split by size, 256MB by default. The segment v2 file naming rule is: ${rowset_id}_${segment_id}.dat.
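Putting the naming rule together, a segment path can be assembled as below. This is an illustrative Python sketch: the `${rowset_id}_${segment_id}.dat` file name follows the rule above, while the exact directory layout (shard/bucket level, `data` prefix) is an assumption for illustration:

```python
def segment_path(storage_root, shard, tablet_id, schema_hash,
                 rowset_id, segment_id):
    """Build a segment v2 file path: ${rowset_id}_${segment_id}.dat
    under the tablet's SchemaHash directory (layout assumed)."""
    return (f"{storage_root}/data/{shard}/{tablet_id}/{schema_hash}/"
            f"{rowset_id}_{segment_id}.dat")
```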
-
-The specific storage directory storage format is shown in the following figure:
-
-
-
-### 3.2 Segment v2 file structure
-
-The overall file format of Segment is divided into three parts: data area,
index area and footer, as shown in the following figure:
-
-
-
-- Data Region: stores the data of each column; the data here is loaded by Page on demand
-- Index Region: Doris stores the index data of each column uniformly in the Index Region. The data here is loaded at column granularity, so it is stored separately from the column data.
-- Footer information
-  - SegmentFooterPB: defines the metadata information of the file
-  - 4 bytes: checksum of the FooterPB content
-  - 4 bytes: length of the FileFooterPB message, used to read the FileFooterPB
-
-The following sections introduce the storage format design of each part.
-
-## 4 Footer Information
-
-The Footer is at the end of the file and stores the overall structure of the file, including the location of the data region, the location of the index region, and other information. It consists of four parts: SegmentFooterPB, Checksum, Length, and Magic Code.
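Because the trailer carries its own length field, a reader can locate the footer by working backwards from the end of the file. The sketch below illustrates that idea in Python; the byte order, the exact field order, and the magic value are assumptions for illustration, not the real on-disk format:

```python
import struct

MAGIC = b"D0R1"  # placeholder magic code, not the real on-disk value

def build_trailer(footer_pb, checksum):
    """Append [checksum:4][length:4][magic:4] after the serialized footer."""
    return (footer_pb + struct.pack("<I", checksum)
            + struct.pack("<I", len(footer_pb)) + MAGIC)

def parse_trailer(file_bytes):
    """Verify the magic code at the file end, then use the length field
    to locate the serialized footer and its checksum."""
    assert file_bytes[-4:] == MAGIC, "not a segment file"
    (length,) = struct.unpack("<I", file_bytes[-8:-4])
    (checksum,) = struct.unpack("<I", file_bytes[-12:-8])
    footer_pb = file_bytes[-12 - length:-12]
    return footer_pb, checksum
```

A real reader would additionally verify the checksum against the footer bytes before trusting them.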
-
-SegmentFooterPB data structure is as follows:
-
-
-
-SegmentFooterPB uses the PB format for storage, which mainly includes the meta
information of the column, the meta information of the index, the short key
index information of the segment, and the total number of rows.
-
-### 4.1 Column meta information
-
-ColumnId: the serial number of the current column in the schema
-
-UniqueId: globally unique id
-
-Type: the type information of the column
-
-Length: the length information of the column
-
-Encoding: encoding format
-
-Compression: Compression format
-
-Dict PagePointer: Dictionary information
-
-### 4.2 Meta information of column index
-
-- OrdinalIndex: Stores the sparse index meta information of the column.
-- ZoneMapIndex: Stores the meta information of the ZoneMap index, including the maximum value, the minimum value, whether there are null values, and whether there are only null values. SegmentZoneMap stores the global ZoneMap information, and PageZoneMaps stores the statistics of each page.
-- BitMapIndex: Stores the meta information of the BitMap index, including the BitMap type, the dictionary data, and the BitMap data.
-- BloomFilterIndex: Stores the BloomFilter index information.
-
-To prevent the index data itself from growing too large, ZoneMapIndex, BitMapIndex, and BloomFilterIndex adopt two-level Page management, corresponding to the IndexColumnMeta structure. When the index data fits in a single Page, that Page stores it directly (a one-level structure); when it does not fit, the index data is written into new Pages, and the Root Page stores the address information of the data Pages.
-
-## 5 Ordinal Index
-
-The Ordinal Index provides the physical address of each Column Data Page by row number. It aligns the column-stored data by row and can be understood as a first-level index: when other indexes look up data, the Ordinal Index is used to find the location of the data Page. Therefore, the Ordinal Index is introduced first.
-
-In a Segment, data is always stored sorted by key (AGGREGATE KEY, UNIQ KEY, or DUPLICATE KEY); that is, the key ordering determines the physical structure of the data storage, and the physical order of the column data follows it. When writing data, the Column Data Pages are managed by the Ordinal index, which records the file offset, size, and first row number (the ordinal) of each Column Data Page. In this way, each colu [...]
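Since the index records the first row number of each page, locating the page that holds a given row is a binary search, as sketched below. This is an illustrative Python sketch of the lookup idea, not Doris's ordinal_index code:

```python
import bisect

def seek_page(first_ordinals, rowid):
    """first_ordinals: ascending first row number of each Column Data Page.
    Return the index of the page containing rowid (binary search)."""
    return bisect.bisect_right(first_ordinals, rowid) - 1
```

This is the same alignment step the read path performs when seek_to_ordinal positions a column iterator on the page containing a target row.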
-
-### 5.1 Storage structure
-
-The Ordinal index meta information of each column is stored in OrdinalIndexMeta in SegmentFooterPB. The specific structure is shown in the following figure:
-
-
-
-OrdinalIndexMeta stores the root page address of the index data, with some optimizations: when the data has only one page, the address points directly to that only data page; when one page is not enough, it points to a second-level index page of OrdinalIndex type, in which each entry records a Column Data Page's offset, size, and ordinal (first row number). The Ordinal index index granularity [...]
-
-## 6 Column data storage
-
-### 6.1 Data page storage structure
-
-DataPage is mainly divided into two parts: Data part and Page Footer.
-
-The Data section stores the column data of the current Page. When Null values are allowed, a Null-value bitmap is stored separately; the row numbers of the Null values are recorded as RLE-encoded booleans.
-
-
-
-The Page Footer contains the Page type (Type), the uncompressed data size (UncompressedSize), the row id of the first row of the current Page (FirstOrdinal), the number of rows of the current Page (NumValues), and the size of the NullBitmap (NullMapSize).
-
-### 6.2 Data compression
-
-Different encodings are used for different field types. By default, the
correspondences adopted for different types are as follows:
-
-
-
-The data is compressed in LZ4F format by default.
-
-## 7 Short Key Index
-
-### 7.1 Storage structure
-
-The Short Key Index is a prefix index for quickly querying data by a given prefix column, based on the key ordering (AGGREGATE KEY, UNIQ KEY, and DUPLICATE KEY). It also adopts a sparse index structure: during the data writing process, an index item is generated every fixed number of rows (1024 rows by default; the index granularity is configurable). The process is shown in the following diagram:
-
-
-
-Among them, KeyBytes stores the index item data, and OffsetBytes stores the
offset of the index item in KeyBytes.
-
-### 7.2 Index Generation Rules
-
-The Short Key Index uses the first 36 bytes as the prefix index for this row
of data. Prefix indexes are simply truncated when a VARCHAR type is encountered.
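The 36-byte rule with VARCHAR truncation can be sketched as below. This is an illustrative Python sketch under simplified assumptions (columns are given as already-encoded byte strings; real column encoding is more involved):

```python
def short_key_prefix(columns, max_bytes=36):
    """columns: list of (encoded_bytes, is_varchar) in key order.
    Concatenate up to 36 bytes; a VARCHAR column ends the prefix."""
    out = b""
    for data, is_varchar in columns:
        out += data[: max_bytes - len(out)]   # take what fits in the budget
        if is_varchar or len(out) >= max_bytes:
            break                             # VARCHAR truncates; budget exhausted
    return out
```

Run against the two cases below: a fixed-length `user_id + age` pair followed by a VARCHAR `message` fills all 36 bytes, while a leading 20-byte VARCHAR stops the prefix at 20 bytes even though the budget is not used up.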
-
-### 7.3 Application Cases
-
-(1) The prefix index of the following table structure is user_id (8 Bytes) + age (4 Bytes) + message (prefix 24 Bytes).
-
-
-
-(2) The prefix index of the following table structure is user_name (20 Bytes).
Even if it does not reach 36 bytes, because VARCHAR is encountered, it is
directly truncated and will not continue further.
-
-
-
-When our query condition is the prefix of the prefix index, the query speed
can be greatly accelerated. For example, in the first example, we execute the
following query:
-
-```sql
-SELECT * FROM table WHERE user_id=1829239 and age=20;
-```
-
-This query will be much more efficient than the following query:
-
-```sql
-SELECT * FROM table WHERE age=20;
-```
-
-Therefore, when building a table, choosing the correct column order can
greatly improve query efficiency.
-
-## 8 ZoneMap Index
-
-The ZoneMap index stores the statistics of each column at Segment granularity and at Page granularity. These statistics help speed up queries and reduce the amount of scanned data. They include the minimum value (Min), the maximum value (Max), whether there are null values (HasNull), and whether there are non-null values (HasNotNull).
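The pruning idea behind these statistics can be sketched as follows. This is an illustrative Python sketch (plain dicts stand in for the PageZoneMaps metadata), not Doris code:

```python
def build_zonemap(pages):
    """Per-page Min/Max/null statistics, as PageZoneMaps maintains."""
    return [{"min": min((v for v in p if v is not None), default=None),
             "max": max((v for v in p if v is not None), default=None),
             "has_null": any(v is None for v in p),
             "has_not_null": any(v is not None for v in p)}
            for p in pages]

def prune(zonemaps, lo, hi):
    """Indexes of pages whose [min, max] can contain values in (lo, hi);
    all other pages are skipped without being read."""
    return [i for i, z in enumerate(zonemaps)
            if z["has_not_null"] and z["max"] > lo and z["min"] < hi]
```

For a predicate like `age > 20 and age < 1000`, only pages whose min/max overlap the interval survive, so the others are never scanned.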
-
-### 8.1 Storage structure
-
-The index storage structure of ZoneMap is shown in the following figure:
-
-
-
-In the SegmentFootPB structure, each column of index metadata ColumnIndexMeta
stores the ZoneMapIndex index data information of the current column.
ZoneMapIndex has two parts, SegmentZoneMap and PageZoneMaps. SegmentZoneMap
stores the global ZoneMap index information of the current Segment, and
PageZoneMaps stores the ZoneMap index information of each Data Page.
-
-PageZoneMaps corresponds to the IndexedColumnMeta structure storing the Page information of the index data. In the current implementation there is no compression, and the encoding is Plain. The OrdinalIndexPage in IndexedColumnMeta points to the offset and size of the root page of the index data. The second-level Page optimization is also applied here: when there is only one DataPage, OrdinalIndexMeta points directly to this DataPage; when there are multiple DataPages, Ordi [...]
-
-### 8.2 Index Generation Rules
-
-Doris enables the ZoneMap index for key columns by default; when the table model is DUPLICATE, the ZoneMap index is enabled for all fields. When column data is written to a Page, the data is compared automatically, and the ZoneMap index information of the current Segment and the current Page is continuously maintained.
-
-### 8.3 Application Cases
-
-During a query, for fields filtered by range conditions, the ZoneMap statistics are used to select the data range to scan. For example, in case 1, filtering on the age field, the query statement is as follows:
-
-```sql
-SELECT * FROM table WHERE age > 20 and age < 1000
-```
-
-If the Short Key Index is not hit, the ZoneMap index is used to find the ordinal (row) ranges that should be scanned according to the conditions on age in the query, reducing the number of Pages to be scanned.
-
-## 9 BloomFilter
-
-Doris provides a BloomFilter index for fields that cannot use the Short Key Index but have a high degree of discrimination.
-
-### 9.1 Storage structure
-
-The storage structure of BloomFilter is shown in the following figure:
-
-
-
-The BloomFilterIndex information stores the Hash strategy used, the Hash algorithm, and the corresponding BloomFilter data Page information. The Hash algorithm is HASH_MURMUR3, the Hash strategy is the BlockSplitBloomFilter block implementation, and the expected false positive rate (fpp) is configured to 0.05 by default.
-
-The storage of data pages corresponding to BloomFilter index data is similar
to that of ZoneMapIndex, and the optimization of secondary pages has been done,
which will not be described in detail here.
-
-### 9.2 Index Generation Rules
-
-A BloomFilter is generated at Page granularity. When a complete Page of data has been written, Doris generates the BloomFilter index data of this Page according to the Hash strategy. Currently the bloom filter does not support the tinyint/hll/float/double types; other types are supported. To use it, specify the fields to be indexed by BloomFilter via bloom_filter_columns in PROPERTIES.
-
-### 9.3 Application Cases
-
-When querying data, conditions on a field with a bloom filter are checked against the filter. When the bloom filter does not hit, the Page contains no such data, which reduces the number of Pages to be scanned.
-
-Case: The schema of the table is as follows:
-
-
-
-The query sql here is as follows:
-
-```sql
-SELECT * FROM table WHERE name = 'Zhang San'
-```
-
-Because name has a high degree of discrimination, a BloomFilter index, PROPERTIES ( "bloom_filter_columns" = "name" ), is added on the name column to improve the query performance of this SQL. At query time, the BloomFilter index can filter out a large number of Pages.
-
-## 10 Bitmap Index
-
-Doris also provides BitmapIndex to speed up data queries.
-
-### 10.1 Storage structure
-
-Bitmap storage format is as follows:
-
-
-
-The meta information of BitmapIndex is also stored in SegmentFooterPB.
BitmapIndex includes three parts: the BitMap type, the dictionary information
DictColumn, and the bitmap index data BitMapColumn. DictColumn and BitMapColumn
each correspond to an IndexedColumnData structure, storing the Page address
offset and size of the dictionary data and the index data respectively. The
secondary-Page optimization is applied here as well and is not explained in
detail.
-
-The difference from the other index storage structures is that the DictColumn
dictionary data is LZ4F-compressed, and the first value of each Data Page is
stored when recording the secondary-Page offset.
-
-### 10.2 Index Generation Rules
-
-A BitmapIndex must be created explicitly through CREATE INDEX. The bitmap is
an index over the Column field for the entire Segment, rather than a separate
one generated per Page. When writing data, a map structure is maintained to
record the row numbers corresponding to each key value, and a Roaring bitmap is
used to encode the rowids. The main structure is as follows:
-
-
-
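-As a sketch, a bitmap index is created explicitly with CREATE INDEX; the table, index, and column names below are hypothetical:
-
-```sql
-CREATE INDEX idx_city ON example_tbl (city) USING BITMAP COMMENT 'bitmap index on city';
-```
-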
-When generating index data, the dictionary data is written first: the key
values of the map structure are written into DictColumn. Then, for each key,
its Roaring-encoded rowid set is written into BitMapColumn as bytes.
-
-### 10.3 Application Cases
-
-When querying data, bitmap indexes can be used to optimize columns with low
cardinality and little differentiation, such as gender, marital status, and
geographic information.
-
-Case: The schema of the table is as follows:
-
-
-
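-As an illustration, a low-cardinality schema of this kind might look like the following (table and column names are hypothetical):
-
-```sql
-CREATE TABLE user_profile (
-    user_id BIGINT,
-    gender TINYINT,
-    city VARCHAR(32)
-)
-DUPLICATE KEY(user_id)
-DISTRIBUTED BY HASH(user_id) BUCKETS 10;
-```
-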
-The query SQL is as follows:
-
-```sql
-SELECT * FROM table WHERE city in ("Beijing", "Shanghai")
-```
-
-Since city has only a few distinct values, once the data dictionary and
bitmaps are built, matching rows can be found quickly by scanning the bitmaps.
And because the compressed bitmaps are small, the entire column can be matched
exactly while scanning very little data.
-
-## 11 Index query process
-
-When querying data in a Segment, the data is first filtered according to the
indexes on the fields that appear in the query conditions, and then read. The
overall query process is as follows:
-
-
-
-1. First, a row_bitmap is constructed according to the number of rows in the
Segment, recording which rows need to be read. If no index is used, all rows
are read.
-2. When the key is used in the query condition according to the prefix index
rule, the ShortKey Index is filtered first, and the ordinal row-number ranges
matched in the ShortKey Index are merged into the row_bitmap.
-3. When a column field in the query condition has a BitMap Index, the ordinal
row numbers that meet the condition are looked up directly in the BitMap index
and intersected with row_bitmap. This filtering is exact, so the condition is
then removed, and the field is not filtered again by the subsequent indexes.
-4. When a column field in the query condition has a BloomFilter index and the
condition is an equality condition (eq, in, is), it is filtered by the
BloomFilter index: all index entries are traversed, the BloomFilter of each
Page is checked, and all Pages that may satisfy the query condition are found.
The ordinal row-number ranges of those Pages in the index information are
intersected with row_bitmap.
-5. When a column field in the query condition has a ZoneMap index, it is
filtered by the ZoneMap index: all index entries are traversed to find every
Page whose ZoneMap range intersects the query condition, and the ordinal
row-number ranges in the index information are intersected with row_bitmap.
-6. After the row_bitmap is generated, find the specific Data Page in batches
through the OrdinalIndex of each Column.
-7. Batch read the Column Data Pages of each column. When reading a Page that
contains null values, the null-value bitmap is used to judge whether the
current row is null; if it is null, it can be filled in directly.
-
-## 12 Summary
-
-Doris currently adopts a complete column storage structure and provides rich
indexes to deal with different query scenarios, laying a solid foundation for
Doris's efficient writing and query performance. The Doris storage layer is
designed to be flexible, and functions such as new indexes and enhanced data
deletion can be further added in the future.
diff --git a/blogs/en/DorisInternals/doris-storage-writer-delete.md
b/blogs/en/DorisInternals/doris-storage-writer-delete.md
deleted file mode 100644
index e6b6efbf545..00000000000
--- a/blogs/en/DorisInternals/doris-storage-writer-delete.md
+++ /dev/null
@@ -1,150 +0,0 @@
----
-{
- "title": "Apache Doris storage layer design two write process, delete
process analysis",
- "description": "This article introduces in detail the internal
implementation process of the Doris system during the data writing process, as
well as the implementation process of Doris's conditional deletion of data and
batch deletion by key.",
- "date": "2022-05-20",
- "metaTitle": "Apache Doris storage layer design two write process, delete
process analysis",
- "isArticle": true,
- "language": "en",
- "author": "ApacheDoris",
- "layout": "Article",
- "sidebar": false,
- "categories": "DorisInternals",
-}
----
-
-<!--
-Licensed to the Apache Software Foundation (ASF) under one
-or more contributor license agreements. See the NOTICE file
-distributed with this work for additional information
-regarding copyright ownership. The ASF licenses this file
-to you under the Apache License, Version 2.0 (the
-"License"); you may not use this file except in compliance
-with the License. You may obtain a copy of the License at
-
- http://www.apache.org/licenses/LICENSE-2.0
-
-Unless required by applicable law or agreed to in writing,
-software distributed under the License is distributed on an
-"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
-KIND, either express or implied. See the License for the
-specific language governing permissions and limitations
-under the License.
--->
-
-# Apache Doris storage layer design two write process, delete process analysis
-
-## 1. Overall introduction
-
-Doris is an interactive SQL data warehouse based on an MPP architecture,
mainly used for near-real-time reporting and multidimensional analysis.
Doris's efficient import and query performance is inseparable from the
sophisticated design of its storage structure.
-
-This article analyzes the implementation principles of the storage layer of
the Doris BE module by reading its code, explaining the core technology behind
Doris's efficient write and query capabilities. It covers Doris's columnar
storage design, index design, data read and write processes, the Compaction
process, version management of Tablets and Rowsets, data backup, and other
functions.
-
-This article introduces the storage layer structure of the Segment V2 version,
including rich functions such as ordered storage, sparse index, prefix index,
bitmap index, BloomFilter, etc., which can provide extremely fast query
capabilities for various complex scenarios.
-
-This article introduces in detail the internal implementation process of the
Doris system during data writing, as well as the implementation of Doris's
conditional deletion of data and batch deletion by key.
-
-## 2 Glossary
-
-- **FE:** Frontend, the front-end node of Doris. It is mainly responsible for
receiving and returning client requests, metadata, cluster management, and
query plan generation.
-- **BE:** Backend, the backend node of Doris. Mainly responsible for data
storage and management, query plan execution and other work.
-**Tablet:** The actual physical storage unit of a table. After partitioning
and bucketing, a table is stored in units of Tablets in the distributed storage
layer formed by the BEs. Each Tablet includes meta information and several
consecutive RowSets.
-**Rowset:** The data set of one data change on a Tablet; data changes include
data import, deletion, update, etc. Rowsets are recorded by version
information: a version is generated for each change.
-**Version:** Consists of two attributes, Start and End, and maintains the
record information of data changes. It usually indicates the version range of a
Rowset: a new import generates a Rowset whose Start and End are equal, while
Compaction generates a Rowset whose version spans a range.
-- **Segment:** Indicates the data segment in the Rowset. Multiple Segments
form a Rowset.
-- **Compaction:** The process of merging consecutive versions of Rowset is
called Compaction, and the data is compressed during the merging process.
-
-## 3 Write process
-
-Doris supports various data-writing methods for different scenarios, including
Broker Load for importing from other storage sources, Stream Load for
synchronous import over HTTP, routine Routine Load import, Insert Into writing,
etc. At the same time, the import process involves the FE module (mainly
responsible for import-plan generation and import-task scheduling), the BE
module (mainly responsible for the ETL and storage of data), and the Broker
module (providing Doris with the abilit [...]
-
-The following takes Stream Load writing as an example to describe the overall
data writing process of Doris as shown in the following figure:
-
-
-
-The process is described as follows:
-
-1. FE receives the user's write request and randomly selects a BE as the
Coordinator BE, then redirects the user's request to that BE.
-2. The Coordinator BE receives the user's data-write request, and at the same
time asks the FE to generate an execution plan and to schedule and manage the
import task (LoadJob) and the import transaction.
-3. The Coordinator BE schedules the execution of the import plan, and performs
data verification and cleaning.
-4. The data is written to the BE storage layer. It is first written to memory,
and after a certain amount of data accumulates, it is written to physical disk
in the storage layer's data format.
-
-This article mainly introduces the detailed process of writing data to the BE
storage layer. The rest of the process is not described in detail.
-
-### 3.1 Data distribution process
-
-After the data is cleaned and filtered, it is sent in batches to the BE nodes
of the storage layer through Open/AddBatch requests. A BE supports multiple
LoadJob tasks executing concurrent writes; LoadChannelMgr manages these tasks
and distributes the data.
-
-The data distribution and writing process is shown in the following figure:
-
-
-
-1. Each import task LoadJob creates a LoadChannel to execute. A LoadChannel
maintains one import channel and can keep writing data in batches until the
import is complete.
-
-2. The LoadChannel creates TabletsChannels to perform the specific import
operations. A TabletsChannel corresponds to multiple Tablets. In one batch
write, the TabletsChannel distributes the data to the corresponding Tablets,
and a DeltaWriter writes the data into each Tablet; the real write operation
begins here.
-
-### 3.2 DeltaWriter and Memtable
-
-DeltaWriter is mainly responsible for continuously receiving newly written
batches of data and completing the data writing for a single Tablet. Since the
new data can be an incremental Delta part, it is called DeltaWriter.
-
-DeltaWriter uses an LSM tree-like structure for data writing. The data is
first written to the Memtable. When the Memtable data is full, it will
asynchronously flush to generate a Segment for persistence, and at the same
time generate a new Memtable to continue to receive new data for import. This
flush operation is done by the MemtableFlushExecutor executor.
-
-In the Memtable, a skip-list structure is used to sort the data; the sort rule
compares the fields of the schema's keys in order. This ensures that the data
in each written Segment is ordered. If the current model is not the DUP model
(i.e., the AGG or UNIQUE model), rows with the same key are also aggregated.
-
-### 3.3 Physical Write
-
-#### 3.3.1 RowsetWriter module design
-
-Writing at the physical storage level is done by RowsetWriter. RowsetWriter is
further divided into sub-modules such as SegmentWriter, ColumnWriter,
PageBuilder, and IndexBuilder.
-
-1. RowsetWriter completes the writing of one import (LoadJob) task as a whole.
One import task generates one Rowset, and a Rowset represents one successfully
imported data version. In the implementation, RowsetWriter is responsible for
completing the writing of the Rowset.
-2. SegmentWriter is responsible for implementing Segment writing. A Rowset
can consist of multiple Segment files.
-3. ColumnWriter is contained in SegmentWriter. A Segment file is a complete
columnar storage structure containing each column and its related index data;
the writing of each column is handled by a ColumnWriter.
-4. In the file storage format, data and indexes are organized by Page.
ColumnWriter includes a PageBuilder for generating data Pages and IndexBuilders
for generating index Pages, which complete the writing of Pages.
-5. Finally, FileWritableBlock is responsible for reading and writing specific
files. For the storage format of the file, please refer to the document
"Introduction to Doris Storage Layer Design 1 - Analysis of Storage Structure
Design".
-
-#### 3.3.2 RowsetWriter writing process
-
-The overall physical writing is shown in the following figure:
-
-
-
-Detailed description of the physical write process:
-
-1. When a Memtable is full (the default is 100 MB), the data in the Memtable,
already ordered by key, is flushed to disk, being written to the RowsetWriter
row by row.
-2. The RowsetWriter in turn writes the data row by row to the SegmentWriter,
maintaining the SegmentWriter currently being written and the list of file
blocks to be written. Each completed Segment adds one file block.
-3. SegmentWriter writes data to each ColumnWriter row by row, and writes
ShortKeyIndexBuilder at the same time. ShortKeyIndexBuilder is mainly
responsible for generating the index Page of ShortKeyIndex. For the specific
ShortKeyIndex index format, please refer to the document "Introduction to Doris
Storage Layer Design 1 - Storage Structure Design Analysis".
-4. ColumnWriter writes the data into the PageBuilder and each IndexBuilder.
The PageBuilder generates the Pages for the ColumnData, and the IndexBuilders
each generate the Page format of one index (OrdinalIndexBuilder for the
OrdinalIndex row-number sparse index, ZoneMapIndexBuilder for the ZoneMapIndex,
BitMapIndexBuilder for the BitMapIndex, BloomFilterIndexBuilder for the
BloomFilterIndex). For details, refer to Doris [...]
-5. After adding data, the RowsetWriter performs a flush operation.
-6. The flush operation of SegmentWriter writes data and indexes to disk. The
read and write to the disk is done by FileWritableBlock.
-7. Each ColumnWriter writes its data Pages and the Pages generated by its
indexes to the file in sequence.
-8. SegmentWriter generates the SegmentFooter information; the SegmentFooter
records the meta information of the Segment file. After completing the write
operation, RowsetWriter starts a new SegmentWriter and writes the next Memtable
into a new Segment, until the import is complete.
-
-### 3.4 Rowset publication
-
-When the data import is complete, the DeltaWriter publishes the newly
generated Rowset. Publishing sets this version of the Rowset to the visible
state, indicating that the imported data has taken effect and can be queried.
The version information indicates the order in which Rowsets take effect: each
import generates one Rowset, and each successful import increases the version
in order. The entire publication process is as follows:
-
-1. DeltaWriter counts the current RowsetMeta metadata information, including
the number of rows, bytes, time, and segments.
-2. The statistics are saved to RowsetMeta and the import transaction is
submitted to FE. The import transaction is opened by FE to ensure that the
data on every BE node is imported and takes effect at the same time.
-3. After FE coordination, FE issues a Publish task to make the imported Rowset
version take effect; the effective version is specified in the task. Only then
does the BE storage layer make this version of the Rowset visible.
-4. Rowset is added to the Tablet of the BE storage layer for management.
-
-## 4 Delete process
-
-At present, there are two implementations of delete: the common delete type
DELETE, and LOAD_DELETE.
-
-### 4.1 DELETE execution flow
-
-DELETE supports general deletion operations, and its implementation is
relatively simple. In DELETE mode, no data is actually deleted; instead, the
deletion conditions are recorded and stored in the Meta information. The delete
conditions are merged into the Base version when Base Compaction runs. The Base
version is the Tablet's first Rowset data version, [0-x]. The specific process
is as follows:
-
-1. When a delete is issued, the FE directly sends down the delete command and
the delete conditions.
-2. BE starts an EngineBatchLoadTask task locally, generating a new Rowset
version that records the deletion-condition information. This delete-record
Rowset differs slightly from a write-path Rowset: it records only the deletion
conditions and contains no actual data.
-3. FE also publishes the effective version. The Rowset will be added to the
Tablet and the TabletMeta information will be saved.
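-
-As a sketch, such a conditional delete is issued as ordinary SQL (the table, partition, and condition here are hypothetical):
-
-```sql
-DELETE FROM example_tbl PARTITION p2022 WHERE city = 'Beijing';
-```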
-
-### 4.2 LOAD_DELETE execution flow
-
-LOAD_DELETE supports deleting data by importing the keys to be deleted in
batches under the UNIQUE KEY model, which enables large-scale data deletion.
The overall idea is to add a delete-state flag to the data records; the deleted
keys are then compacted away during the Compaction process. Compaction is
mainly responsible for merging multiple Rowset versions, and the Compaction
process will be described in detail in subsequent articles.
-
-## 5 Summary
-
-This article introduced the write process and delete process of the underlying
storage layer of the Doris system in detail. It first described Doris's overall
write process, then analyzed the design of Doris's LSM-like storage structure,
the data distribution and physical write processes in the in-memory part, and
Rowset version publication, and finally introduced the two data deletion
methods supported by Doris.
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]