blambov commented on code in PR #2267:
URL: https://github.com/apache/cassandra/pull/2267#discussion_r1172477692


##########
src/java/org/apache/cassandra/io/sstable/format/bti/BtiFormat.md:
##########
@@ -0,0 +1,1010 @@
+<!---
+ Licensed to the Apache Software Foundation (ASF) under one
+ or more contributor license agreements.  See the NOTICE file
+ distributed with this work for additional information
+ regarding copyright ownership.  The ASF licenses this file
+ to you under the Apache License, Version 2.0 (the
+ "License"); you may not use this file except in compliance
+ with the License.  You may obtain a copy of the License at
+ 
+     http://www.apache.org/licenses/LICENSE-2.0
+ 
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ See the License for the specific language governing permissions and
+ limitations under the License.
+-->
+
+# Big Trie-Indexed (BTI) SSTable format
+
+This document describes the BTI SSTable format, which is introduced to
+Cassandra with [CEP-25](https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-25%3A+Trie-indexed+SSTable+format).
+
+The format is called BTI, which stands for "Big Trie-Indexed", because it shares
+the data format of the existing BIG format and only changes the primary indexes
+inside SSTables. The format relies on byte order and tries, and uses a combination
+of features to make the indexing structures compact and efficient. The paragraphs
+below describe the format's features and mechanisms together with the motivation
+behind them, and conclude with a detailed description of the on-disk format.
+
+# Prerequisites
+
+## Byte-comparable types
+
+The property of being byte-comparable (also called byte-ordered) for a
+key denotes that there is a serialisation of that key to a sequence of
+bytes where the lexicographic comparison of the unsigned bytes produces
+the same result as performing a typed comparison of the key.
+
+For Cassandra, such a representation is given by
+[CASSANDRA-6936](https://issues.apache.org/jira/browse/CASSANDRA-6936).
+A detailed description of the mechanics of the translation is provided in
+the [included documentation](../../../../utils/bytecomparable/ByteComparable.md).
+
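+As a minimal illustration of this property (not Cassandra's actual encoding,
+which is described in the document linked above), a fixed-length big-endian
+serialisation of a 32-bit integer with its sign bit flipped compares the same
+way under unsigned lexicographic byte order as the integer itself:
+
+```java
+import java.nio.ByteBuffer;
+import java.util.Arrays;
+
+public class ByteComparableSketch
+{
+    // Flipping the sign bit makes unsigned byte-by-byte order match signed int order.
+    static byte[] encodeInt(int value)
+    {
+        return ByteBuffer.allocate(4).putInt(value ^ 0x80000000).array();
+    }
+
+    public static void main(String[] args)
+    {
+        int typed = Integer.compare(-7, 42);
+        int byteOrdered = Arrays.compareUnsigned(encodeInt(-7), encodeInt(42));
+        // Both comparisons have the same sign (negative here), which is exactly
+        // the byte-comparable property described above.
+        System.out.println(typed + " / " + byteOrdered);
+    }
+}
+```
+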
+## Tries
+
+A trie is a data structure that describes a mapping between sequences
+and associated values. It is very similar to a deterministic finite
+state automaton, with the main difference that an automaton is allowed
+to have cycles, while a trie is not.
+
+Because the theory and main usage of the structure is for encoding words
+of a language, the trie terminology talks about "characters", "words"
+and "alphabet", which in our case map to bytes of the byte-ordered
+representation, the sequence that encodes it, and the possible values of
+a byte[^1].
+
+[^1]: For simplicity this description assumes we directly map a byte to
+a character. Other options are also possible (e.g. using hex digits
+as the alphabet and two transitions per byte).
+
+A trie can be defined as a tree graph in which vertices are states, some
+of which can be final and contain associated information, and where
+edges are labelled with characters. A valid word in the trie is encoded
+by a path starting from the root of the trie where each edge is labelled
+with the next character of the word, and ending in a final state which
+contains the 'payload' associated with the word.
+
+```mermaid
+graph TD
+  Node_(( ))
+  style Node_ fill:darkgrey
+  Node_ --"a"--> Node_a(((a)))
+    Node_a --"l"--> Node_al(( ))
+      Node_al --"l"--> Node_all(( ))
+        Node_all --"o"--> Node_allo(( ))
+          Node_allo --"w"--> Node_allow(((allow)))
+    Node_a --"n"--> Node_an(((an)))
+      Node_an --"d"--> Node_and(((and)))
+      Node_an --"y"--> Node_any(((any)))
+    Node_a --"r"--> Node_ar(( ))
+      Node_ar --"e"--> Node_are(((are)))
+    Node_a --"s"--> Node_as(((as)))
+  Node_ --"n"--> Node_n(( ))
+    Node_n --"o"--> Node_no(( ))
+      Node_no --"d"--> Node_nod(( ))
+        Node_nod --"e"--> Node_node(((node)))
+  Node_ --"o"--> Node_o(( ))
+    Node_o --"f"--> Node_of(((of)))
+    Node_o --"n"--> Node_on(((on)))
+  Node_ --"t"--> Node_t(( ))
+    Node_t --"h"--> Node_th(( ))
+      Node_th --"e"--> Node_the(((the)))
+      Node_th --"i"--> Node_thi(( ))
+        Node_thi --"s"--> Node_this(((this)))
+    Node_t --"o"--> Node_to(((to)))
+    Node_t --"r"--> Node_tr(( ))
+      Node_tr --"i"--> Node_tri(( ))
+        Node_tri --"e"--> Node_trie(((trie)))
+    Node_t --"y"--> Node_ty(( ))
+      Node_ty --"p"--> Node_typ(( ))
+        Node_typ --"e"--> Node_type(( ))
+          Node_type --"s"--> Node_types(((types)))
+  Node_ --"w"--> Node_w(( ))
+    Node_w --"i"--> Node_wi(( ))
+      Node_wi --"t"--> Node_wit(( ))
+        Node_wit --"h"--> Node_with(((with)))
+          Node_with --"o"--> Node_witho(( ))
+            Node_witho --"u"--> Node_withou(( ))
+              Node_withou --"t"--> Node_without(((without)))
+```
+
+This means that in a constructed trie finding the payload associated
+with a word is a matter of following the edges (also called
+"transitions") from the initial state labelled with the consecutive
+characters of the word, and retrieving the payload associated with the
+state at which we end up. If that's not a final state, or if at any
+point along the way we did not find a transition in the trie matching the
+character, the trie does not have an association for the word. The
+complexity of lookup is thus _O_(len(word)) transitions; since the cost of
+taking a transition is usually constant, this complexity is
+theoretically optimal.
+
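+The following sketch shows the lookup procedure on a naive in-memory trie. It
+is only an illustration of the algorithm above; the on-disk node encodings
+used by the format are described later in this document and are much more
+compact:
+
+```java
+import java.util.HashMap;
+import java.util.Map;
+
+// Minimal in-memory trie sketch to illustrate O(len(word)) lookup.
+class TrieSketch<V>
+{
+    private static class Node<V>
+    {
+        final Map<Byte, Node<V>> children = new HashMap<>();
+        V payload; // non-null only for final states
+    }
+
+    private final Node<V> root = new Node<V>();
+
+    void put(byte[] word, V payload)
+    {
+        Node<V> node = root;
+        for (byte b : word)
+            node = node.children.computeIfAbsent(b, x -> new Node<V>());
+        node.payload = payload;
+    }
+
+    V get(byte[] word)
+    {
+        Node<V> node = root;
+        for (byte b : word)
+        {
+            node = node.children.get(b);
+            if (node == null)
+                return null;     // no transition for this character
+        }
+        return node.payload;     // null if this is not a final state
+    }
+}
+```
+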
+From a storage space perspective, one of the main benefits of a trie as
+a data structure for storing a map is the fact that it completely avoids
+storing redundant prefixes. All words that start with the same sequence
+store a representation of that sequence only once. If prefixes are
+commonly shared, this can save a great deal of space.
+
+When the items stored in a trie are lexicographically (=byte) ordered, a
+trie is also an ordered structure. A trie can be walked in order and it
+is also possible to efficiently list the items between two given keys.
+
+In fact, one can efficiently (and lazily) apply set algebra over tries,
+and slicing can be seen as a simple application of intersection, where
+the intersecting trie is generated on the fly. The set operations
+benefit from the same prefix-sharing effect &mdash; we apply union /
+intersection / difference to a state, which has the effect of applying
+the operation to all words that share the prefix denoted by that state.
+
+```mermaid
+graph TD
+  Node_(( ))
+  style Node_ fill:darkgrey
+  Node_ --"a"--> Node_a(((a)))
+  style Node_a stroke:lightgrey,color:lightgrey
+  linkStyle 0 stroke:lightgrey,color:lightgrey
+    Node_a --"l"--> Node_al(( ))
+    style Node_al stroke:lightgrey,color:lightgrey
+    linkStyle 1 stroke:lightgrey,color:lightgrey
+      Node_al --"l"--> Node_all(( ))
+      style Node_all stroke:lightgrey,color:lightgrey
+      linkStyle 2 stroke:lightgrey,color:lightgrey
+        Node_all --"o"--> Node_allo(( ))
+        style Node_allo stroke:lightgrey,color:lightgrey
+        linkStyle 3 stroke:lightgrey,color:lightgrey
+          Node_allo --"w"--> Node_allow(((allow)))
+          style Node_allow stroke:lightgrey,color:lightgrey
+          linkStyle 4 stroke:lightgrey,color:lightgrey
+    Node_a --"n"--> Node_an(((an)))
+          style Node_an stroke:lightgrey,color:lightgrey
+          linkStyle 5 stroke:lightgrey,color:lightgrey
+      Node_an --"d"--> Node_and(((and)))
+          style Node_and stroke:lightgrey,color:lightgrey
+          linkStyle 6 stroke:lightgrey,color:lightgrey
+      Node_an --"y"--> Node_any(((any)))
+          style Node_any stroke:lightgrey,color:lightgrey
+          linkStyle 7 stroke:lightgrey,color:lightgrey
+    Node_a --"r"--> Node_ar(( ))
+          style Node_ar stroke:lightgrey,color:lightgrey
+          linkStyle 8 stroke:lightgrey,color:lightgrey
+      Node_ar --"e"--> Node_are(((are)))
+          style Node_are stroke:lightgrey,color:lightgrey
+          linkStyle 9 stroke:lightgrey,color:lightgrey
+    Node_a --"s"--> Node_as(((as)))
+          style Node_as stroke:lightgrey,color:lightgrey
+          linkStyle 10 stroke:lightgrey,color:lightgrey
+  Node_ --"n"--> Node_n(( ))
+    Node_n --"o"--> Node_no(( ))
+      Node_no --"d"--> Node_nod(( ))
+        Node_nod --"e"--> Node_node(((node)))
+  Node_ --"o"--> Node_o(( ))
+    Node_o --"f"--> Node_of(((of)))
+    Node_o --"n"--> Node_on(((on)))
+  Node_ --"t"--> Node_t(( ))
+    Node_t --"h"--> Node_th(( ))
+      Node_th --"e"--> Node_the(((the)))
+      Node_th --"i"--> Node_thi(( ))
+        Node_thi --"s"--> Node_this(((this)))
+          style Node_this stroke:lightgrey,color:lightgrey
+          linkStyle 22 stroke:lightgrey,color:lightgrey
+    Node_t --"o"--> Node_to(((to)))
+          style Node_to stroke:lightgrey,color:lightgrey
+          linkStyle 23 stroke:lightgrey,color:lightgrey
+    Node_t --"r"--> Node_tr(( ))
+          style Node_tr stroke:lightgrey,color:lightgrey
+          linkStyle 24 stroke:lightgrey,color:lightgrey
+      Node_tr --"i"--> Node_tri(( ))
+          style Node_tri stroke:lightgrey,color:lightgrey
+          linkStyle 25 stroke:lightgrey,color:lightgrey
+        Node_tri --"e"--> Node_trie(((trie)))
+          style Node_trie stroke:lightgrey,color:lightgrey
+          linkStyle 26 stroke:lightgrey,color:lightgrey
+    Node_t --"y"--> Node_ty(( ))
+          style Node_ty stroke:lightgrey,color:lightgrey
+          linkStyle 27 stroke:lightgrey,color:lightgrey
+      Node_ty --"p"--> Node_typ(( ))
+          style Node_typ stroke:lightgrey,color:lightgrey
+          linkStyle 28 stroke:lightgrey,color:lightgrey
+        Node_typ --"e"--> Node_type(( ))
+          style Node_type stroke:lightgrey,color:lightgrey
+          linkStyle 29 stroke:lightgrey,color:lightgrey
+          Node_type --"s"--> Node_types(((types)))
+          style Node_types stroke:lightgrey,color:lightgrey
+          linkStyle 30 stroke:lightgrey,color:lightgrey
+  Node_ --"w"--> Node_w(( ))
+          style Node_w stroke:lightgrey,color:lightgrey
+          linkStyle 31 stroke:lightgrey,color:lightgrey
+    Node_w --"i"--> Node_wi(( ))
+          style Node_wi stroke:lightgrey,color:lightgrey
+          linkStyle 32 stroke:lightgrey,color:lightgrey
+      Node_wi --"t"--> Node_wit(( ))
+          style Node_wit stroke:lightgrey,color:lightgrey
+          linkStyle 33 stroke:lightgrey,color:lightgrey
+        Node_wit --"h"--> Node_with(((with)))
+          style Node_with stroke:lightgrey,color:lightgrey
+          linkStyle 34 stroke:lightgrey,color:lightgrey
+          Node_with --"o"--> Node_witho(( ))
+          style Node_witho stroke:lightgrey,color:lightgrey
+          linkStyle 35 stroke:lightgrey,color:lightgrey
+            Node_witho --"u"--> Node_withou(( ))
+          style Node_withou stroke:lightgrey,color:lightgrey
+          linkStyle 36 stroke:lightgrey,color:lightgrey
+              Node_withou --"t"--> Node_without(((without)))
+          style Node_without stroke:lightgrey,color:lightgrey
+          linkStyle 37 stroke:lightgrey,color:lightgrey
+
+```
+
+(An example of slicing the trie above with the range "bit"-"thing".
+Processing only applies on boundary nodes (root, "t", "th", "thi"),
+where we throw away the transitions outside the range. Subtries like the
+ones for "n" and "o" fall completely between "b" and "t" thus are fully
+inside the range and can be processed without any further restrictions.)
+
+A trie can be used as a modifiable in-memory data structure where one
+can add and remove individual elements. It can also be constructed from
+sorted input, incrementally storing the data directly to disk and
+building an efficient read-only on-disk data structure.
+
+For more formal information about the concept and applications of tries
+and finite state automata, try [Introduction to Automata Theory,
+Languages, and Computation](https://books.google.com/books?id=dVipBwAAQBAJ).
+There are many variations of the concept, and of the implementation of
+states and transitions that can be put to use to achieve even further
+efficiency gains; some of these will be detailed below.
+
+# Indexing with tries
+
+Since a trie is generally an ordered byte source to payload map, we can
+apply the concept directly to the components of Cassandra that are most
+affected by the inefficiency of using comparison-based structures: the
+indices.
+
+This can be done in the following way:
+
+-   When we write the index, we map each key into its byte-ordered
+    representation and create an on-disk trie of byte-ordered
+    representations of keys mapping into positions in the data file.
+
+-   When we need an exact match for a key, we create a (lazily
+    generated) byte-ordered representation of the key and look for it
+    in the trie.
+
+    -   If we find a match, we know the data file position.
+
+    -   If there is no match, there is no data associated with the key.
+
+-   When we need a greater-than/greater-or-equal match, we use the
+    byte-ordered representation to create a path that leads to the
+    first matching data position in the sstable.
+
+    -   We can then use this path to iterate the greater keys in the
+        sstable.
+
+This works, but isn't very efficient. Lookup in it is _O_(len(key)), 
+which can even mean that many seeks on disk, and we have to store
+a transition (which defines the size of the structure) for every
+non-prefix character in the dataset.
+
+We can do much better.
+
+## Trimming the fat
+
+The primary purpose of the index is to find a position in the data file
+for the given key. It needs to be able to find the correct position for
+any existing key, but there is no need for it to be exact on keys that
+are not present in the file &mdash; since our data files contain a copy of
+the key at the start of each partition, we can simply check if the key
+we are searching for matches the key at the position returned by the
+index.
+
+This allows us to use a simple optimization: instead of storing the full
+key in the index trie, we can store only a prefix of the key that is
+unique among all partitions in the table. This means that we have
+intermediate nodes in the trie only if a prefix is shared by multiple
+keys, which normally reduces the number of nodes and transitions in the
+trie to about 2*n*.
+
+```mermaid
+graph TD
+  Node_(( ))
+  style Node_ fill:darkgrey
+  Node_  --"a"--> Node_a((( )))
+    Node_a --"l"--> Node_al((( )))
+    Node_a --"n"--> Node_an((( )))
+      Node_an --"d"--> Node_and((( )))
+      Node_an --"y"--> Node_any((( )))
+    Node_a --"r"--> Node_ar((( )))
+    Node_a --"s"--> Node_as((( )))
+  Node_  --"n"--> Node_n((( )))
+  Node_  --"o"--> Node_o(( ))
+    Node_o --"f"--> Node_of((( )))
+    Node_o --"n"--> Node_on((( )))
+  Node_  --"t"--> Node_t(( ))
+    Node_t --"h"--> Node_th(( ))
+      Node_th --"e"--> Node_the((( )))
+      Node_th --"i"--> Node_thi((( )))
+    Node_t --"o"--> Node_to((( )))
+    Node_t --"r"--> Node_tr((( )))
+    Node_t --"y"--> Node_ty((( )))
+  Node_  --"w"--> Node_w(( ))
+    Node_w --"ith"--> Node_with((( )))
+          Node_with --"o"--> Node_without((( )))
+```
+
+This also reduces the number of steps we need to take in the trie. In a
+well-balanced key set (such as the one where the byte-ordered key starts
+with a hash as in Murmur or Random-partitioned primary keys) the lookup
+complexity becomes _O_(log _n_) transitions[^2].
+
+[^2]: For comparison, the complexity of binary search in a sorted
+primary index is also _O_(log _n_), but in key comparisons whose
+complexity on average in a well-balanced key set is another _O_(log _n_)
+for a total _O_(log<sup>2</sup> _n_).
+
+## Taking hardware into account
+
+The point above improves the number of transitions significantly, but
+the out-of-cache efficiency is still pretty bad if we have to read a new
+disk page every time we examine a node. Fortunately we can take some
+extra care during construction to make sure we make the most of every
+disk page brought up during lookup.
+
+The idea of this is to pack wide sections of the trie in pages, so that
+every time we open a page we can be certain to be able to follow several
+transitions before leaving that page.
+
+```mermaid
+graph TD
+  subgraph p1 [ ]
+  Node_(( ))
+  style Node_ fill:darkgrey
+    Node_  --"a"--> Node_a((( )))
+    Node_  --"t"--> Node_t(( ))
+  end
+  
+  subgraph p2 [ ]
+    Node_a --"l"--> Node_al((( )))
+    Node_a --"n"--> Node_an((( )))
+      Node_an --"d"--> Node_and((( )))
+      Node_an --"y"--> Node_any((( )))
+  end
+  
+  subgraph p3 [ ]
+    Node_a --"r"--> Node_ar((( )))
+    Node_a --"s"--> Node_as((( )))
+  Node_  --"n"--> Node_n((( )))
+  end
+
+  subgraph p4 [ ]
+  Node_  --"o"--> Node_o(( ))
+    Node_o --"f"--> Node_of((( )))
+    Node_o --"n"--> Node_on((( )))
+    Node_t --"o"--> Node_to((( )))
+  end
+  
+  subgraph p5 [ ]
+    Node_t --"h"--> Node_th(( ))
+      Node_th --"e"--> Node_the((( )))
+      Node_th --"i"--> Node_thi((( )))
+    Node_t --"r"--> Node_tr((( )))
+  end
+  
+  subgraph p6 [ ]
+    Node_t --"y"--> Node_ty((( )))
+  Node_  --"w"--> Node_w(( ))
+    Node_w --"ith"--> Node_with((( )))
+          Node_with --"o"--> Node_without((( )))
+  end
+  
+  p2 ~~~ p3 ~~~ p4 ~~~ p5 ~~~ p6
+```
+
+One way to generate something like this is to start from the root and do
+a breadth-first walk, placing the encountered nodes on disk until a page
+is filled and putting their target transitions in a queue, from which the
+process is repeated to fill other pages.
+
+Another approach, more suitable to our application because it can be
+done as part of the incremental construction process, is to do the
+packing from the bottom up &mdash; when the incremental construction
+algorithm completes a node we do not immediately write it, but wait
+until we have formed a branch that is bigger than a page. When this
+happens we lay out the node's children (each smaller than a page but
+the root of the biggest branch that would fit) and let the parent node be
+treated like a leaf from there on. In turn it will become part of a
+branch that is bigger than a page and will be laid out packaged together
+with its related nodes, resulting in a picture similar to the above.
+
+In fact the bottom-up process has a little performance benefit over the
+top-down: with the top-down construction the root page is full and leaf
+pages take combinations of unrelated smaller branches; with the
+bottom-up the leaf pages take as much information as possible about a
+branch, while the root often remains unfilled. For the best possible
+out-of-cache efficiency we would prefer the set of non-leaf pages to be
+as small as possible. Having larger leaf page branches means more of the
+trie data is in the leaf branches and thus the size of that intermediate
+node set is smaller.
+
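+The following is a very rough sketch of the bottom-up idea, using a toy tree
+and a page size measured in nodes; the names and structure here are
+illustrative only and do not reflect the actual writer classes referenced
+below:
+
+```java
+import java.util.ArrayList;
+import java.util.List;
+
+// Illustration of bottom-up packing: once a completed node's branch grows beyond
+// a page, its child branches are "laid out" (here just printed) and the node
+// itself is treated as a leaf of size 1 from then on.
+public class PagePackingSketch
+{
+    static final int PAGE_SIZE = 4;   // tiny page, measured in nodes, for the example
+
+    static class Node
+    {
+        final String name;
+        final List<Node> children = new ArrayList<>();
+        int branchSize = 1;           // size of the not-yet-written branch rooted here
+
+        Node(String name, Node... kids)
+        {
+            this.name = name;
+            for (Node kid : kids)
+            {
+                children.add(kid);
+                branchSize += kid.branchSize;
+            }
+        }
+    }
+
+    // Called when incremental construction completes a node (all children known).
+    static void complete(Node node)
+    {
+        if (node.branchSize <= PAGE_SIZE)
+            return;                   // keep accumulating; the branch still fits a page
+        for (Node child : node.children)
+            System.out.println("write branch of \"" + child.name + "\" (" + child.branchSize + " nodes)");
+        node.branchSize = 1;          // the parent now behaves like a leaf for its own parent
+    }
+
+    public static void main(String[] args)
+    {
+        Node an = new Node("an", new Node("and"), new Node("any"));
+        Node a = new Node("a", new Node("al"), an, new Node("ar"), new Node("as"));
+        complete(an);   // branch of 3 nodes still fits a page: nothing is written yet
+        complete(a);    // branch of 7 nodes exceeds the page: child branches are laid out
+    }
+}
+```
+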
+See [`IncrementalTrieWriterPageAware`](../../../tries/IncrementalDeepTrieWriterPageAware.java)
+for details on how the page-aware trie construction is implemented.
+
+## Storing the trie
+
+Another interesting question about the format of the trie is how one
+stores the information about the transitions in a node. If we want to
+maintain that the size of the structure is proportional to the number of
+overall transitions, we need to be able to store node transitions
+sparsely. Typically this is done using a list of transition characters
+and binary searching among them to make a transition.
+
+This binary search can theoretically be taken to use constant time
+(since the alphabet size is small and predefined), but isn't the most
+efficient operation in practice due to the unpredictable branch
+instructions necessary for its implementation. It is preferable to avoid
+it as much as possible.
+
+To do this, and to shave a few additional bytes in common cases, our
+implementation of on-disk tries uses typed nodes. A node can be:
+
+-   Final with no transitions (`PAYLOAD_ONLY`).
+
+-   Having one transition (`SINGLE`), which has to store only the
+    character and target for that transition.
+
+-   Having a binary-searched list of transitions (`SPARSE`), where the
+    number of characters, each character and the targets are stored.
+
+-   Having a consecutive range of transitions (`DENSE`), where the first
+    and last character and targets are stored, possibly including some
+    null transitions.
+
+We use one byte per node to store four bits of node type as well as four
+bits of payload information.
+
+In a well-balanced and populated trie the nodes where lookup spends most
+time (the nodes closest to the root) are `DENSE` nodes, where finding the
+target for the transition is a direct calculation from the code of the
+character. On the other hand, most of the nodes (the ones closest to the
+leaves) are `PAYLOAD_ONLY`, `SINGLE` or `SPARSE` to avoid taking any more
+space than necessary.
+
+The main objective for the trie storage format is to achieve the
+smallest possible packing (and thus smallest cache usage and fewest disk
+reads), thus we choose the type that results in the smallest
+representation of the node. `DENSE` type gets chosen naturally when its
+encoding (which avoids storing the character list but may include null
+targets) is smaller than `SPARSE`.
+
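+As an illustration of the "smallest representation wins" rule, the sketch
+below compares the sizes of the 12-bit-pointer sparse and dense encodings for
+a given set of transition bytes, using the size formulas listed in the table
+in the on-disk format section below (the actual selection logic in `TrieNode`
+covers all type and pointer-size combinations):
+
+```java
+public class NodeTypeChoiceSketch
+{
+    // transitionBytes: the sorted, unsigned values of the node's transition characters.
+    static String chooseType12(int[] transitionBytes)
+    {
+        int cc = transitionBytes.length;                              // child count
+        int cs = transitionBytes[cc - 1] - transitionBytes[0] + 1;    // child span
+        int sparseSize = 2 + (int) Math.ceil(2.5 * cc);               // SPARSE_12 size in bytes
+        int denseSize = 3 + (int) Math.ceil(1.5 * cs);                // DENSE_12 size in bytes
+        return denseSize <= sparseSize ? "DENSE_12" : "SPARSE_12";
+    }
+
+    public static void main(String[] args)
+    {
+        // Nine nearly-consecutive children: the dense encoding is smaller (18 vs 25 bytes).
+        System.out.println(chooseType12(new int[]{1, 2, 3, 4, 5, 6, 7, 8, 10}));
+        // Ten widely spread children: the sparse encoding is smaller (27 vs 140 bytes).
+        System.out.println(chooseType12(new int[]{1, 11, 21, 31, 41, 51, 61, 71, 81, 91}));
+    }
+}
+```
+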
+## Pointer Sizes
+
+The next optimization we make in the storage format is based on the fact
+that most nodes in the trie are in the lower levels of the tree and thus
+close to leaves. As such, the distance between the node and its target
+transitions when laid out during the construction process is small and
+thus it is a huge win to store pointers as distances with variable size.
+
+This is even more true for the page-aware layout we use &mdash; all internal
+transitions within the page (i.e. >99% of all transitions in the trie!)
+can be stored as an offset within the page, using just 12 bits.
+
+This is heavily used via further specialization of the node types: e.g.
+we have `DENSE_12`, `DENSE_16` to `DENSE_40` as well as `DENSE_LONG`
+subtypes which differ in the size of pointer they use.
+
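+A small sketch of why this works: since children are always written before
+their parents (see the pointer description in the on-disk format section
+below), a pointer can be stored as the backward distance from the node, and a
+distance that stays within a 4096-byte page never needs more than 12 bits:
+
+```java
+public class PointerSizeSketch
+{
+    // Minimal number of bits needed for the backward distance from a node to a child.
+    static int bitsNeeded(long nodePosition, long childPosition)
+    {
+        long distance = nodePosition - childPosition;   // children precede their parent
+        return distance == 0 ? 1 : 64 - Long.numberOfLeadingZeros(distance);
+    }
+
+    public static void main(String[] args)
+    {
+        System.out.println(bitsNeeded(4_500, 4_100));         // same page: 9 bits, fits a *_12 type
+        System.out.println(bitsNeeded(50_000_000, 12_345));   // far-away child: needs a wider pointer type
+    }
+}
+```
+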
+# Primary indexing in the BTI format
+
+The purpose of the primary index of an sstable is to be able to map a
+key containing partition and clustering components to a position in the
+sstable data file which holds the relevant row or the closest row with a
+greater key, and to enable iteration of rows from that point on.
+
+Partition keys are normally fully specified, while clustering keys are
+often given partially or via a comparison relation. They are also
+treated differently by all the infrastructure and have historically had
+different index structures; we chose to retain this distinction for the
+time being and implement similar replacement structures using tries.
+
+## Partition index implementation details
+
+The primary purpose of the partition index is to map a specified
+partition key to a row index for the partition. It also needs to support
+iteration from a (possibly partially specified) partition position. The
+description below details mapping only; iteration is a trivial
+application of the trie machinery to the described structure.
+
+In addition to wide partitions where a row index is mandatory, Cassandra
+is often used for tables where the partitions have only a
+couple of rows, including ones where the partition key is the only
+component of the primary key, i.e. where row and partition are the same
+thing. For these situations it makes no sense to actually have a row
+index, and the partition index should point directly to the data.
+
+The application of tries to Cassandra's partition index uses the trie
+infrastructure described above to create a trie mapping unique
+byte-ordered partition key prefixes to either:
+
+-   A position in the row index file which contains the index of the
+    rows within that partition, or
+
+-   A position in the data file containing the relevant partition (if a
+    row index for it is not necessary).
+
+A single table can have both indexed and non-indexed rows. For
+efficiency the partition index stores the position as a single long,
+using its sign bit to differentiate between the two options[^3]. This
+value is stored with variable length &mdash; more precisely, we use the four
+bits provided in the node type byte to store the length of the pointer.
+
+[^3]: It needs to differentiate between 0 with index and 0 without
+index, however, so we use ~pos instead of -pos to encode
+direct-to-data mappings. This still allows sign expansion
+instructions to be used to convert e.g. `int` to `long`.
+
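+A minimal sketch of this encoding (the method names are illustrative, not the
+partition index's actual API):
+
+```java
+public class IndexPayloadSketch
+{
+    // Row-index positions are stored as-is; direct-to-data positions as ~position,
+    // so that position 0 remains distinguishable in both cases.
+    static long encode(long position, boolean hasRowIndex)
+    {
+        return hasRowIndex ? position : ~position;
+    }
+
+    static boolean hasRowIndex(long encoded)
+    {
+        return encoded >= 0;
+    }
+
+    static long position(long encoded)
+    {
+        return encoded >= 0 ? encoded : ~encoded;
+    }
+}
+```
+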
+Lookup in this index is accomplished by converting the decorated
+partition key to its byte-ordered representation and following the
+transitions for its bytes while the trie has any. If at any point the
+trie does not offer a transition for the next byte but is not a leaf
+node, the sstable does not contain a mapping for the given key.
+
+If a leaf of the trie is reached, then the prefix of the partition key
+matches some content in the file, but we are not yet sure if it is a
+full match for the partition key. The leaf node points to a place in the
+row index or data file. In either case the first bytes at the specified
+position contain a serialization of the partition key, which we can
+compare to the key being mapped. If it matches, we have found the
+partition. If not, since the stored prefixes are unique, no data for
+this partition exists in this sstable.
+
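+The stand-in sketch below captures the essential shape of this lookup: the
+trie walk yields at most one candidate position (here simulated with a sorted
+map of the stored unique prefixes), and a match is confirmed only by comparing
+the full key serialized at that position. It is an illustration of the
+procedure, not the actual reader code:
+
+```java
+import java.util.Map;
+import java.util.NavigableMap;
+import java.util.TreeMap;
+
+public class PartitionLookupSketch
+{
+    // Stand-in for the trie: the stored unique key prefixes mapped to file positions.
+    final NavigableMap<String, Long> prefixToPosition = new TreeMap<>();
+    // Stand-in for the data / row index file: position -> the full key serialized there.
+    final Map<Long, String> keyAtPosition = new TreeMap<>();
+
+    Long lookup(String key)
+    {
+        Map.Entry<String, Long> candidate = prefixToPosition.floorEntry(key);
+        if (candidate == null || !key.startsWith(candidate.getKey()))
+            return null;                                  // the trie walk falls off: no mapping
+        long position = candidate.getValue();
+        // The prefix matched; verify against the full key stored at the target position.
+        return key.equals(keyAtPosition.get(position)) ? position : null;
+    }
+}
+```
+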
+### Efficiency
+
+If everything is in cache this lookup is extremely efficient: it follows
+a few transitions in `DENSE` nodes plus one or two binary searches in
+`SPARSE` or `SINGLE`, and finishes with a direct comparison of a byte buffer
+with contents of a file. No object allocation or deserialization is
+necessary.
+
+If not all data is in cache, the performance of this lookup most heavily
+depends on the number of pages that must be fetched from persistent
+storage. The expectation on which this implementation is based is that,
+if an sstable is in use, all non-leaf pages of the index will tend to
+remain cached. If that expectation is met, lookup will only require
+fetching one leaf index page and one data/row index page for the full
+key comparison. On a match the latter fetch will be required anyway,
+since we would want to read the data at that position.
+
+An important consideration in the design of this feature was to make
+sure there is no situation in which the trie indices perform worse than
+the earlier code, thus we should aim to do at most as many reads. The
+number of random accesses for the earlier index implementation where an
+index summary is forced in memory is one _seek_ required to start
+reading from the partition index (though usually multiple consecutive
+pages need to be read), and one seek needed to start reading the actual
+data. Since the index summary ends up being of similar size to the
+non-leaf pages of the trie index, the memory usage and number of seeks
+for the trie index on match end up being the same, but we read less data
+and do much less processing.
+
+On mismatch, though, we may be making one additional seek. However, we
+can drastically reduce the chance of mismatch, which we currently do in
+two ways:
+
+-   By using a bloom filter before lookup. The chance of getting a bloom
+    filter hit as well as a prefix match for the wrong key is pretty
+    low and gets lower with increasing sstable size.
+
+-   By storing some of the key hash bits that are not part of the token
+    at the payload node and comparing them with the mapped key's hash
+    bits.
+
+Currently we use a combination of both by default as the best performing
+option. The user can disable or choose to have a smaller bloom filter,
+and the code also supports indices that do not contain hash bits (though
+to reduce configuration complexity we do not have plans to expose that
+option).
+
+For fully cold sstables we have to perform more random fetches from disk
+than the earlier implementation, but we read less. Testing showed that
+having a bloom filter is enough to make the trie index faster; if a
+bloom filter is not present, we try going through the byte contents of
+the index file on boot to prefetch it, which ends up not taking too long
+(since it is read sequentially rather than randomly) and boosts cold
+performance dramatically.
+
+### Building and early open
+
+The partition index is built using the page-aware incremental
+construction described earlier, where we also delay writing each key
+until we have seen the next so that we can find the shortest prefix that
+is enough to differentiate it from the previous and next keys (this also
+differentiates it from all others in the sstable because the contents
+are sorted). Only that prefix is written to the trie.
+
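+A sketch of the shortest-prefix computation (the helper names are
+illustrative, not the builder's API): the stored prefix only needs to be one
+byte longer than the longest common prefix the key shares with either of its
+neighbours in sorted order.
+
+```java
+import java.util.Arrays;
+
+public class UniquePrefixSketch
+{
+    static int commonPrefixLength(byte[] a, byte[] b)
+    {
+        int len = Math.min(a.length, b.length);
+        for (int i = 0; i < len; i++)
+            if (a[i] != b[i])
+                return i;
+        return len;
+    }
+
+    // Shortest prefix of `key` that differs from both the previous and the next key.
+    static byte[] shortestUniquePrefix(byte[] previous, byte[] key, byte[] next)
+    {
+        int needed = 1 + Math.max(commonPrefixLength(previous, key), commonPrefixLength(key, next));
+        return Arrays.copyOf(key, Math.min(needed, key.length));
+    }
+}
+```
+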
+One last complication is the support for early opening of sstables which
+allows newly-compacted tables to gradually occupy the page cache. Though
+the index building is incremental, the partially-written trie is not
+usable directly because the root of the trie as well as the path from it
+to the last written nodes is not yet present in the file.
+
+This problem can be easily overcome, though, by dumping these
+intermediate nodes to an in-memory buffer (without the need for
+page-aware packing) and forming an index by attaching this buffer at the
+end of the partially written file using 
+[`TailOverridingRebufferer`](../../../util/TailOverridingRebufferer.java).
+
+## Row index implementation details
+
+Unlike the partition index, the main use of the row index is to iterate
+from a given clustering key in forward or reverse direction (where exact
+key lookup is just a special case).
+
+Rows are often very small (they could contain a single int or no columns
+at all) and thus there is a real possibility for the row indices to
+become bigger than the data they represent. This is not a desirable
+outcome, which is part of the reason why Cassandra's row index has
+historically operated on blocks of rows rather than indexing every row
+in the partition. This is a concern we also have with the trie-based
+index, thus we also index blocks of rows (by default, a block of rows
+that is at least 16kb in size &mdash; this will be called the index
+_granularity_ below, specified by the `column_index_size`
+`cassandra.yaml` parameter).
+
+Our row index implementation thus creates a map from clustering keys or
+prefixes to the data position at the start of the index block which is
+the earliest that could contain a key equal to or greater than the given
+one. Additionally, if there is an active deletion at the beginning of
+the block, the index must specify it so that it can be taken into
+account when merging data from multiple sstables.
+
+Each index block will contain at least one key, but generally it will
+have different first and last keys. We don't store these keys, but
+instead we index the positions between blocks by storing a "separator",
+some key that is greater than the last key of the previous block and
+smaller than or equal to the first key of the next[^4]. Then, when we
+look up a given clustering, we follow its bytes as long as we can in the
+trie and we can be certain that all blocks before the closest
+less-than-or-equal entry in the trie cannot contain any data that is
+greater than or equal to the given key.
+
+[^4]: Another way to interpret this is that we index the start of each
+block only, but for efficiency we don't use the first key of the
+block as its beginning, but instead something closer to the last key
+of the previous block (but still greater than it).
+
+It may happen that the identified block actually doesn't contain any
+matching data (e.g. because the looked-up key ends up between the last
+key in the block and the separator), but this only affects efficiency as
+the iteration mechanism does not expect the data position returned by
+the index to be guaranteed to start with elements that fit the criteria;
+at worst it has to walk a whole block forward to find the matching
+key.
+
+It is important to keep the number of these false positives low, and at
+the same time we aim for the smallest possible size of the index for a
+given granularity. The choice of separator affects this balance[^5]; the
+option we use, as a good tradeoff in the vein of the unique prefix
+approach used in the partition index, is to use the shortest prefix of
+the next block's beginning key that separates it from the previous
+block's end key, adjusted so that the last byte of it is 1 greater than
+that end key.
+
+[^5]: For example, the best separator for false positives is the next
+possible byte sequence after the previous block's final key, which
+is obtained by adding a 00 byte to its end. This, however, means all
+the bytes of the byte-ordered representation of this key must be
+present in the index, which inflates the index's size and lookup
+complexity.
+
+For example, if block 2 covers "something" to "somewhere" and block 3
+&mdash; "sorry" to "tease", then the sequence "son" is used as the separator
+between blocks 2 and 3. This leaves things like "sommelier" in the area
+that triggers false positives, but stores and has to walk just three
+bytes to find the starting point for iteration.
+
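+A sketch of this separator construction, reproducing the example above (edge
+cases such as one key being a prefix of the other, or a 0xFF byte at the
+difference point, are ignored here):
+
+```java
+import java.util.Arrays;
+
+public class SeparatorSketch
+{
+    // Take the prefix up to the first byte where the next block's first key differs
+    // from the previous block's last key, and bump that byte of the previous key by one.
+    static byte[] separator(byte[] previousEnd, byte[] nextStart)
+    {
+        int i = 0;
+        while (i < previousEnd.length && i < nextStart.length && previousEnd[i] == nextStart[i])
+            i++;
+        byte[] result = Arrays.copyOf(previousEnd, i + 1);
+        result[i]++;                       // now greater than previousEnd, still <= nextStart
+        return result;
+    }
+
+    public static void main(String[] args)
+    {
+        byte[] sep = separator("somewhere".getBytes(), "sorry".getBytes());
+        System.out.println(new String(sep));   // prints "son", as in the example above
+    }
+}
+```
+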
+### Efficiency
+
+Finding the candidate block in the trie involves walking the byte
+ordered representation of the clustering key in the trie and finding the
+closest less-than-or-equal value. The number of steps is proportional to
+the length of the separators &mdash; the lower their number the shorter that
+sequence is, though we can't expect _O_(log _n_) complexity since there may
+be many items sharing the same long prefixes (e.g. if there are long
+strings in the components of the clustering keys before the last). Even
+so, such repeating prefixes are addressed very well by the page-packing
+and `SINGLE_NOPAYLOAD_4` node type, resulting in very efficient walks.
+
+After this step we also perform a linear walk within the data file to
+find the actual start of the matching data. This is usually costlier and
+may involve object allocation and deserialization.
+
+The tradeoff between the size of the index and the time it takes to find
+the relevant rows is controlled by the index granularity. The lower it
+is, the more efficient lookup (especially exact match lookup) becomes at
+the expense of bigger index size. The 16kb default is chosen pretty
+conservatively[^6]; if users don't mind bigger indices something like 4,
+2 or 1kb granularity should be quite a bit more efficient. It is also
+possible to index every row by choosing a granularity of 0kb; at these
+settings in-cache trie-indexed sstables tend to outperform
+`ConcurrentSkipListMap` memtables for reads.
+
+[^6]: This was chosen with the aim to match the size of the trie index
+compared to the earlier version of the row index at its default
+granularity of 64kb.
+
+### Reverse lookup
+
+To perform a reverse lookup, we can use the same mechanism as above
+(with greater-than-or-equal) to find the initial block for the
+iteration. However, in the forward direction we could simply walk the
+data file to find the next rows, but this isn't possible going
+backwards.
+
+To solve this problem the index helps the iteration machinery by
+providing an iterator of index blocks in reverse order. For each index
+block the iteration walks it forward and creates a stack of all its row
+positions, then starts issuing rows by popping and examining rows from
+that stack. When the stack is exhausted it requests the previous block
+from the index and applies the same procedure there.
+
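+A sketch of this mechanism, where `blocksInReverse` stands in for the index's
+reverse block iterator and each inner list for the row positions discovered by
+walking one block forward (both are illustrative placeholders):
+
+```java
+import java.util.ArrayDeque;
+import java.util.Deque;
+import java.util.List;
+
+public class ReverseIterationSketch
+{
+    static void iterateRowsInReverse(List<List<Long>> blocksInReverse)
+    {
+        for (List<Long> blockRowPositions : blocksInReverse)       // blocks, last one first
+        {
+            Deque<Long> stack = new ArrayDeque<>();
+            for (long rowPosition : blockRowPositions)              // forward walk of one block
+                stack.push(rowPosition);
+            while (!stack.isEmpty())
+                System.out.println("emit row at " + stack.pop());   // rows come out in reverse order
+        }
+    }
+}
+```
+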
+# Code structure
+
+The implementation is mostly in two packages: `o.a.c.io.tries`, which contains
+the generic code to construct and read on-disk tries, and
+`o.a.c.io.sstable.format.bti`, which implements the specifics of the
+format and the two indexes.
+
+## Building tries
+
+Tries are built from sorted keys using an [`IncrementalTrieWriter`](../../../tries/IncrementalTrieWriter.java).
+The code contains three implementations with increasing complexity:
+- [`IncrementalTrieWriterSimple`](../../../tries/IncrementalTrieWriterSimple.java)
+  implements simple incremental construction of tries from sorted input,
+- [`IncrementalTrieWriterPageAware`](../../../tries/IncrementalTrieWriterPageAware.java)
+  adds packing of nodes to disk pages,
+- [`IncrementalDeepTrieWriterPageAware`](../../../tries/IncrementalDeepTrieWriterPageAware.java)
+  adds the ability to transition to on-heap recursion for all stages of the construction
+  process to be able to handle very large keys.
+
+Only the latter is used, but we provide (and test) the other two as a form of
+documentation.
+
+The builders take a `TrieSerializer` as parameter, which determines how the nodes
+are written. The indexes implement this using `TrieNode`, writing any payload they
+need immediately after the node serialization.
+
+## Reading tries
+
+The BTI format tries are used directly in their on-disk format. To achieve this,
+all node types are implemented as static objects in `TrieNode`. Reading nodes in
+a file is encapsulated in [`Walker`](../../../tries/Walker.java),
+which provides a method to `go` to a specific node and use it, i.e.
+get any associated data, search in the children list and
+follow transitions to children. It also provides functionality to find the
+mapping for a given key, floors and ceilings as well as some combinations.
+Iterating the payloads between two key bounds is implemented by
+[`ValueIterator`](../../../tries/ValueIterator.java)
+and [`ReverseValueIterator`](../../../tries/ReverseValueIterator.java).
+
+Special care is given to prefixes to make sure the semantics of searches match
+what the format needs.
+
+## SSTable format implementation
+
+The two indexes are implemented, respectively, by [`PartitionIndex`](PartitionIndex.java)
+/[`PartitionIndexBuilder`](PartitionIndexBuilder.java)
+and [`RowIndexReader`](RowIndexReader.java)/[`RowIndexWriter`](RowIndexWriter.java).
+The format implementation extends the filtered
+base class and follows the structure of the BIG implementation, where
+all references to the primary index are replaced with calls to these two
+classes.
+
+# Index file format in BTI
+
+## Trie nodes
+Implemented in [`TrieNode.java`](../../../tries/TrieNode.java)
+
+Nodes start with four bits of node type, followed by 4 payload bits
+(_pb_), which are 0 if the node has no associated payload; otherwise the
+node type gives an option to compute the starting position for the
+payload (_ppos_) from the starting position of the node (_npos_).
+The layout of the node depends on its type.
+
+`PAYLOAD_ONLY` nodes:
+
+-   4 type bits, 0
+
+-   4 payload bits
+
+-   payload if _pb_ &ne; 0, _ppos_ is _npos_ + 1
+
+`SINGLE_NOPAYLOAD_4` and `SINGLE_NOPAYLOAD_12` nodes:
+
+-   4 type bits
+
+-   4 pointer bits
+
+-   8 pointer bits (for `SINGLE_NOPAYLOAD_12`)
+
+-   8 bits transition byte
+
+-   _pb_ is assumed 0
+
+`SINGLE_8/16`:
+
+-   4 type bits
+
+-   4 payload bits
+
+-   8 bits transition byte
+
+-   8/16 pointer bits
+
+-   payload if _pb_ &ne; 0, _ppos_ is _npos_ + 3/4
+
+`SPARSE_8/12/16/24/40`:
+
+-   4 type bits
+
+-   4 payload bits
+
+-   8 bit child count
+
+-   8 bits per child, the transition bytes
+
+-   8/12/16/24/40 bits per child, the pointers
+
+-   payload if _pb_ &ne; 0, _ppos_ is _npos_ + 2 + (2/2.5/3/4/6)*(_child
+    count_) (rounded up)
+
+`DENSE_12/16/24/32/40/LONG`:
+
+-   4 type bits
+
+-   4 payload bits
+
+-   8 bit start byte value
+
+-   8 bit _length_-1
+
+-   _length_ * 12/16/24/32/40/64 bits per child, the pointers
+
+-   payload if _pb_ &ne; 0, _ppos_ is _npos_ + 3 + (1.5/2/3/4/5/8)*(_length_)
+    (rounded up)
+
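+For illustration, the first byte of every node can be split like this (the
+mapping of the four type bits to the concrete types above is defined in
+`TrieNode`):
+
+```java
+public class NodeHeaderSketch
+{
+    // High 4 bits select the node type.
+    static int typeOrdinal(byte firstByte)
+    {
+        return (firstByte >> 4) & 0xF;
+    }
+
+    // Low 4 bits are the payload bits (pb); 0 means the node carries no payload.
+    static int payloadBits(byte firstByte)
+    {
+        return firstByte & 0xF;
+    }
+}
+```
+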
+This is the space taken by each node type (_CS_ stands for child span,
+i.e. largest - smallest + 1, _CC_ is child count):
+
+|Type                 | Size in bytes excl. payload |Size for 1 child|Size for 9 dense children (01-08, 10)|Size for 10 sparse children (01 + i*10)|Why the type is needed          |
+|:--------------------|:----------------------------|---------------:|-------------------:|-------------------:|:-------------------------------|
+|`PAYLOAD_ONLY`       | 1                           | -              | -                  | -                  |  Leaves dominate the trie      |
+|`SINGLE_NOPAYLOAD_4` | 2                           | 2              | -                  | -                  |  Single-transition chains      |
+|`SINGLE_8`           | 3                           | 3              | -                  | -                  |  Payload within chain          |
+|`SPARSE_8`           | 2 + _CC_ * 2                | 4              | 20                 | 22                 |  Most common type after leaves |
+|`SINGLE_NOPAYLOAD_12`| 3                           | 3              | -                  | -                  |  12 bits cover all in-page transitions |
+|`SPARSE_12`          | 2 + _CC_ * 2.5              | 5              | 25                 | 27                 |  Size of sparse is proportional to number of children |
+|`DENSE_12`           | 3 + _CS_ * 1.5              | 5              | 18                 | 140                |  Lookup in dense is faster, size smaller if few holes |
+|`SINGLE_16`          | 4                           | 4              | -                  | -                  |                                |
+|`SPARSE_16`          | 2 + _CC_ * 3                | 5              | 29                 | 32                 |                                |
+|`DENSE_16`           | 3 + _CS_ * 2                | 5              | 23                 | 185                |                                |
+|`SPARSE_24`          | 2 + _CC_ * 4                | 6              | 38                 | 42                 |                                |
+|`DENSE_24`           | 3 + _CS_ * 3                | 6              | 33                 | 276                |                                |
+|`DENSE_32`           | 3 + _CS_ * 4                | 7              | 43                 | 367                |  Nodes with big subtrees are usually dense |
+|`SPARSE_40`          | 2 + _CC_ * 6                | 8              | 56                 | 62                 |                                |
+|`DENSE_40`           | 3 + _CS_ * 5                | 8              | 53                 | 458                |                                |
+|`DENSE_LONG`         | 3 + _CS_ * 8                | 11             | 83                 | 731                |  Catch-all                     |
+
+All pointers are stored as distances, and since all tries are written
+from the bottom up (and hence a child is always before the parent in the
+file), the distance is subtracted from the position of the current node
+to obtain the position of the child node.
+
+Note: All nodes are placed in such a way that they do not cross a page
+boundary. I.e. if a reader (e.g. [`Walker`](../../../tries/Walker.java)) 
+is positioned at a node, it is
+guaranteed that all reads of the node's data can complete without
+requiring a different page to be fetched from disk.
+
+## Partition index
+Implemented in [`PartitionIndex.java`](PartitionIndex.java)
+
+Layout:
+```
+[nodes page, 4096 bytes]
+...
+[nodes page, 4096 bytes]
+[nodes page including root node, ≤4096 bytes]
+[smallest key, with short length]
+[largest key, with short length]
+[smallest key pos, long]
+[key count, long]
+[root pos, long]
+```
+
+The SSTable's partition index is stored in the -Partitions.db file. The
+file itself is written from the bottom up, and its "header" is at the
+end of the file.
+
+More precisely, the last 3 longs in the file contain:
+
+-   A file position where the smallest and greatest key are written.
+
+-   The exact number of keys in the file.
+
+-   A file position for the root node of the index.
+
+These three longs are preceded by the serialization of the first and
+last key, and before that are the trie contents.
+
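+A sketch of reading this trailer with a plain `RandomAccessFile` (the actual
+reader goes through Cassandra's file handle machinery; the file name is
+whatever the sstable descriptor gives for the partition index component):
+
+```java
+import java.io.IOException;
+import java.io.RandomAccessFile;
+
+public class PartitionIndexFooterSketch
+{
+    static void readFooter(String partitionsFile) throws IOException
+    {
+        try (RandomAccessFile file = new RandomAccessFile(partitionsFile, "r"))
+        {
+            // The "header" sits at the very end of the file: the last three longs.
+            file.seek(file.length() - 3 * Long.BYTES);
+            long firstKeyPosition = file.readLong();    // where the smallest/largest keys are serialized
+            long keyCount = file.readLong();            // exact number of keys in the sstable
+            long rootPosition = file.readLong();        // position of the trie's root node
+            System.out.printf("keys=%d root=%d firstKeyPos=%d%n", keyCount, rootPosition, firstKeyPosition);
+        }
+    }
+}
+```
+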
+To find a match for the key, start at the root position, decode the node
+(see the "Trie nodes" section above) and follow the transitions
+according to the bytes of the byte-ordered representation of the key
+while the node has children and there are bytes left in the key.
+
+If a leaf node is reached, that node contains the following payload:
+
+-   If _pb_ < 8, let
+
+    -   _idxpos_ be the sign-extended integer value of length _pb_ at
+        _ppos_
+
+-   If _pb_ &ge; 8 (always the case in DSE 6 files), let

Review Comment:
   Done



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

