maedhroz commented on code in PR #2267:
URL: https://github.com/apache/cassandra/pull/2267#discussion_r1170661912


##########
src/java/org/apache/cassandra/io/sstable/format/bti/BtiFormat.md:
##########
@@ -0,0 +1,991 @@
+Big Trie-Indexed (BTI) SSTable format
+-------------------------------------
+
+This document describes the BTI SSTable format, which is introduced to
+Cassandra with [CEP-25](https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-25%3A+Trie-indexed+SSTable+format).
+
+The format is called BTI, which stands for "Big Trie-Indexed", because it
+shares the data format of the existing BIG format and only changes the
+primary indexes inside SSTables. The format relies on byte order and
+tries, and uses a combination of features to make the indexing
+structures compact and efficient. The paragraphs below describe the
+format's features and mechanisms together with the motivation behind
+them, and conclude with a detailed description of the on-disk format.
+
+# Prerequisites
+
+## Byte-comparable types
+
+The property of being byte-comparable (also called byte-ordered) for a
+key denotes that there is a serialisation of that key to a sequence of
+bytes where the lexicographic comparison of the unsigned bytes produces
+the same result as performing a typed comparison of the keys.
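
To illustrate the property (this is only a toy sketch; Cassandra's actual translation is given by the work referenced below), a signed 32-bit integer can be given such a serialisation by flipping its sign bit and writing the result big-endian:

```java
// Toy sketch: flipping the sign bit of a signed int yields 4 bytes whose
// unsigned lexicographic order matches the signed order of the values.
public class ByteComparableInt {
    public static byte[] encode(int v) {
        int flipped = v ^ 0x80000000;              // flip the sign bit
        return new byte[] { (byte) (flipped >>> 24), (byte) (flipped >>> 16),
                            (byte) (flipped >>> 8), (byte) flipped };
    }

    // Lexicographic comparison of unsigned bytes.
    public static int compareUnsigned(byte[] a, byte[] b) {
        for (int i = 0; i < Math.min(a.length, b.length); i++) {
            int c = Integer.compare(a[i] & 0xFF, b[i] & 0xFF);
            if (c != 0) return c;
        }
        return Integer.compare(a.length, b.length);
    }
}
```

Here `compareUnsigned(encode(-1), encode(0))` is negative, matching the signed order of the original values.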
+
+For Cassandra, such a representation is given by
+[CASSANDRA-6936](https://issues.apache.org/jira/browse/CASSANDRA-6936).
+A detailed description of the mechanics of the translation is provided
+in the [included documentation](../../../../utils/bytecomparable/ByteComparable.md).
+
+## Tries
+
+A trie is a data structure that describes a mapping between sequences
+and associated values. It is very similar to a deterministic finite
+state automaton, with the main difference that an automaton is allowed
+to have cycles, while a trie is not.
+
+Because the theory and main usage of the structure is for encoding words
+of a language, the trie terminology talks about "characters", "words"
+and "alphabet", which in our case map to bytes of the byte-ordered
+representation, the sequence that encodes it, and the possible values of
+a byte[^1].
+
+[^1]: For simplicity this description assumes we directly map a byte to
+a character. Other options are also possible (e.g. using hex digits
+as the alphabet and two transitions per byte).
+
+A trie can be defined as a tree graph in which vertices are states, some
+of which can be final and contain associated information, and where
+edges are labelled with characters. A valid word in the trie is encoded
+by a path starting from the root of the trie where each edge is labelled
+with the next character of the word, and ending in a final state which
+contains the 'payload' associated with the word.
+
+```mermaid
+graph TD
+  Node_(( ))
+  style Node_ fill:darkgrey
+  Node_ --"a"--> Node_a(((a)))
+    Node_a --"l"--> Node_al(( ))
+      Node_al --"l"--> Node_all(( ))
+        Node_all --"o"--> Node_allo(( ))
+          Node_allo --"w"--> Node_allow(((allow)))
+    Node_a --"n"--> Node_an(((an)))
+      Node_an --"d"--> Node_and(((and)))
+      Node_an --"y"--> Node_any(((any)))
+    Node_a --"r"--> Node_ar(( ))
+      Node_ar --"e"--> Node_are(((are)))
+    Node_a --"s"--> Node_as(((as)))
+  Node_ --"n"--> Node_n(( ))
+    Node_n --"o"--> Node_no(( ))
+      Node_no --"d"--> Node_nod(( ))
+        Node_nod --"e"--> Node_node(((node)))
+  Node_ --"o"--> Node_o(( ))
+    Node_o --"f"--> Node_of(((of)))
+    Node_o --"n"--> Node_on(((on)))
+  Node_ --"t"--> Node_t(( ))
+    Node_t --"h"--> Node_th(( ))
+      Node_th --"e"--> Node_the(((the)))
+      Node_th --"i"--> Node_thi(( ))
+        Node_thi --"s"--> Node_this(((this)))
+    Node_t --"o"--> Node_to(((to)))
+    Node_t --"r"--> Node_tr(( ))
+      Node_tr --"i"--> Node_tri(( ))
+        Node_tri --"e"--> Node_trie(((trie)))
+    Node_t --"y"--> Node_ty(( ))
+      Node_ty --"p"--> Node_typ(( ))
+        Node_typ --"e"--> Node_type(( ))
+          Node_type --"s"--> Node_types(((types)))
+  Node_ --"w"--> Node_w(( ))
+    Node_w --"i"--> Node_wi(( ))
+      Node_wi --"t"--> Node_wit(( ))
+        Node_wit --"h"--> Node_with(((with)))
+          Node_with --"o"--> Node_witho(( ))
+            Node_witho --"u"--> Node_withou(( ))
+              Node_withou --"t"--> Node_without(((without)))
+```
+
+This means that in a constructed trie, finding the payload associated
+with a word is a matter of following the edges (also called
+"transitions") from the initial state, labelled with the consecutive
+characters of the word, and retrieving the payload associated with the
+state at which we end up. If that is not a final state, or if at any
+point along the way there is no transition matching the next character,
+the trie has no association for the word. The complexity of lookup is
+thus _O_(len(word)) transitions, and since the cost of taking a
+transition is usually constant, this complexity is theoretically
+optimal.
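
This lookup can be sketched with a simple in-memory trie (a map-based toy, not the on-disk structure described later in this document):

```java
import java.util.HashMap;
import java.util.Map;

// Minimal in-memory trie sketch: each node keeps a sparse map of byte
// transitions and an optional payload; a non-null payload marks a final state.
public class SimpleTrie<V> {
    private final Map<Byte, SimpleTrie<V>> children = new HashMap<>();
    private V payload;                        // non-null only for final states

    public void put(byte[] word, V value) {
        SimpleTrie<V> node = this;
        for (byte b : word)
            node = node.children.computeIfAbsent(b, k -> new SimpleTrie<>());
        node.payload = value;
    }

    // Follow one transition per input byte; O(len(word)) as described above.
    public V get(byte[] word) {
        SimpleTrie<V> node = this;
        for (byte b : word) {
            node = node.children.get(b);
            if (node == null) return null;    // missing transition: word absent
        }
        return node.payload;                  // null unless a final state
    }
}
```

After `put("this".getBytes(), 2)`, `get("this".getBytes())` follows four transitions and returns the payload, while `get("th".getBytes())` ends at a non-final state and returns `null`.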
+
+From a storage space perspective, one of the main benefits of a trie as
+a data structure for storing a map is the fact that it completely avoids
+storing redundant prefixes. All words that start with the same sequence
+store a representation of that sequence only once. If prefixes are
+commonly shared, this can save a great deal of space.
+
+When the items stored in a trie are lexicographically (=byte) ordered, a
+trie is also an ordered structure. A trie can be walked in order and it
+is also possible to efficiently list the items between two given keys.
+
+In fact, one can efficiently (and lazily) apply set algebra over tries,
+and slicing can be seen as a simple application of intersection, where
+the intersecting trie is generated on the fly. The set operations
+benefit from the same prefix-sharing effect — we apply union /
+intersection / difference to a state, which has the effect of applying
+the operation to all words that share the prefix denoted by that state.
+
+```mermaid
+graph TD
+  Node_(( ))
+  style Node_ fill:darkgrey
+  Node_ --"a"--> Node_a(((a)))
+  style Node_a stroke:lightgrey,color:lightgrey
+  linkStyle 0 stroke:lightgrey,color:lightgrey
+    Node_a --"l"--> Node_al(( ))
+    style Node_al stroke:lightgrey,color:lightgrey
+    linkStyle 1 stroke:lightgrey,color:lightgrey
+      Node_al --"l"--> Node_all(( ))
+      style Node_all stroke:lightgrey,color:lightgrey
+      linkStyle 2 stroke:lightgrey,color:lightgrey
+        Node_all --"o"--> Node_allo(( ))
+        style Node_allo stroke:lightgrey,color:lightgrey
+        linkStyle 3 stroke:lightgrey,color:lightgrey
+          Node_allo --"w"--> Node_allow(((allow)))
+          style Node_allow stroke:lightgrey,color:lightgrey
+          linkStyle 4 stroke:lightgrey,color:lightgrey
+    Node_a --"n"--> Node_an(((an)))
+          style Node_an stroke:lightgrey,color:lightgrey
+          linkStyle 5 stroke:lightgrey,color:lightgrey
+      Node_an --"d"--> Node_and(((and)))
+          style Node_and stroke:lightgrey,color:lightgrey
+          linkStyle 6 stroke:lightgrey,color:lightgrey
+      Node_an --"y"--> Node_any(((any)))
+          style Node_any stroke:lightgrey,color:lightgrey
+          linkStyle 7 stroke:lightgrey,color:lightgrey
+    Node_a --"r"--> Node_ar(( ))
+          style Node_ar stroke:lightgrey,color:lightgrey
+          linkStyle 8 stroke:lightgrey,color:lightgrey
+      Node_ar --"e"--> Node_are(((are)))
+          style Node_are stroke:lightgrey,color:lightgrey
+          linkStyle 9 stroke:lightgrey,color:lightgrey
+    Node_a --"s"--> Node_as(((as)))
+          style Node_as stroke:lightgrey,color:lightgrey
+          linkStyle 10 stroke:lightgrey,color:lightgrey
+  Node_ --"n"--> Node_n(( ))
+    Node_n --"o"--> Node_no(( ))
+      Node_no --"d"--> Node_nod(( ))
+        Node_nod --"e"--> Node_node(((node)))
+  Node_ --"o"--> Node_o(( ))
+    Node_o --"f"--> Node_of(((of)))
+    Node_o --"n"--> Node_on(((on)))
+  Node_ --"t"--> Node_t(( ))
+    Node_t --"h"--> Node_th(( ))
+      Node_th --"e"--> Node_the(((the)))
+      Node_th --"i"--> Node_thi(( ))
+        Node_thi --"s"--> Node_this(((this)))
+          style Node_this stroke:lightgrey,color:lightgrey
+          linkStyle 22 stroke:lightgrey,color:lightgrey
+    Node_t --"o"--> Node_to(((to)))
+          style Node_to stroke:lightgrey,color:lightgrey
+          linkStyle 23 stroke:lightgrey,color:lightgrey
+    Node_t --"r"--> Node_tr(( ))
+          style Node_tr stroke:lightgrey,color:lightgrey
+          linkStyle 24 stroke:lightgrey,color:lightgrey
+      Node_tr --"i"--> Node_tri(( ))
+          style Node_tri stroke:lightgrey,color:lightgrey
+          linkStyle 25 stroke:lightgrey,color:lightgrey
+        Node_tri --"e"--> Node_trie(((trie)))
+          style Node_trie stroke:lightgrey,color:lightgrey
+          linkStyle 26 stroke:lightgrey,color:lightgrey
+    Node_t --"y"--> Node_ty(( ))
+          style Node_ty stroke:lightgrey,color:lightgrey
+          linkStyle 27 stroke:lightgrey,color:lightgrey
+      Node_ty --"p"--> Node_typ(( ))
+          style Node_typ stroke:lightgrey,color:lightgrey
+          linkStyle 28 stroke:lightgrey,color:lightgrey
+        Node_typ --"e"--> Node_type(( ))
+          style Node_type stroke:lightgrey,color:lightgrey
+          linkStyle 29 stroke:lightgrey,color:lightgrey
+          Node_type --"s"--> Node_types(((types)))
+          style Node_types stroke:lightgrey,color:lightgrey
+          linkStyle 30 stroke:lightgrey,color:lightgrey
+  Node_ --"w"--> Node_w(( ))
+          style Node_w stroke:lightgrey,color:lightgrey
+          linkStyle 31 stroke:lightgrey,color:lightgrey
+    Node_w --"i"--> Node_wi(( ))
+          style Node_wi stroke:lightgrey,color:lightgrey
+          linkStyle 32 stroke:lightgrey,color:lightgrey
+      Node_wi --"t"--> Node_wit(( ))
+          style Node_wit stroke:lightgrey,color:lightgrey
+          linkStyle 33 stroke:lightgrey,color:lightgrey
+        Node_wit --"h"--> Node_with(((with)))
+          style Node_with stroke:lightgrey,color:lightgrey
+          linkStyle 34 stroke:lightgrey,color:lightgrey
+          Node_with --"o"--> Node_witho(( ))
+          style Node_witho stroke:lightgrey,color:lightgrey
+          linkStyle 35 stroke:lightgrey,color:lightgrey
+            Node_witho --"u"--> Node_withou(( ))
+          style Node_withou stroke:lightgrey,color:lightgrey
+          linkStyle 36 stroke:lightgrey,color:lightgrey
+              Node_withou --"t"--> Node_without(((without)))
+          style Node_without stroke:lightgrey,color:lightgrey
+          linkStyle 37 stroke:lightgrey,color:lightgrey
+
+```
+
+(An example of slicing the trie above with the range "bit"-"thing".
+Processing only applies at the boundary nodes (root, "t", "th", "thi"),
+where we throw away the transitions outside the range. Subtries like the
+ones for "n" and "o" fall completely between "b" and "t", thus are fully
+inside the range and can be processed without any further restrictions.)
+
+A trie can be used as a modifiable in-memory data structure where one
+can add and remove individual elements. It can also be constructed from
+sorted input, incrementally storing the data directly to disk and
+building an efficient read-only on-disk data structure.
+
+For more formal information about the concept and applications of tries
+and finite state automata, try [Introduction to Automata Theory,
+Languages, and Computation](https://books.google.bg/books?id=dVipBwAAQBAJ).
+There are many variations of the concept, and of the implementation of
+states and transitions that can be put to use to achieve even further
+efficiency gains; some of these will be detailed below.
+
+# Indexing with tries
+
+Since a trie is generally an ordered byte source to payload map, we can
+apply the concept directly to the components of Cassandra that are most
+affected by the inefficiency of using comparison-based structures: the
+indices.
+
+This can be done in the following way:
+
+-   When we write the index, we map each key into its byte-ordered
+    representation and create an on-disk trie of byte-ordered
+    representations of keys mapping into positions in the data file.
+
+-   When we need an exact match for a key, we create a (lazily
+    generated) byte-ordered representation of the key and look for it
+    in the trie.
+
+    -   If we find a match, we know the data file position.
+
+    -   If there is no match, there is no data associated with the key.
+
+-   When we need a greater-than/greater-or-equal match, we use the
+    byte-ordered representation to create a path that leads to the
+    first matching data position in the sstable.
+
+    -   We can then use this path to iterate the greater keys in the
+        sstable.
+
+This works, but isn't very efficient. Lookup is _O_(len(key)), which can
+mean as many seeks on disk, and we have to store a transition (which
+defines the size of the structure) for every non-prefix character in the
+dataset.
+
+We can do much better.
+
+## Trimming the fat
+
+The primary purpose of the index is to find a position in the data file
+for the given key. It needs to be able to find the correct position for
+any existing key, but there is no need for it to be exact on keys that
+are not present in the file — since our data files contain a copy of
+the key at the start of each partition, we can simply check if the key
+we are searching for matches the key at the position returned by the
+index.
+
+This allows us to use a simple optimization: instead of storing the full
+key in the index trie, we can store only a prefix of the key that is
+unique among all partitions in the table. This means that we have
+intermediate nodes in the trie only if a prefix is shared by multiple
+keys, which normally reduces the number of nodes and transitions in the
+trie to about 2*n*.
+
+```mermaid
+graph TD
+  Node_(( ))
+  style Node_ fill:darkgrey
+  Node_  --"a"--> Node_a((( )))
+    Node_a --"l"--> Node_al((( )))
+    Node_a --"n"--> Node_an((( )))
+      Node_an --"d"--> Node_and((( )))
+      Node_an --"y"--> Node_any((( )))
+    Node_a --"r"--> Node_ar((( )))
+    Node_a --"s"--> Node_as((( )))
+  Node_  --"n"--> Node_n((( )))
+  Node_  --"o"--> Node_o(( ))
+    Node_o --"f"--> Node_of((( )))
+    Node_o --"n"--> Node_on((( )))
+  Node_  --"t"--> Node_t(( ))
+    Node_t --"h"--> Node_th(( ))
+      Node_th --"e"--> Node_the((( )))
+      Node_th --"i"--> Node_thi((( )))
+    Node_t --"o"--> Node_to((( )))
+    Node_t --"r"--> Node_tr((( )))
+    Node_t --"y"--> Node_ty((( )))
+  Node_  --"w"--> Node_w(( ))
+    Node_w --"ith"--> Node_with((( )))
+          Node_with --"o"--> Node_without((( )))
+```
+
+This also reduces the number of steps we need to take in the trie. In a
+well-balanced key set (such as the one where the byte-ordered key starts
+with a hash as in Murmur or Random-partitioned primary keys) the lookup
+complexity becomes _O_(log _n_) transitions[^2].
+
+[^2]: For comparison, the complexity of binary search in a sorted
+primary index is also _O_(log _n_), but in key comparisons whose
+complexity on average in a well-balanced key set is another _O_(log _n_)
+for a total _O_(log<sup>2</sup> _n_).
+
+## Taking hardware into account
+
+The point above improves the number of transitions significantly, but
+the out-of-cache efficiency is still pretty bad if we have to read a new
+disk page every time we examine a node. Fortunately we can take some
+extra care during construction to make sure we make the most of every
+disk page brought up during lookup.
+
+The idea is to pack wide sections of the trie into pages, so that every
+time we open a page we can be certain to be able to follow several
+transitions before leaving that page.
+
+```mermaid
+graph TD
+  subgraph p1 [ ]
+  Node_(( ))
+  style Node_ fill:darkgrey
+    Node_  --"a"--> Node_a((( )))
+    Node_  --"t"--> Node_t(( ))
+  end
+  
+  subgraph p2 [ ]
+    Node_a --"l"--> Node_al((( )))
+    Node_a --"n"--> Node_an((( )))
+      Node_an --"d"--> Node_and((( )))
+      Node_an --"y"--> Node_any((( )))
+  end
+  
+  subgraph p3 [ ]
+    Node_a --"r"--> Node_ar((( )))
+    Node_a --"s"--> Node_as((( )))
+  Node_  --"n"--> Node_n((( )))
+  end
+  
+  subgraph p4 [ ]
+  Node_  --"o"--> Node_o(( ))
+    Node_o --"f"--> Node_of((( )))
+    Node_o --"n"--> Node_on((( )))
+    Node_t --"o"--> Node_to((( )))
+  end
+  
+  subgraph p5 [ ]
+    Node_t --"h"--> Node_th(( ))
+      Node_th --"e"--> Node_the((( )))
+      Node_th --"i"--> Node_thi((( )))
+    Node_t --"r"--> Node_tr((( )))
+  end
+  
+  subgraph p6 [ ]
+    Node_t --"y"--> Node_ty((( )))
+  Node_  --"w"--> Node_w(( ))
+    Node_w --"ith"--> Node_with((( )))
+          Node_with --"o"--> Node_without((( )))
+  end
+  
+  p2 ~~~ p3 ~~~ p4 ~~~ p5 ~~~ p6
+```
+
+One way to generate such a layout is to start from the root and do a
+breadth-first walk, placing the encountered nodes on disk until a page
+is filled, and putting their target transitions in a queue from which
+the process is repeated to fill further pages.
+
+Another approach, more suitable to our application because it can be
+done as part of the incremental construction process, is to do the
+packing from the bottom up &mdash; when the incremental construction
+algorithm completes a node we do not immediately write it, but wait
+until we have formed a branch that is bigger than a page. When this
+happens we lay out the node's children (each smaller than a page, but
+each the root of the biggest branch that would fit), and from then on
+treat the parent node like a leaf. In turn it will become part of a
+branch that is bigger than a page and will be laid out packaged together
+with its related nodes, resulting in a picture similar to the above.
+
+In fact the bottom-up process has a slight performance benefit over the
+top-down one: with top-down construction the root page is full and leaf
+pages take combinations of unrelated smaller branches; with bottom-up
+construction the leaf pages take as much information as possible about a
+branch, while the root often remains unfilled. For the best possible
+out-of-cache efficiency we would prefer the set of non-leaf pages to be
+as small as possible. Having larger leaf-page branches means more of the
+trie data is in the leaf branches, and thus the size of that
+intermediate node set is smaller.
+
+See [`IncrementalTrieWriterPageAware`](../../../tries/IncrementalDeepTrieWriterPageAware.java)
+for details on how the page-aware trie construction is implemented.
+
+## Storing the trie
+
+Another interesting question about the format of the trie is how one
+stores the information about the transitions in a node. If we want to
+maintain that the size of the structure is proportional to the number of
+overall transitions, we need to be able to store node transitions
+sparsely. Typically this is done using a list of transition characters
+and binary searching among them to make a transition.
+
+This binary search can theoretically be considered constant time (since
+the alphabet size is small and predefined), but it isn't the most
+efficient operation in practice due to the unpredictable branches
+required to implement it. It is preferable to avoid it as much as
+possible.
+
+To do this, and to shave a few additional bytes in common cases, our
+implementation of on-disk tries uses typed nodes. A node can be:
+
+-   Final with no transitions (`PAYLOAD_ONLY`).
+
+-   With a single transition (`SINGLE`), which has to store only the
+    character and target for that transition.
+
+-   With a binary-searched list of transitions (`SPARSE`), where the
+    number of characters, each character and the targets are stored.
+
+-   With a consecutive range of transitions (`DENSE`), where the first
+    and last character and all targets are stored, possibly including
+    some null transitions.
+
+We use one byte per node to store four bits of node type as well as four
+bits of payload information.
+
+In a well-balanced and populated trie the nodes where lookup spends most
+time (the nodes closest to the root) are `DENSE` nodes, where finding the
+target for the transition is a direct calculation from the code of the
+character. On the other hand, most of the nodes (the ones closest to the
+leaves) are `PAYLOAD_ONLY`, `SINGLE` or `SPARSE` to avoid taking any more
+space than necessary.
+
+The main objective of the trie storage format is to achieve the
+smallest possible packing (and thus the smallest cache usage and fewest
+disk reads), so we choose the type that results in the smallest
+representation of the node. The `DENSE` type gets chosen naturally when
+its encoding (which avoids storing the character list but may include
+null targets) is smaller than `SPARSE`.
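
The size comparison can be sketched as follows; the byte counts here are simplified assumptions for illustration, not the exact BTI encodings:

```java
// Illustrative size comparison between SPARSE and DENSE encodings; the
// byte-count formulas are simplified assumptions, not the on-disk layout.
public class NodeTypeChooser {
    // SPARSE (assumed): 1 count byte + 1 byte per character + pointers.
    public static int sparseSize(int[] sortedChars, int pointerSize) {
        return 1 + sortedChars.length + sortedChars.length * pointerSize;
    }

    // DENSE (assumed): first + last character bytes + one pointer slot for
    // every character in the covered range, including null transitions.
    public static int denseSize(int[] sortedChars, int pointerSize) {
        int range = sortedChars[sortedChars.length - 1] - sortedChars[0] + 1;
        return 2 + range * pointerSize;
    }

    // Pick whichever encoding is smaller for this node.
    public static String choose(int[] sortedChars, int pointerSize) {
        return denseSize(sortedChars, pointerSize) < sparseSize(sortedChars, pointerSize)
                ? "DENSE" : "SPARSE";
    }
}
```

For the consecutive characters `a`–`d` the `DENSE` form wins; for the scattered pair `a` and `z` the range of null slots makes `SPARSE` smaller.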
+
+## Pointer Sizes
+
+The next optimization we make in the storage format is based on the fact
+that most nodes in the trie are in the lower levels of the tree and thus
+close to leaves. As such, the distance between the node and its target
+transitions when laid out during the construction process is small and
+thus it is a huge win to store pointers as distances with variable size.
+
+This is even more true for the page-aware layout we use &mdash; all internal
+transitions within a page (i.e. >99% of all transitions in the trie!)
+can be stored as an offset within the page, using just 12 bits.
+
+This is heavily used via further specialization of the node types: e.g.
+we have `DENSE_12`, `DENSE_16` to `DENSE_40` as well as `DENSE_LONG`
+subtypes which differ in the size of pointer they use.
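
Selecting a subtype from the distance to the target could look like the following sketch; the bit-width thresholds are inferred from the subtype names, not taken from the implementation:

```java
// Sketch of choosing a pointer subtype from the distance to the target node.
// Subtype names follow the text (DENSE_12 ... DENSE_LONG); the intermediate
// widths (24, 32) and thresholds are assumptions for illustration.
public class PointerSize {
    public static String subtypeFor(long distance) {
        if (distance < (1L << 12)) return "DENSE_12"; // fits a page offset
        if (distance < (1L << 16)) return "DENSE_16";
        if (distance < (1L << 24)) return "DENSE_24";
        if (distance < (1L << 32)) return "DENSE_32";
        if (distance < (1L << 40)) return "DENSE_40";
        return "DENSE_LONG";
    }
}
```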
+
+# Primary indexing in the BTI format
+
+The purpose of an sstable's primary index is to map a key, containing
+partition and clustering components, to a position in the sstable data
+file that holds the relevant row (or the closest row with a greater
+key) and enables iteration of rows from that point on.
+
+Partition keys are normally fully specified, while clustering keys are
+often given partially or via a comparison relation. They are also
+treated differently by all the infrastructure and have historically had
+different index structures; we chose to retain this distinction for the
+time being and implement similar replacement structures using tries.
+
+## Partition index implementation details
+
+The primary purpose of the partition index is to map a specified
+partition key to a row index for the partition. It also needs to support
+iteration from a (possibly partially specified) partition position. The
+description below details mapping only; iteration is a trivial
+application of the trie machinery to the described structure.
+
+In addition to wide partitions, where a row index is mandatory,
+Cassandra is often used for tables whose partitions have only a couple
+of rows, including ones where the partition key is the only component of
+the primary key, i.e. where row and partition are the same thing. In
+these situations it makes no sense to have a row index, and the
+partition index should point directly to the data.
+
+The application of tries to Cassandra's partition index uses the trie
+infrastructure described above to create a trie mapping unique
+byte-ordered partition key prefixes to either:
+
+-   A position in the row index file which contains the index of the
+    rows within that partition, or
+
+-   A position in the data file containing the relevant partition (if a
+    row index for it is not necessary).
+
+A single table can have both indexed and non-indexed partitions. For
+efficiency the partition index stores the position as a single long,
+using its sign bit to differentiate between the two options[^3]. This
+value is stored with variable length &mdash; more precisely, the four
+bits provided in the node type byte store the length of the pointer.
+
+[^3]: It needs to differentiate between 0 with index and 0 without
+index, however, so we use `~pos` instead of `-pos` to encode
+direct-to-data mappings. This still allows sign-extension
+instructions to be used to convert e.g. `int` to `long`.
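
The footnote's scheme can be sketched with hypothetical helpers; `~0 == -1` whereas `-0 == 0`, which is why the complement preserves the distinction:

```java
// Sketch of the footnote's encoding: the bitwise complement (~pos) rather
// than negation (-pos) keeps "position 0 with a row index" distinct from
// "position 0 directly in the data file". Helper names are hypothetical.
public class PositionEncoding {
    public static long encodeRowIndex(long pos) { return pos; }   // stays >= 0
    public static long encodeDataFile(long pos) { return ~pos; }  // always < 0

    public static boolean isDataFile(long encoded) { return encoded < 0; }

    public static long decode(long encoded) {
        return encoded < 0 ? ~encoded : encoded;
    }
}
```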
+
+Lookup in this index is accomplished by converting the decorated
+partition key to its byte-ordered representation and following the
+transitions for its bytes for as long as the trie has them. If at any
+point the trie offers no transition for the next byte and we are not at
+a leaf node, the sstable does not contain a mapping for the given key.
+
+If a leaf of the trie is reached, then the prefix of the partition key
+matches some content in the file, but we are not yet sure if it is a
+full match for the partition key. The leaf node points to a place in the
+row index or data file. In either case the first bytes at the specified
+position contain a serialization of the partition key, which we can
+compare to the key being mapped. If it matches, we have found the
+partition. If not, since the stored prefixes are unique, no data for
+this partition exists in this sstable.
+
+### Efficiency
+
+If everything is in cache this lookup is extremely efficient: it follows
+a few transitions in `DENSE` nodes plus one or two binary searches in
+`SPARSE` or `SINGLE`, and finishes with a direct comparison of a byte buffer
+with contents of a file. No object allocation or deserialization is
+necessary.
+
+If not all data is in cache, the performance of this lookup most heavily
+depends on the number of pages that must be fetched from persistent
+storage. The expectation on which this implementation is based is that
+if an sstable is in use, all non-leaf pages of the index will tend to
+remain cached. If that expectation is met, lookup will only require
+fetching one leaf index page and one data/row index page for the full
+key comparison. On a match the latter fetch is required anyway, since we
+want to read the data at that position.
+
+An important consideration in the design of this feature was to make
+sure there is no situation in which the trie indices perform worse than
+the earlier code, thus we should aim to do at most as many reads. With
+the earlier index implementation, where an index summary is forced in
+memory, the number of random accesses is one _seek_ required to start
+reading from the partition index (though usually multiple consecutive
+pages need to be read), and one seek needed to start reading the actual
+data. Since the index summary ends up being of similar size to the
+non-leaf pages of the trie index, the memory usage and number of seeks
+for the trie index on a match end up being the same, but we read less
+data and do much less processing.
+
+On mismatch, though, we may be making one additional seek. However, we
+can drastically reduce the chance of mismatch, which we currently do in
+two ways:
+
+-   By using a bloom filter before lookup. The chance of getting a bloom
+    filter hit as well as a prefix match for the wrong key is pretty
+    low and gets lower with increasing sstable size.
+
+-   By storing some of the key hash bits that are not part of the token
+    at the payload node and comparing them with the mapped key's hash
+    bits.
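
The second mechanism can be sketched as follows; the bit count and hash derivation here are assumptions for illustration (the real index derives the stored bits from the key hash outside the token):

```java
import java.util.Arrays;

// Sketch (assumed scheme): store a few hash bits of the key alongside the
// payload; mismatching bits let us skip the expensive key fetch entirely.
public class HashBitsFilter {
    static final int BITS = 8;   // assumed number of stored hash bits

    public static int storedBits(byte[] key) {
        // hypothetical hash for illustration only
        return Arrays.hashCode(key) & ((1 << BITS) - 1);
    }

    // Returns false only when the keys are certainly different; a match may
    // still be a false positive, resolved by the full key comparison.
    public static boolean mightMatch(int stored, byte[] probeKey) {
        return stored == storedBits(probeKey);
    }
}
```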
+
+Currently we use a combination of both by default as the best performing
+option. The user can disable or choose to have a smaller bloom filter,
+and the code also supports indices that do not contain hash bits (though
+to reduce configuration complexity we do not have plans to expose that
+option).
+
+For fully cold sstables we have to perform more random fetches from disk
+than the earlier implementation, but we read less data. Testing showed
+that having a bloom filter is enough to make the trie index faster; if a
+bloom filter is not present, we try reading through the byte contents of
+the index file on boot to prefetch it, which ends up not taking too long
+(since it is read sequentially rather than randomly) and boosts cold
+performance dramatically.
+
+### Building and early open
+
+The partition index is built using the page-aware incremental
+construction described earlier, where we also delay writing each key
+until we have seen the next so that we can find the shortest prefix that
+is enough to differentiate it from the previous and next keys (this also
+differentiates it from all others in the sstable because the contents
+are sorted). Only that prefix is written to the trie.
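
The prefix selection can be sketched as below: any prefix strictly longer than the key's longest common prefix with both of its sorted neighbours is unique within the sstable (helper names are hypothetical):

```java
import java.util.Arrays;

// Sketch of the prefix-trimming rule: since the keys are sorted, a prefix
// one byte longer than the longest common prefix with either neighbour is
// unique among all keys in the sstable.
public class UniquePrefix {
    static int commonPrefixLength(byte[] a, byte[] b) {
        int i = 0;
        while (i < a.length && i < b.length && a[i] == b[i]) i++;
        return i;
    }

    // prev and next may be null at the ends of the sstable.
    public static byte[] shortestUniquePrefix(byte[] prev, byte[] key, byte[] next) {
        int len = 0;
        if (prev != null) len = Math.max(len, commonPrefixLength(prev, key));
        if (next != null) len = Math.max(len, commonPrefixLength(key, next));
        return Arrays.copyOf(key, Math.min(key.length, len + 1));
    }
}
```

For the sorted keys "the", "this", "to", the middle key shares "th" with its predecessor and "t" with its successor, so the prefix "thi" suffices to represent it.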
+
+One last complication is the support for early opening of sstables which
+allows newly-compacted tables to gradually occupy the page cache. Though
+the index building is incremental, the partially-written trie is not
+usable directly because the root of the trie as well as the path from it
+to the last written nodes is not yet present in the file.
+
+This problem can be easily overcome, though, by dumping these
+intermediate nodes to an in-memory buffer (without the need for
+page-aware packing) and forming an index by attaching this buffer at the
+end of the partially written file using 
+[`TailOverridingRebufferer`](../../../util/TailOverridingRebufferer.java).
+
+## Row index implementation details
+
+Unlike the partition index, the main use of the row index is to iterate
+from a given clustering key in forward or reverse direction (where exact
+key lookup is just a special case).
+
+Rows are often very small (they could contain a single int or no columns
+at all) and thus there is a real possibility for the row indices to
+become bigger than the data they represent. This is not a desirable
+outcome, which is part of the reason why Cassandra's row index has
+historically operated on blocks of rows rather than indexing every row
+in the partition. This is a concern we also have with the trie-based
+index, thus we also index blocks of rows (by default, a block of rows
+that is at least 16kb in size &mdash; this will be called the index
+_granularity_ below, specified by the `column_index_size_in_kb`

Review Comment:
   ```suggestion
   _granularity_ below, specified by the `column_index_size`
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

