[GitHub] [cassandra] maedhroz commented on a diff in pull request #1723: CASSANDRA-17240: Trie memtable

GitBox Thu, 29 Sep 2022 12:11:07 -0700


maedhroz commented on code in PR #1723:
URL: https://github.com/apache/cassandra/pull/1723#discussion_r983930676



##########
src/java/org/apache/cassandra/db/tries/MemtableReadTrie.java:
##########
@@ -0,0 +1,921 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.cassandra.db.tries;
+
+import java.util.Arrays;
+import java.util.concurrent.atomic.AtomicReferenceArray;
+import java.util.function.Function;
+
+import org.agrona.concurrent.UnsafeBuffer;
+import org.apache.cassandra.utils.bytecomparable.ByteSource;
+import org.apache.cassandra.utils.bytecomparable.ByteComparable;
+
+/**
+ * Memtable trie, i.e. an in-memory trie built for fast modification and reads 
executing concurrently with writes from
+ * a single mutator thread.
+ *
+ * This class provides the read-only functionality, expanded in {@link 
MemtableTrie} to writes.
+ */
+public class MemtableReadTrie<T> extends Trie<T>
+{
+    /*
+    TRIE FORMAT AND NODE TYPES
+
+    The memtable trie uses five different types of nodes:
+     - "leaf" nodes, which have content and no children;
+     - single-transition "chain" nodes, which have exactly one child; while 
each node is a single transition, they are
+       called "chain" because multiple such transition are packed in a block.
+     - "sparse" nodes which have between two and six children;
+     - "split" nodes for anything above six children;
+     - "prefix" nodes that augment one of the other types (except leaf) with 
content.
+
+    The data for all nodes except leaf ones is stored in a contiguous 'node 
buffer' and laid out in blocks of 32 bytes.
+    A block only contains data for a single type of node, but there is no 
direct correspondence between block and node
+    in that:
+     - a single block can contain multiple "chain" nodes.
+     - a sparse node occupies exactly one block.
+     - a split node occupies a variable number of blocks.
+     - a prefix node can be placed in the same block as the node it augments, 
or in a separate block.
+
+    Nodes are referenced in that buffer by an integer position/pointer, the 
'node pointer'. Note that node pointers are
+    not pointing at the beginning of blocks, and we call 'pointer offset' the 
offset of the node pointer to the block it
+    points into. The value of a 'node pointer' is used to decide what kind of 
node is pointed:
+
+     - If the pointer is negative, we have a leaf node. Since a leaf has no 
children, we need no data outside of its
+       content to represent it, and that content is stored in a 'content 
list', not in the nodes buffer. The content
+       of a particular leaf node is located at the ~pointer position in the 
content list (~ instead of - so that -1 can
+       correspond to position 0).
+
+     - If the 'pointer offset' is smaller than 28, we have a chain node with 
one transition. The transition character is
+       the byte at the position pointed in the 'node buffer', and the child is 
pointed by:
+       - the integer value at offset 28 of the block pointed if the 'pointer 
offset' is 27
+       - pointer + 1 (which is guaranteed to have offset smaller than 28, i.e. 
to be a chain node), otherwise
+       In other words, a chain block contains a sequence of characters that 
leads to the child whose address is at
+       offset 28. It may have between 1 and 28 characters depending on the 
pointer with which the block is entered.
+
+     - If the 'pointer offset' is 30, we have a sparse node. The data of a 
sparse node occupies a full block and is laid
+       out as:
+       - six pointers to children at offsets 0 to 24
+       - six transition characters at offsets 24 to 30
+       - an order word stored in the two bytes at offset 30
+       To enable in-place addition of children, the pointers and transition 
characters are not stored ordered.
+       Instead, we use an order encoding in the last 2 bytes of the node. The 
encoding is a base-6 number which
+       describes the order of the transitions (least significant digit being 
the smallest).
+       The node must have at least two transitions and the transition at 
position 0 is never the biggest (we can
+       enforce this by choosing for position 0 the smaller of the two 
transitions a sparse node starts with). This
+       allows iteration over the order word (which divides said word by 6 each 
step) to finish when the result becomes 0.
+
+     - If the 'pointer offset' is 28, the node is a split one. Split nodes are 
dense, meaning that there is a direct
+       mapping between a transition character and the address of the 
associated pointer, and new children can easily be
+       added in place.
+       Split nodes occupy multiple blocks, and a child is located by 
traversing 3 layers of pointers:
+       - the first pointer is within the top-level block (the one pointed by 
the pointer) and points to a "mid" block.
+         The top-level block has 4 such pointers to "mid" block, located 
between offset 16 and 32.
+       - the 2nd pointer is within the "mid" block and points to a "tail" 
block. A "mid" block has 8 such pointers
+         occupying the whole block.
+       - the 3rd pointer is with the "tail" block and is the actual child 
pointer. Like "mid" block, there are 8 such
+         pointers (so we finally address 4 * 8 * 8 = 256 children).
+       To find a child, we thus need to know the index of the pointer to 
follow within the top-level block, the index
+       of the one in the "mid" block and the index in the "tail" block. For 
that, we split the transition byte in a
+       sequence of 2-3-3 bits:
+       - the first 2 bits are the index in the top-level block;
+       - the next 3 bits, the index in the "mid" block;
+       - and the last 3 bits the index in the "tail" block.
+       This layout allows the node to use the smaller fixed-size blocks 
(instead of 256*4 bytes for the whole character
+       space) and also leaves some room in the head block (the 16 first bytes) 
for additional information (which we can
+       use to store prefix nodes containing things like deletion times).
+       One split node may need up to 1 + 4 + 4*8 blocks (1184 bytes) to store 
all its children.
+
+     - If the pointer offset is 31, we have a prefix node. These are two types:
+       -- Embedded prefix nodes occupy the free bytes in a chain or split 
node. The byte at offset 4 has the offset
+          within the 32-byte block for the augmented node.
+       -- Full prefix nodes have 0xFF at offset 4 and a pointer at 28, 
pointing to the augmented node.
+       Both types contain an index for content at offset 0. The augmented node 
cannot be a leaf or NONE -- in the former
+       case the leaf itself contains the content index, in the latter we use a 
leaf instead.
+       The term "node" when applied to these is a bit of a misnomer as they 
are not presented as separate nodes during
+       traversals. Instead, they augment a node, changing only its content. 
Internally we create a Node object for the
+       augmented node and wrap a PrefixNode around it, which changes the 
`content()` method and routes all other
+       calls to the augmented node's methods.
+
+     When building a trie we first allocate the content, then create a chain 
node leading to it. While we only have
+     single transitions leading to a chain node, we can expand that node 
(attaching a character and using pointer - 1)
+     instead of creating a new one. When a chain node already has a child and 
needs a new one added we change the type
+     (i.e. create a new node and remap the parent) to sparse with two 
children. When a six-child sparse node needs a new
+     child, we switch to split.
+
+     Blocks currently are not reused, because we do not yet have a mechanism 
to tell when readers are done with blocks
+     they are referencing. This currently causes a very low overhead (because 
we change data in place with the only
+     exception of nodes needing to change type) and is planned to be addressed 
later.
+
+     For further descriptions and examples of the mechanics of the trie, see 
MemtableTrie.md.
+     */
+
+    static final int BLOCK_SIZE = 32;
+
+    // Biggest block offset that can contain a pointer.
+    static final int LAST_POINTER_OFFSET = BLOCK_SIZE - 4;
+
+    /*
+     Block offsets used to identify node types (by comparing them to the node 
'pointer offset').
+     */
+
+    // split node (dense, 2-3-3 transitions), laid out as 4 pointers to "mid" 
block, with has 8 pointers to "tail" block,
+    // which has 8 pointers to children
+    static final int SPLIT_OFFSET = BLOCK_SIZE - 4;
+    // sparse node, unordered list of up to 6 transition, laid out as 6 
transition pointers followed by 6 transition
+    // bytes. The last two bytes contain an ordering of the transitions (in 
base-6) which is used for iteration. On
+    // update the pointer is set last, i.e. during reads the node may show 
that a transition exists and list a character
+    // for it, but pointer may still be null.
+    static final int SPARSE_OFFSET = BLOCK_SIZE - 2;
+    // min and max offset for a chain node. A block of chain node is laid out 
as a pointer at LAST_POINTER_OFFSET,
+    // preceded by characters that lead to it. Thus a full chain block 
contains BLOCK_SIZE-4 transitions/chain nodes.
+    static final int CHAIN_MIN_OFFSET = 0;
+    static final int CHAIN_MAX_OFFSET = BLOCK_SIZE - 5;
+    // Prefix node, an intermediate node augmenting its child node with 
content.
+    static final int PREFIX_OFFSET = BLOCK_SIZE - 1;
+
+    /*
+     Offsets and values for navigating in a block for particular node type. 
Those offsets are 'from the node pointer'
+     (not the block start) and can be thus negative since node pointers points 
towards the end of blocks.
+     */
+
+    // Limit for the starting cell / sublevel (2 bits -> 4 pointers).
+    static final int SPLIT_START_LEVEL_LIMIT = 4;
+    // Limit for the other sublevels (3 bits -> 8 pointers).
+    static final int SPLIT_OTHER_LEVEL_LIMIT = 8;
+    // Bitshift between levels.
+    static final int SPLIT_LEVEL_SHIFT = 3;
+
+    static final int SPARSE_CHILD_COUNT = 6;
+    // Offset to the first child pointer of a spare node (laid out from the 
start of the block)
+    static final int SPARSE_CHILDREN_OFFSET = 0 - SPARSE_OFFSET;

Review Comment:
   ```suggestion
       static final int SPARSE_CHILDREN_OFFSET = -SPARSE_OFFSET;
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] [cassandra] maedhroz commented on a diff in pull request #1723: CASSANDRA-17240: Trie memtable

Reply via email to