Ivan Bessonov created IGNITE-16933:
--------------------------------------
Summary: PageMemory-based MV storage implementation
Key: IGNITE-16933
URL: https://issues.apache.org/jira/browse/IGNITE-16933
Project: Ignite
Issue Type: New Feature
Reporter: Ivan Bessonov
Similar to IGNITE-16611, we need an MV-storage implementation for page memory
storage engine. Currently, I expect only row storage implementation, without
primary or secondary indexes.
Here I'm going to describe a data format. Each row is stored as a versioned
chain. It will be represented by a number of data entries that will have
references to each other.
{code:java}
[ Timestamp | NextLink | PayloadSize | Payload ]{code}
* Timestamp is a 16 bytes value derived from
{{org.apache.ignite.internal.tx.Timestamp}} instance.
* NextLink is a link to the next element in the chain or a NULL_LINK (or any
other convenient name). It's a long value in standard format for Page Memory
links (itemId, flag, partitionId, pageIdx). Technically, partition id is not
needed here, because it's always the same. Removing it could allow us to save 2
bytes per chain element.
* PayloadSize is a 4-bytes integer value that gives us the size of actual data
in arbitrary format.
* Payload - I expect it to be a serialized BinaryRow data. This is how it's
implemented in RocksDB right now.
For uncommitted (pending) entries I propose using maximal possible timestamp -
{{{}(Long.MAX_VALUE, Long.MAX_VALUE){}}}. This will simplify things. Note that
we never store tx id in chain itself.
Overall, every chain element will have a (16 + 6 + 4 = 26) bytes header. It
should be used as a header size in corresponding FreeList.
There's a requirement to have an immutable RowId for every versioned chain. One
could argue that we should just make chain head immutable, but it would result
in lots of complications. It's better to have a separate structure with
immutable link, that will point to an actual head of the versioned chain.
{code:java}
[ TransactionId | HeadLink | NextLink ]{code}
* TransactionId is a UUID. Can only be applied to pending entries. For
committed head I propose storing 16 zeroes.
* HeadLink is a link to the chain's head. Either 8 or 6 bytes. As already
mentioned, I'd prefer 6.
* NextLink is a "NextLink" value from the head chain element. It's a cheap
shortcut for read-only transactions, you can skip uncommitted entry without
even trying to read it, if there's a non-null transaction id. Debatable, I
know, but looks cheap enough.
In total, RowId is a 8 bytes link, pointing to a structure that has (16 + 6 + 6
= 28) bytes of data. There must e a separate FreeList for every partition even
in In-Memory mode for reasons that I'll give later. "Header" size in that list
must be equal to these 28 bytes. I wander how effective FreeList will be for
this case, where every chunk has the same size. We'll see. Maybe we should
adjust a number of buckets somehow.
Now, the fun part. There's no mention of B+Tree here. That's because we can
probably just avoid it. If it existed, it would just point RowId to a described
RowId structure in partition, but RowId is already a pointer itself. The only
other problem that is usually solved by a tree-like structure is a full-scan of
all rows in partition. This is useful when you need to rebuild indexes, for
example.
We should keep in mind that there's no code yet for rebuilding indexes. On the
other hand, there's a method for partition scan in the API. It could be used to
implement a Primary Index imitation until we have a real implementation.
There's not FreeList full-scan currently in the code, it needs to be
implemented. And, this particular full-scan is the reason why every partition
should have its own list of row ids.
There's also a chance that introducing new flag for row ids might be
convenient. I don't know yet, let's not do it for now.
Finally, we need an adequate protection from assertions if we, for some reason,
have invalid row id. Things that can be checked be a normal code, not assertion:
* data page type
* number of items in the page
--
This message was sent by Atlassian Jira
(v8.20.7#820007)