[
https://issues.apache.org/jira/browse/IGNITE-16933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17545049#comment-17545049
]
Roman Puchkovskiy commented on IGNITE-16933:
--------------------------------------------
Thank you for your review!
> PageMemory-based MV storage implementation
> ------------------------------------------
>
> Key: IGNITE-16933
> URL: https://issues.apache.org/jira/browse/IGNITE-16933
> Project: Ignite
> Issue Type: New Feature
> Reporter: Ivan Bessonov
> Assignee: Roman Puchkovskiy
> Priority: Major
> Labels: ignite-3
> Fix For: 3.0.0-alpha5
>
> Time Spent: 8h 50m
> Remaining Estimate: 0h
>
> Similar to IGNITE-16611, we need an MV-storage implementation for page memory
> storage engine. Currently, I expect only row storage implementation, without
> primary or secondary indexes.
> h2. Chain Structure
> Here I'm going to describe a data format. Each row is stored as a versioned
> chain. It will be represented by a number of data entries that will have
> references to each other.
> {code:java}
> [ Timestamp | NextLink | PayloadSize | Payload ]{code}
> * Timestamp is a 16 bytes value derived from
> {{org.apache.ignite.internal.tx.Timestamp}} instance. It represents a commit
> time of corresponding row.
> * NextLink is a link to the next element in the chain or a NULL_LINK (or any
> other convenient name). It's a long value in standard format for Page Memory
> links (itemId, flag, partitionId, pageIdx). Technically, partition id is not
> needed here, because it's always the same. Removing it could allow us to save
> 2 bytes per chain element.
> * PayloadSize is a 4-bytes integer value that gives us the size of actual
> data in arbitrary format.
> * Payload - I expect it to be a serialized BinaryRow data. This is how it's
> implemented in RocksDB right now.
> For uncommitted (pending) entries I propose using maximal possible timestamp
> - {{{}(Long.MAX_VALUE, Long.MAX_VALUE){}}}. This will simplify things. Note
> that we never store tx id in chain itself.
> Overall, every chain element will have a (16 + 6 + 4 = 26) bytes header. It
> should be used as a header size in corresponding FreeList.
> h2. RowId pointer
> There's a requirement to have an immutable RowId for every versioned chain.
> One could argue that we should just make chain head immutable, but it would
> result in lots of complications. It's better to have a separate structure
> with immutable link, that will point to an actual head of the versioned chain.
> {code:java}
> [ TransactionId | HeadLink | NextLink ]{code}
> * TransactionId is a UUID. Can only be applied to pending entries. For
> committed head I propose storing 16 zeroes.
> * HeadLink is a link to the chain's head. Either 8 or 6 bytes. As already
> mentioned, I'd prefer 6.
> * NextLink is a "NextLink" value from the head chain element. It's a cheap
> shortcut for read-only transactions, you can skip uncommitted entry without
> even trying to read it, if there's a non-null transaction id. Debatable, I
> know, but looks cheap enough.
> In total, RowId is a 8 bytes link, pointing to a structure that has (16 + 6 +
> 6 = 28) bytes of data. There must e a separate FreeList for every partition
> even in In-Memory mode for reasons that I'll give later. "Header" size in
> that list must be equal to these 28 bytes. I wonder how effective FreeList
> will be for this case, where every chunk has the same size. We'll see. Maybe
> we should adjust a number of buckets somehow.
> h2. Data access and Full Scan
> Now, the fun part. There's no mention of B+Tree here. That's because we can
> probably just avoid it. If it existed, it would just point RowId to a
> described RowId structure in partition, but RowId is already a pointer
> itself. The only other problem that is usually solved by a tree-like
> structure is a full-scan of all rows in partition. This is useful when you
> need to rebuild indexes, for example.
> We should keep in mind that there's no code yet for rebuilding indexes. On
> the other hand, there's a method for partition scan in the API. This code
> could be used instead of Primary Index until we have it implemented.
> There's not FreeList full-scan currently in the code, it needs to be
> implemented. And, this particular full-scan is the reason why every partition
> should have its own list of row ids.
> There's also a chance that introducing new flag for row ids might be
> convenient. I don't know yet, let's not do it for now.
> Finally, we need an adequate protection from assertions if we, for some
> reason, have invalid row id. Things that can be checked be a normal code, not
> assertion:
> * data page type
> * number of items in the page
--
This message was sent by Atlassian Jira
(v8.20.7#820007)