[jira] [Commented] (IGNITE-16933) PageMemory-based MV storage implementation

Roman Puchkovskiy (Jira) Wed, 01 Jun 2022 10:28:04 -0700


    [ 
https://issues.apache.org/jira/browse/IGNITE-16933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17545049#comment-17545049
 ]


Roman Puchkovskiy commented on IGNITE-16933:
--------------------------------------------

Thank you for your review!

> PageMemory-based MV storage implementation
> ------------------------------------------
>
>                 Key: IGNITE-16933
>                 URL: https://issues.apache.org/jira/browse/IGNITE-16933
>             Project: Ignite
>          Issue Type: New Feature
>            Reporter: Ivan Bessonov
>            Assignee: Roman Puchkovskiy
>            Priority: Major
>              Labels: ignite-3
>             Fix For: 3.0.0-alpha5
>
>          Time Spent: 8h 50m
>  Remaining Estimate: 0h
>
> Similar to IGNITE-16611, we need an MV-storage implementation for page memory 
> storage engine. Currently, I expect only row storage implementation, without 
> primary or secondary indexes.
> h2. Chain Structure
> Here I'm going to describe a data format. Each row is stored as a versioned 
> chain. It will be represented by a number of data entries that will have 
> references to each other.
> {code:java}
> [ Timestamp | NextLink | PayloadSize | Payload ]{code}
>  * Timestamp is a 16 bytes value derived from 
> {{org.apache.ignite.internal.tx.Timestamp}} instance. It represents a commit 
> time of corresponding row.
>  * NextLink is a link to the next element in the chain or a NULL_LINK (or any 
> other convenient name). It's a long value in standard format for Page Memory 
> links (itemId, flag, partitionId, pageIdx). Technically, partition id is not 
> needed here, because it's always the same. Removing it could allow us to save 
> 2 bytes per chain element.
>  * PayloadSize is a 4-bytes integer value that gives us the size of actual 
> data in arbitrary format.
>  * Payload - I expect it to be a serialized BinaryRow data. This is how it's 
> implemented in RocksDB right now.
> For uncommitted (pending) entries I propose using maximal possible timestamp 
> - {{{}(Long.MAX_VALUE, Long.MAX_VALUE){}}}. This will simplify things. Note 
> that we never store tx id in chain itself.
> Overall, every chain element will have a (16 + 6 + 4 = 26) bytes header. It 
> should be used as a header size in corresponding FreeList.
> h2. RowId pointer
> There's a requirement to have an immutable RowId for every versioned chain. 
> One could argue that we should just make chain head immutable, but it would 
> result in lots of complications. It's better to have a separate structure 
> with immutable link, that will point to an actual head of the versioned chain.
> {code:java}
> [ TransactionId | HeadLink | NextLink ]{code}
>  * TransactionId is a UUID. Can only be applied to pending entries. For 
> committed head I propose storing 16 zeroes.
>  * HeadLink is a link to the chain's head. Either 8 or 6 bytes. As already 
> mentioned, I'd prefer 6.
>  * NextLink is a "NextLink" value from the head chain element. It's a cheap 
> shortcut for read-only transactions, you can skip uncommitted entry without 
> even trying to read it, if there's a non-null transaction id. Debatable, I 
> know, but looks cheap enough.
> In total, RowId is a 8 bytes link, pointing to a structure that has (16 + 6 + 
> 6 = 28) bytes of data. There must e a separate FreeList for every partition 
> even in In-Memory mode for reasons that I'll give later. "Header" size in 
> that list must be equal to these 28 bytes. I wonder how effective FreeList 
> will be for this case, where every chunk has the same size. We'll see. Maybe 
> we should adjust a number of buckets somehow.
> h2. Data access and Full Scan
> Now, the fun part. There's no mention of B+Tree here. That's because we can 
> probably just avoid it. If it existed, it would just point RowId to a 
> described RowId structure in partition, but RowId is already a pointer 
> itself. The only other problem that is usually solved by a tree-like 
> structure is a full-scan of all rows in partition. This is useful when you 
> need to rebuild indexes, for example.
> We should keep in mind that there's no code yet for rebuilding indexes. On 
> the other hand, there's a method for partition scan in the API. This code 
> could be used instead of Primary Index until we have it implemented.
> There's not FreeList full-scan currently in the code, it needs to be 
> implemented. And, this particular full-scan is the reason why every partition 
> should have its own list of row ids.
> There's also a chance that introducing new flag for row ids might be 
> convenient. I don't know yet, let's not do it for now.
> Finally, we need an adequate protection from assertions if we, for some 
> reason, have invalid row id. Things that can be checked be a normal code, not 
> assertion:
>  * data page type
>  * number of items in the page



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

[jira] [Commented] (IGNITE-16933) PageMemory-based MV storage implementation

Reply via email to