[
https://issues.apache.org/jira/browse/IGNITE-18595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Ivan Bessonov updated IGNITE-18595:
-----------------------------------
Description:
During the full state transfer there is no source of information about the schema
versions associated with individual inserts. The core idea of the full rebalance is
that all versions of all rows are sent, while indexes are rebuilt locally on the
consumer.
This is unfortunate. Why, you may ask.
Imagine the following situation:
* time T1: table A with index X is created
* time T2: user uploads the data
* time T3: user drops index X
* time T4: “clean” node N enters topology and downloads data via full
rebalance procedure
* time T5: N becomes a leader and starts receiving (already running) RO
transactions with timestamps T, where T2 < T < T3
Ideally, index X should be available at timestamp T. Once an index is available,
it can’t suddenly become unavailable without an explicit rebuild request from the
user (I guess).
The LATEST schema version at the moment of rebalance must be known. That’s
unavoidable and makes total sense. The first idea that comes to mind is updating
all Registered and Available indexes. A situation where an index contains more
indexed rows than it needs is acceptable: scan queries only return indexed rows
that match the corresponding value in the partition MV store. The real problem
would be having less data than required.
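To illustrate why extra index entries are harmless, here is a minimal sketch (hypothetical names, not Ignite’s actual scan code): every candidate produced by the index is validated against the current value in the partition MV store before it is returned, so stale entries are simply skipped.

```java
import java.util.*;
import java.util.function.Function;

// Hypothetical sketch: an index scan yields candidate (rowId, indexedValue)
// pairs; each candidate is re-checked against the row's current value in the
// partition MV store, so leftover entries for rewritten rows are filtered out.
final class IndexScan {
    record Candidate(long rowId, String indexedValue) {}

    static List<Long> scan(List<Candidate> indexEntries, Function<Long, String> mvStoreValue) {
        List<Long> result = new ArrayList<>();
        for (Candidate c : indexEntries) {
            // Keep the candidate only if the MV store still holds the indexed value.
            if (c.indexedValue().equals(mvStoreValue.apply(c.rowId()))) {
                result.add(c.rowId());
            }
        }
        return result;
    }
}
```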
The approach described above is not quite correct as stated. Consider that each
BinaryRow has a schema version, which defines the set of columns in the table at
the moment of the update. Not all row versions are compatible with all indexes.
For example, you cannot put data into an index if one of its columns has been
deleted. On the other hand, you can put data into the index if a column has not
yet been created (assuming it has a default value). In both cases the column is
missing from the row version, but the outcome is very different.
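This distinction can be sketched as follows (hypothetical names, not Ignite’s actual API): a missing indexed column blocks indexing only if it was dropped by the row’s schema version; a column that simply does not exist yet is indexable through its default value.

```java
import java.util.Set;

// Hypothetical sketch of the compatibility rule; not Ignite's API.
// rowColumns:   columns present in the BinaryRow's schema version
// droppedSoFar: columns that existed in some schema version <= the row's
//               version but are absent from it (i.e. they were deleted)
// indexColumns: columns covered by the index
final class IndexCompatibility {
    static boolean canIndex(Set<String> rowColumns, Set<String> droppedSoFar, Set<String> indexColumns) {
        for (String c : indexColumns) {
            // Missing because deleted: no value and no default to fall back on.
            if (!rowColumns.contains(c) && droppedSoFar.contains(c)) {
                return false;
            }
            // Missing because not yet created: indexable via the default value.
        }
        return true;
    }
}
```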
This fact has some implications: the set of indexes to update depends on the row
version of every particular row. I propose calculating it as the set of all
indexes from a {_}maximal continuous range of DB schemas{_} that (if not empty)
starts with the earliest known schema, where every schema in the range either
_contains all indexed columns_ or is only missing columns that have not yet been
created (so their default values can be used); the range ends right before the
first schema in which an indexed column is dropped.
For example, there’s a table T:
||DB schema version||Table columns||
|1|PK, A|
|2|PK, A, B|
|3 (LATEST)|PK, B|
In this configuration, the ranges would be:
||Index columns||Schemas range||
|A|[1 ... 2]|
|B|[1 ... 3]|
|A, B|[1 ... 2]|
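The range computation above can be sketched like this (hypothetical code, not Ignite’s implementation): walk the schema versions from the earliest, remember which columns have ever existed, and stop right before the first version where a previously existing indexed column is gone.

```java
import java.util.*;

// Hypothetical sketch: schemas.get(i) is the column set of DB schema version i+1.
// An index over `indexCols` covers the contiguous range [1 .. R], where R is
// the last version before any indexed column is dropped. A column that does not
// exist *yet* is fine -- it can be indexed through its default value.
final class SchemaRanges {
    static int[] range(List<Set<String>> schemas, Set<String> indexCols) {
        int end = 0;
        Set<String> everSeen = new HashSet<>();
        for (int v = 0; v < schemas.size(); v++) {
            Set<String> cols = schemas.get(v);
            // An indexed column is "dropped" at version v+1 if it existed in an
            // earlier version but is gone now.
            boolean dropped = false;
            for (String c : indexCols) {
                if (everSeen.contains(c) && !cols.contains(c)) {
                    dropped = true;
                    break;
                }
            }
            if (dropped) {
                break;
            }
            everSeen.addAll(cols);
            end = v + 1;
        }
        return end == 0 ? new int[0] : new int[] {1, end};
    }
}
```

Run against the example table T, this reproduces the ranges above: [1 ... 2] for A, [1 ... 3] for B, and [1 ... 2] for (A, B).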
was:tbd
> Implement index build process during the full state transfer
> ------------------------------------------------------------
>
> Key: IGNITE-18595
> URL: https://issues.apache.org/jira/browse/IGNITE-18595
> Project: Ignite
> Issue Type: Improvement
> Reporter: Ivan Bessonov
> Priority: Major
> Labels: ignite-3
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)