[ https://issues.apache.org/jira/browse/IGNITE-13976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17270245#comment-17270245 ]
Ilya Kasnacheev commented on IGNITE-13976:
------------------------------------------
I will document a list of open questions and problems with the implementation
of WAL enable/disable:
* CacheGroupDescriptor has a volatile walEnabled field. CacheGroupDescriptor
describes the configuration of a cache group at some moment in time, for example
on node join. It is not topology-versioned. Therefore, if
CacheGroupDescriptor.walEnabled is "false", you have no idea at which topology
version that value became effective, and when you are asked to change it, you
have no guarantee that it is in sync with the rest of the cluster.
* CacheGroupContext also has localWalEnabled and globalWalEnabled. I'm not sure
what the difference is. In any case, CacheGroupDescriptor.walEnabled is changed
in a different place and a different thread than
CacheGroupContext.globalWalEnabled: the former is set as soon as it is decided
that the WAL change request should be handled, the latter only when the relevant
checkpoint finishes. This means a node may have already set
CacheGroupDescriptor.walEnabled to false, and may already be sending this value
to new nodes trying to join, while it has not yet set
CacheGroupContext.globalWalEnabled to false because it is still waiting on the
checkpoint to do so. A stripped-down illustration of this window follows below.
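To make the window concrete, here is a standalone illustration - this is not
Ignite code, the field and method names are made up, and the sleeps merely stand
in for "request accepted" versus "checkpoint finished":
{code}
public class WalFlagRaceSketch {
    // Made-up stand-ins for CacheGroupDescriptor.walEnabled
    // and CacheGroupContext.globalWalEnabled.
    static volatile boolean descriptorWalEnabled = true;
    static volatile boolean contextGlobalWalEnabled = true;

    public static void main(String[] args) throws Exception {
        Thread walDisableRequest = new Thread(() -> {
            descriptorWalEnabled = false;    // flipped as soon as the request is accepted
            sleep(100);                      // stand-in for "wait for the checkpoint"
            contextGlobalWalEnabled = false; // flipped only after the checkpoint
        });

        walDisableRequest.start();

        sleep(50); // a "joining node" observes the state inside the window

        System.out.println("descriptor says WAL enabled = " + descriptorWalEnabled
            + ", context says WAL enabled = " + contextGlobalWalEnabled);

        walDisableRequest.join();
    }

    static void sleep(long ms) {
        try {
            Thread.sleep(ms);
        }
        catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
{code}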
* Enabling/disabling WAL triggers a checkpoint _on client nodes_. Despite that,
there is separate logic for server nodes, and I'm not sure what it does.
* When a persistent node starts, it reads its current WAL disabled/enabled
status into the maps initiallyLocalWalDisabledGrps/initiallyGlobalWalDisabledGrps.
It does not, however, clear values from these maps when a cache is destroyed.
This means that when you recreate such a cache, the node will still reuse the old
"WAL disabled" value for it, and you may get a cache that is in an inconsistent
state from the start (see the sketch below). The very existence of these maps
outside of CacheGroupContext or CacheGroupDescriptor is a liability. I have tried
sidestepping it in IGNITE-14039.
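A standalone sketch of that staleness - again not Ignite code, the map and
method names here are invented just to show the shape of the problem:
{code}
import java.util.HashMap;
import java.util.Map;

public class StaleWalDisabledMapSketch {
    // Invented stand-in for initiallyGlobalWalDisabledGrps, keyed by cache group id.
    static final Map<Integer, Boolean> initiallyWalDisabled = new HashMap<>();

    public static void main(String[] args) {
        int grpId = 42;

        // Populated once, when the node starts, from the persisted WAL state.
        initiallyWalDisabled.put(grpId, true);

        destroyCache(grpId); // the map entry is NOT removed here

        // The group is recreated with the same id: it starts with WAL considered
        // disabled even though the new cache never asked for that.
        boolean walDisabledOnRecreate = initiallyWalDisabled.getOrDefault(grpId, false);

        System.out.println("recreated group starts with WAL disabled = " + walDisabledOnRecreate);
    }

    static void destroyCache(int grpId) {
        // Nothing calls initiallyWalDisabled.remove(grpId) - that is the problem.
    }
}
{code}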
* CacheGroupDescriptor has a list of WAL change requests. I don't understand why
we need a list of requests per cache: every one of them involves a PME, so at any
moment you need at most one WAL change per cache - either you need to toggle it,
or you leave it as it is.
* #onCachesInfoCollected will try to process the last WAL change request; I'm not
sure why it discards the rest. For this last request it issues a "changed true"
or "changed false" verdict, presumably to notify the rest of the nodes in the
cluster. However, it does not make sense to fail the operation here just because
a newcomer node ignored some of the WAL change requests and responded false only
to the last one.
* Did I already mention that it does not make sense to have more than one WAL
change request per cache, or to change any of these flags outside of the PME
blocking operation? If you do it inside the PME lock, you can rely on all nodes
doing it in sync.
* On node startup, CacheGroupDescriptor.walEnabled is received from peer nodes,
representing the WAL status at some unknown moment in time. That is already bad,
and it is compounded by the fact that this walEnabled flag is never propagated to
CacheGroupContext. So the cache will start in "WAL enabled" mode regardless of
the current cluster-wide WAL mode for this cache (a sketch of a possible fix is
below).
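One possible direction, sketched only - I am not claiming these are the exact
accessors or the right place to call them - would be to seed the context flag
from the descriptor when the group starts:
{code}
// Sketch only: accessor names and the call site are assumptions, not verified code.
private void onCacheGroupStart(CacheGroupDescriptor grpDesc, CacheGroupContext grpCtx) {
    // Seed the runtime flag from whatever the cluster reported for this group,
    // instead of silently starting in "WAL enabled" mode.
    grpCtx.globalWalEnabled(grpDesc.walEnabled());
}
{code}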
* CacheGroupDescriptor.walEnabled and CacheGroupContext.globalWalEnabled are
checked in seemingly random, different places while a WAL state change request is
handled. This is not documented anywhere.
* WalStateManager produces the following warning: "Received finish message for
unknown operation (will ignore)". However, it does not actually ignore anything
after this warning and continues the operation as usual.
* There's code which blindly toggles the "WAL enabled" value without checking
that, at this specific point, it will end up set to the desired value (a safer
shape is sketched after the failure scenarios below):
{code}
if (msg.changed())
    grpDesc.walEnabled(!grpDesc.walEnabled());
{code}
I observe at least four failure scenarios for this code:
- Enabling or disabling WAL while a baseline node is down: when the node comes
back, it is in an incorrect WAL state. This is checked by a sequential test and
always fails embarrassingly (see the mailing list thread in IGNITE-14039). I have
tried fixing it in the PR.
- For a reason I don't understand, the WAL state on a client node ends up
different from the server nodes after a WAL change is requested from that node.
- If a server node joins while a WAL state change is underway, it will get an
incorrect value in its CacheGroupDescriptor for that cache (too old or too new
compared to what the rest of the nodes consider current) and will be out of sync.
- When a cache is destroyed and recreated, nodes take the WAL enabled flag from
the maps populated at startup (see above), leading to an inconsistent WAL state
for the cache from the very first moment.
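For the blind toggle quoted above, a safer shape, in my view, would be for the
message to carry the target state and for the handler to assign it rather than
flip it. Sketch only - msg.enabled() is a hypothetical accessor, not the current
API:
{code}
// Sketch only: assumes the finish message carries (or could carry) the requested
// target state via a hypothetical msg.enabled() accessor.
if (msg.changed())
    grpDesc.walEnabled(msg.enabled());
{code}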
How I would like this feature to work, and how I have imagined it working:
- WAL enable/disable status is tied to a specific topology version. For example:
(7,0) - WAL enabled on cache "data"; (7,1) - WAL disabled on cache "data".
- A WAL enable/disable operation is a PME, meaning that only one such operation
may be in progress at any given time. The PME starts, all nodes perform a
checkpoint and change the flag, and only then is the PME lock released. Only
after that may the next PME operation, such as a node join or another WAL mode
change, take place.
- When a node first joins, it receives the mapping of cache to WAL
enabled/disabled status for its initial topology version.
- Then, as it initializes its caches and processes PME messages, the node has to
catch up to the most recent topology version. This involves iterating over all
topology versions between the initial and the current one and updating the WAL
enabled flag for caches wherever one of those versions requires it. A standalone
sketch of this bookkeeping follows below.
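A minimal standalone sketch of that bookkeeping - plain Java, none of these types
or methods are Ignite APIs, and the topology version is flattened to a single
long for brevity:
{code}
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;
import java.util.concurrent.ConcurrentHashMap;

public class VersionedWalStateSketch {
    /** Per cache group: topology version -> WAL enabled flag effective from that version on. */
    private final Map<String, NavigableMap<Long, Boolean>> history = new ConcurrentHashMap<>();

    /** Records a WAL state change agreed on the PME for the given topology version. */
    public void onWalStateChanged(String grpName, long topVer, boolean enabled) {
        history.computeIfAbsent(grpName, g -> new TreeMap<>()).put(topVer, enabled);
    }

    /** WAL state effective at the given topology version (defaults to enabled). */
    public boolean walEnabled(String grpName, long topVer) {
        NavigableMap<Long, Boolean> h = history.get(grpName);

        if (h == null)
            return true;

        Map.Entry<Long, Boolean> e = h.floorEntry(topVer);

        return e == null || e.getValue();
    }

    /** Catch-up for a joining node: apply every change after initialVer up to and including curVer. */
    public void catchUp(String grpName, long initialVer, long curVer) {
        NavigableMap<Long, Boolean> h = history.get(grpName);

        if (h == null)
            return;

        for (Map.Entry<Long, Boolean> e : h.subMap(initialVer, false, curVer, true).entrySet())
            applyLocally(grpName, e.getKey(), e.getValue());
    }

    private void applyLocally(String grpName, long topVer, boolean enabled) {
        System.out.println(grpName + ": WAL " + (enabled ? "enabled" : "disabled") + " as of version " + topVer);
    }
}
{code}
With the history keyed by topology version (the real thing would use the
two-component AffinityTopologyVersion instead of a long), the (7,0)/(7,1) example
above is just two adjacent entries, and a node that joined at (7,0) and catches
up to the current version replays the disable recorded at (7,1).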
I think a lot of people have always assumed it works in this fashion, but it
never did and it doesn't now.
> WAL disable/enable with node restarts results in mismatching state, data loss
> -----------------------------------------------------------------------------
>
> Key: IGNITE-13976
> URL: https://issues.apache.org/jira/browse/IGNITE-13976
> Project: Ignite
> Issue Type: Bug
> Components: cache
> Affects Versions: 2.9.1
> Reporter: Ilya Kasnacheev
> Assignee: Ilya Kasnacheev
> Priority: Blocker
>
> If you try to enable/disable WAL on an unstable topology, you can end up in a
> state where the WAL status is undefined, nodes may have different WAL status,
> and the only way to fix it is to restart the cluster, which leads to data loss
> because Ignite removes data if WAL is disabled on restart.
> See the reproducer in PR.