[ https://issues.apache.org/jira/browse/IGNITE-13976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17270245#comment-17270245 ]
Ilya Kasnacheev commented on IGNITE-13976:
------------------------------------------
I will document a list of open questions and problems with the implementation
of WAL enable/disable:
* CacheGroupDescriptor has a volatile walEnabled field. CacheGroupDescriptor
describes the configuration of a cache group at some moment in time, for example
on node join. It is not topology-versioned. Therefore, if
CacheGroupDescriptor.walEnabled is "false", you have no idea at which topology
version that value became effective, and when you are asked to change it, you
have no guarantee that it is in sync with the rest of the cluster.
* CacheGroupContext also has localWalEnabled and globalWalEnabled. I'm not sure
what the difference is. In any case, CacheGroupDescriptor.walEnabled is changed
in a different place and a different thread than
CacheGroupContext.globalWalEnabled: the former is set as soon as it is decided
that the WAL change request should be handled, the latter only when the relevant
checkpoint finishes. This means a node may have already set
CacheGroupDescriptor.walEnabled to false, and may already be sending this value
to new nodes trying to join, while it has not yet set
CacheGroupContext.globalWalEnabled to false because it is still waiting on the
checkpoint to do so. A stripped-down illustration of this window follows below.
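To make the window concrete, here is a standalone illustration - this is not
Ignite code, the field and method names are made up, and the sleeps merely stand
in for "request accepted" versus "checkpoint finished":
{code}
public class WalFlagRaceSketch {
    // Made-up stand-ins for CacheGroupDescriptor.walEnabled
    // and CacheGroupContext.globalWalEnabled.
    static volatile boolean descriptorWalEnabled = true;
    static volatile boolean contextGlobalWalEnabled = true;

    public static void main(String[] args) throws Exception {
        Thread walDisableRequest = new Thread(() -> {
            descriptorWalEnabled = false;    // flipped as soon as the request is accepted
            sleep(100);                      // stand-in for "wait for the checkpoint"
            contextGlobalWalEnabled = false; // flipped only after the checkpoint
        });

        walDisableRequest.start();

        sleep(50); // a "joining node" observes the state inside the window

        System.out.println("descriptor says WAL enabled = " + descriptorWalEnabled
            + ", context says WAL enabled = " + contextGlobalWalEnabled);

        walDisableRequest.join();
    }

    static void sleep(long ms) {
        try {
            Thread.sleep(ms);
        }
        catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
{code}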
* Enabling/disabling WAL triggers a checkpoint _on client nodes_. Despite that,
there is separate logic for server nodes, and I'm not sure what it does.
* When a persistent node starts, it reads its current WAL disabled/enabled
status into the maps initiallyLocalWalDisabledGrps/initiallyGlobalWalDisabledGrps.
It does not, however, clear values from these maps when a cache is destroyed.
This means that when you recreate such a cache, the node will still reuse the old
"WAL disabled" value for it, and you may get a cache that is in an inconsistent
state from the start (see the sketch below). The very existence of these maps
outside of CacheGroupContext or CacheGroupDescriptor is a liability. I have tried
sidestepping it in IGNITE-14039.
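A standalone sketch of that staleness - again not Ignite code, the map and
method names here are invented just to show the shape of the problem:
{code}
import java.util.HashMap;
import java.util.Map;

public class StaleWalDisabledMapSketch {
    // Invented stand-in for initiallyGlobalWalDisabledGrps, keyed by cache group id.
    static final Map<Integer, Boolean> initiallyWalDisabled = new HashMap<>();

    public static void main(String[] args) {
        int grpId = 42;

        // Populated once, when the node starts, from the persisted WAL state.
        initiallyWalDisabled.put(grpId, true);

        destroyCache(grpId); // the map entry is NOT removed here

        // The group is recreated with the same id: it starts with WAL considered
        // disabled even though the new cache never asked for that.
        boolean walDisabledOnRecreate = initiallyWalDisabled.getOrDefault(grpId, false);

        System.out.println("recreated group starts with WAL disabled = " + walDisabledOnRecreate);
    }

    static void destroyCache(int grpId) {
        // Nothing calls initiallyWalDisabled.remove(grpId) - that is the problem.
    }
}
{code}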
* CacheGroupDescriptor has a list of WAL change requests. I don't understand why
we need a list of requests per cache: every one of them involves a PME, so at any
moment you need at most one WAL change per cache - either you need to toggle it,
or you leave it as it is.
* #onCachesInfoCollected will try to process the last WAL change request; I'm not
sure why it discards the rest. For this last request it issues a "changed true"
or "changed false" verdict, presumably to notify the rest of the nodes in the
cluster. However, it does not make sense to fail the operation here just because
a newcomer node ignored some of the WAL change requests and responded false only
to the last one.
* Did I already mention that it does not make sense to have more than one WAL
change request per cache, or to change any of these flags outside of the PME
blocking operation? If you do it inside the PME lock, you can rely on all nodes
doing it in sync.
* On node startup, CacheGroupDescriptor.walEnabled is received from peer nodes,
representing the WAL status at some unknown moment in time. That is already bad,
and it is compounded by the fact that this walEnabled flag is never propagated to
CacheGroupContext. So the cache will start in "WAL enabled" mode regardless of
the current cluster-wide WAL mode for this cache (a sketch of a possible fix is
below).
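One possible direction, sketched only - I am not claiming these are the exact
accessors or the right place to call them - would be to seed the context flag
from the descriptor when the group starts:
{code}
// Sketch only: accessor names and the call site are assumptions, not verified code.
private void onCacheGroupStart(CacheGroupDescriptor grpDesc, CacheGroupContext grpCtx) {
    // Seed the runtime flag from whatever the cluster reported for this group,
    // instead of silently starting in "WAL enabled" mode.
    grpCtx.globalWalEnabled(grpDesc.walEnabled());
}
{code}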
* CacheGroupDescriptor.walEnabled and CacheGroupContext.globalWalEnabled are
checked in seemingly random, different places while a WAL state change request is
handled. This is not documented anywhere.
* WalStateManager produces the following warning: "Received finish message for
unknown operation (will ignore)". However, it does not actually ignore anything
after this warning and continues the operation as usual.
* There's code which blindly toggles the "WAL enabled" value without checking
that, at this specific point, it will end up set to the desired value (a safer
shape is sketched after the failure scenarios below):
{code}
if (msg.changed())
    grpDesc.walEnabled(!grpDesc.walEnabled());
{code}
I observe at least four failure scenarios for this code:
- Enabling or disabling WAL while a baseline node is down: when the node comes
back, it is in an incorrect WAL state. This is checked by a sequential test and
always fails embarrassingly (see the mailing list thread in IGNITE-14039). I have
tried fixing it in the PR.
- For a reason I don't understand, the WAL state on a client node ends up
different from the server nodes after a WAL change is requested from that node.
- If a server node joins while a WAL state change is underway, it will get an
incorrect value in its CacheGroupDescriptor for that cache (too old or too new
compared to what the rest of the nodes consider current) and will be out of sync.
- When a cache is destroyed and recreated, nodes take the WAL enabled flag from
the maps populated at startup (see above), leading to an inconsistent WAL state
for the cache from the very first moment.
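For the blind toggle quoted above, a safer shape, in my view, would be for the
message to carry the target state and for the handler to assign it rather than
flip it. Sketch only - msg.enabled() is a hypothetical accessor, not the current
API:
{code}
// Sketch only: assumes the finish message carries (or could carry) the requested
// target state via a hypothetical msg.enabled() accessor.
if (msg.changed())
    grpDesc.walEnabled(msg.enabled());
{code}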
How I would like this feature to work, and how I have imagined it working:
- WAL enable/disable status is tied to a specific topology version. For example:
(7,0) - WAL enabled on cache "data"; (7,1) - WAL disabled on cache "data".
- A WAL enable/disable operation is a PME, meaning that only one such operation
may be in progress at any given time. The PME starts, all nodes perform a
checkpoint and change the flag, and only then is the PME lock released. Only
after that may the next PME operation, such as a node join or another WAL mode
change, take place.
- When a node first joins, it receives the mapping of cache to WAL
enabled/disabled status for its initial topology version.
- Then, as it initializes its caches and processes PME messages, the node has to
catch up to the most recent topology version. This involves iterating over all
topology versions between the initial and the current one and updating the WAL
enabled flag for caches wherever one of those versions requires it. A standalone
sketch of this bookkeeping follows below.
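A minimal standalone sketch of that bookkeeping - plain Java, none of these types
or methods are Ignite APIs, and the topology version is flattened to a single
long for brevity:
{code}
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;
import java.util.concurrent.ConcurrentHashMap;

public class VersionedWalStateSketch {
    /** Per cache group: topology version -> WAL enabled flag effective from that version on. */
    private final Map<String, NavigableMap<Long, Boolean>> history = new ConcurrentHashMap<>();

    /** Records a WAL state change agreed on the PME for the given topology version. */
    public void onWalStateChanged(String grpName, long topVer, boolean enabled) {
        history.computeIfAbsent(grpName, g -> new TreeMap<>()).put(topVer, enabled);
    }

    /** WAL state effective at the given topology version (defaults to enabled). */
    public boolean walEnabled(String grpName, long topVer) {
        NavigableMap<Long, Boolean> h = history.get(grpName);

        if (h == null)
            return true;

        Map.Entry<Long, Boolean> e = h.floorEntry(topVer);

        return e == null || e.getValue();
    }

    /** Catch-up for a joining node: apply every change after initialVer up to and including curVer. */
    public void catchUp(String grpName, long initialVer, long curVer) {
        NavigableMap<Long, Boolean> h = history.get(grpName);

        if (h == null)
            return;

        for (Map.Entry<Long, Boolean> e : h.subMap(initialVer, false, curVer, true).entrySet())
            applyLocally(grpName, e.getKey(), e.getValue());
    }

    private void applyLocally(String grpName, long topVer, boolean enabled) {
        System.out.println(grpName + ": WAL " + (enabled ? "enabled" : "disabled") + " as of version " + topVer);
    }
}
{code}
With the history keyed by topology version (the real thing would use the
two-component AffinityTopologyVersion instead of a long), the (7,0)/(7,1) example
above is just two adjacent entries, and a node that joined at (7,0) and catches
up to the current version replays the disable recorded at (7,1).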
I think a lot of people have always assumed it works in this fashion, but it
never did and it doesn't now.
> WAL disable/enable with node restarts results in mismatching state, data loss
> -----------------------------------------------------------------------------
>
> Key: IGNITE-13976
> URL: https://issues.apache.org/jira/browse/IGNITE-13976
> Project: Ignite
> Issue Type: Bug
> Components: cache
> Affects Versions: 2.9.1
> Reporter: Ilya Kasnacheev
> Assignee: Ilya Kasnacheev
> Priority: Blocker
>
> If you try to enable/disable WAL on an unstable topology, you can end up in a
> state where the WAL status is undefined, nodes may have different WAL status,
> and the only way to fix it is to restart the cluster, which leads to data loss
> because Ignite removes data if WAL is disabled on restart.
> See the reproducer in PR.