[
https://issues.apache.org/jira/browse/CASSANDRA-12438?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Benedict Elliott Smith updated CASSANDRA-12438:
-----------------------------------------------
Component/s: Feature/Lightweight Transactions
> Data inconsistencies with lightweight transactions, serial reads, and
> rejoining node
> ------------------------------------------------------------------------------------
>
> Key: CASSANDRA-12438
> URL: https://issues.apache.org/jira/browse/CASSANDRA-12438
> Project: Cassandra
> Issue Type: Bug
> Components: Feature/Lightweight Transactions, Legacy/Coordination
> Reporter: Steven Schaefer
> Priority: Normal
>
> I've run into some issues with data inconsistency in a situation where a
> single node is rejoining a 3-node cluster with RF=3. I'm running 3.7.
> I have a client system which inserts data into a table with around 7 columns,
> named, let's say, A-F, id, and version. LWTs are used to make the inserts and
> updates.
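> For illustration, here's a rough sketch of the schema and the initial LWT
> insert. The keyspace, table, and column names are placeholders, not the real
> schema:
> {noformat}
> -- Hypothetical table standing in for the real one; actual names/types differ.
> CREATE TABLE ks.items (
>     id      uuid PRIMARY KEY,
>     a       text,
>     b       text,
>     c       text,
>     d       text,
>     e       text,
>     f       text,
>     version int
> );
>
> -- Initial insert, done as an LWT so it only succeeds once.
> INSERT INTO ks.items (id, a, b, c, d, e, f, version)
> VALUES (123e4567-e89b-12d3-a456-426614174000,
>         'V_a1', 'V_b1', 'V_c1', 'V_d1', 'V_e1', 'V_f1', 1)
> IF NOT EXISTS;
> {noformat}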
> Typically there's an insert of values <id, V_a1, V_b1, ... , version=1>; then
> another process picks up rows where, for example, A=V_a1 and updates A to
> V_a2 and version to 2. Yet another process watches for A=V_a2 and makes a
> second update to the same column, setting version to 3, with the end result
> being <id, V_a3, V_b1, ... , V_f1, version=3>. There's a secondary index on
> the A column (there are only a few possible values for A, so I'm not worried
> about the cardinality issue), though I've reproduced the problem with the new
> SASI index too.
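> A sketch of the index and the conditional update chain just described, again
> with the placeholder names from above:
> {noformat}
> -- Secondary index on the low-cardinality A column
> -- (the same behaviour reproduces with a SASI index).
> CREATE INDEX items_a_idx ON ks.items (a);
>
> -- First processor: advance A from V_a1 to V_a2 via an LWT.
> UPDATE ks.items SET a = 'V_a2', version = 2
> WHERE id = 123e4567-e89b-12d3-a456-426614174000
> IF a = 'V_a1' AND version = 1;
>
> -- Second processor: advance A from V_a2 to V_a3 via an LWT.
> UPDATE ks.items SET a = 'V_a3', version = 3
> WHERE id = 123e4567-e89b-12d3-a456-426614174000
> IF a = 'V_a2' AND version = 2;
> {noformat}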
> If one of the nodes is down, there are still two alive for quorum, so inserts
> can still happen. When I bring the downed node back up, sometimes I get
> really weird state back, which ultimately crashes the client system that's
> talking to Cassandra.
> When reading I always select all the columns, but with a conditional WHERE
> clause on A=V_a2 (e.g. SELECT * FROM table WHERE A=V_a2). This read is for
> processing any rows with V_a2 and ultimately updating them to V_a3 when
> complete. While periodically polling for A=V_a2 it is of course possible for
> the poller to observe the old V_a2 value while other parts of the client
> system are making the update to V_a3, and that's generally OK because of the
> LWTs used for updates; an occasionally wasted reprocessing run isn't a big
> deal. But when reading at SERIAL I always expect to also get the original
> values for columns that were never updated, and if a Paxos update is in
> progress I expect it to be completed before its value(s) are returned.
> Instead, the read seems to see a partial commit of the LWT, returning the old
> V_a2 value for the changed column but no values whatsoever for the other
> columns. From the example above, instead of getting <id, V_a3, V_b1, ... ,
> version=3>, or even the older <id, V_a2, V_b1, ..., version=2> (either of
> which I expect and is OK), I get only <id, V_a2, version=2>, so the rest of
> the columns end up null, which I never expect. However, this isn't
> persistent; Cassandra does end up consistent, which I see via sstabledump and
> cqlsh after the fact.
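> For reference, the polling read is roughly the following; in cqlsh the serial
> read would look something like this (again with the placeholder names from
> above):
> {noformat}
> -- Poller query: select all columns, filtered on the indexed A column.
> -- Reading at SERIAL, so any in-progress Paxos update should be completed
> -- before its values are returned.
> CONSISTENCY SERIAL;
> SELECT * FROM ks.items WHERE a = 'V_a2';
> {noformat}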
> In my client system logs I record the inserts and updates, and this
> inconsistency happens around the same time as the update from V_a2 to V_a3,
> hence my comment about Cassandra seeing a partial commit. That leads me to
> suspect that, because of the WHERE clause on A=V_a2 in my read query, one of
> the original good nodes may already have the new V_a3 value and so doesn't
> return the row for the select, while the other good node and the one that was
> down still have the old value V_a2, so those two nodes return what they have.
> The node that was down doesn't yet have the original insert, just the update
> from V_a1 -> V_a2 (again, I suspect; it hasn't been easy to verify), which
> would explain where <id, V_a2, version=2> comes from: that's all it knows
> about. However, since it's a serial quorum read, I'd expect some sort of
> exception, as the remaining two nodes with A=V_a2 wouldn't be able to reach
> quorum on the values for all the columns; I'd expect the other good node to
> return <id, V_a2, V_b1, ..., version=2>.
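> To make the suspected scenario concrete, this is the per-replica state I'm
> imagining at the time of the read (a hypothesis, not something I've verified):
> {noformat}
> node1 (good):     <id, V_a3, V_b1, ..., version=3>  // has V_a3, so the row is
>                                                     // filtered out by A=V_a2
> node2 (good):     <id, V_a2, V_b1, ..., version=2>  // still on V_a2, returns
>                                                     // the full row
> node3 (rejoined): <id, V_a2, version=2>             // only saw the V_a1 -> V_a2
>                                                     // update, missed the insert
> {noformat}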
> I know at some point nodetool repair should be run on this node, but I'm
> concerned about the window of time between when the node comes back up and
> when repair starts/completes. It almost seems like, if a node goes down, the
> safest bet is to remove it from the cluster and rebuild it instead of simply
> restarting it? However, I haven't tested whether that runs into a similar
> situation.
> It is of course possible to work around the inconsistency for now by
> detecting and ignoring it in the client system, but if there is indeed a bug
> I hope we can identify it and ultimately resolve it.
> I'm also curious whether this relates to CASSANDRA-12126; CASSANDRA-11219 may
> also be relevant.
> I've been reproducing with a combination of manually stopping one node,
> running a test script I have to trigger the client system to insert data,
> then manually restarting the node and waiting. It's consistently
> inconsistent, reproducing on most attempts.
> Summary timeline:
> {noformat}
> 1. Shut down third node of three.
> 2. Insert <id, V_a1, V_b1, ... , version=1> (along with many others)
> 3. Start the third node. (start occurs concurrently with 4 & 5)
> 4. Update <id, V_a2, V_b1, ... , version=2> (along with others for which A=V_a1)
> 5. Update <id, V_a3, V_b1, ... , version=3> (along with many others for which A=V_a2)
> 6. Read
>    a. Expected: <id, V_a2, V_b1, ... , version=2> OR <id, V_a3, V_b1, ... , version=3>
>    b. Actual:   <id, V_a2, version=2> // some fields are null
> {noformat}