Sebastian Marsching created CASSANDRA-11000:
-----------------------------------------------

             Summary: Mixing LWT and non-LWT operations can result in an LWT 
operation being acknowledged but not applied
                 Key: CASSANDRA-11000
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-11000
             Project: Cassandra
          Issue Type: Bug
          Components: Coordination
         Environment: Cassandra 2.1, 2.2, and 3.0 on Linux and OS X.
            Reporter: Sebastian Marsching


When mixing lightweight transaction (LWT, a.k.a. compare-and-set, conditional 
update) operations with regular operations, it can happen that an LWT operation 
is acknowledged (applied = True) even though the update has not actually been 
applied and a subsequent SELECT still returns the old data.

For example, consider the following table:

CREATE TABLE test (
    pk text,
    ck text,
    v text,
    PRIMARY KEY (pk, ck)
);

We start with an empty table and insert data using a regular (non-LWT) 
operation:

INSERT INTO test (pk, ck, v) VALUES ('foo', 'bar', '123');

A subsequent SELECT statement returns the data as expected. Now we do a 
conditional update (LWT):

UPDATE test SET v = '456' WHERE pk = 'foo' AND ck = 'bar' IF v = '123';

As expected, the update is applied, and a subsequent SELECT statement shows the 
updated value.
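
For reference, the SELECT used to verify each step looks like this (the 
expected result is shown as a comment):

SELECT pk, ck, v FROM test WHERE pk = 'foo' AND ck = 'bar';
-- expected after the LWT above: v = '456'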

Now we do the same, but use a time stamp that is slightly in the future (e.g. a 
few seconds ahead) for the INSERT statement. Obviously, $time$ needs to be 
replaced by a time stamp that is slightly ahead of the system clock:

INSERT INTO test (pk, ck, v) VALUES ('foo', 'bar', '123') USING TIMESTAMP 
$time$;
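
By convention, time stamps in Cassandra are expressed as microseconds since the 
Unix epoch, so a suitable value can be obtained by taking the current time in 
microseconds and adding a few million. The concrete number below is purely 
illustrative:

INSERT INTO test (pk, ck, v) VALUES ('foo', 'bar', '123') USING TIMESTAMP 1453726805000000; -- current time + 5 s (illustrative)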

Now, running the same UPDATE statement still reports success (applied = True). 
However, a subsequent SELECT yields the old value ('123') instead of the 
updated value ('456'). Inspecting the time stamp of the value confirms that it 
has not been replaced (the value from the original INSERT is still in place).
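
The time stamp can be checked like this (the second column shows the write 
time of v):

SELECT v, WRITETIME(v) FROM test WHERE pk = 'foo' AND ck = 'bar';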

This behavior is exhibited in a single-node cluster running Cassandra 2.1.11, 
2.2.4, and 3.0.1.

Testing this in a multi-node cluster is a bit trickier, so I only tested it 
with Cassandra 2.2.4. Here, I made one of the nodes lag behind in time by a few 
seconds (using libfaketime). I used a replication factor of three for the test 
keyspace (see the CREATE KEYSPACE statement after the example below). In this 
case, the behavior can be demonstrated even without explicitly specifying a 
time stamp. Running

INSERT INTO test (pk, ck, v) VALUES ('foo', 'bar', '123');

on a node with the regular clock followed by

UPDATE test SET v = '456' WHERE pk = 'foo' AND ck = 'bar' IF v = '123';

on the node lagging behind results in the UPDATE reporting success, even though 
the old value is still in place.
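
For completeness, the test keyspace was created with a replication factor of 
three roughly like this (the keyspace name and replication strategy here are 
illustrative):

CREATE KEYSPACE test_ks WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};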

Interestingly, everything works as expected when LWT operations are used 
consistently: when running

UPDATE test SET v = '456' WHERE pk = 'foo' AND ck = 'bar' IF v = '123';
UPDATE test SET v = '123' WHERE pk = 'foo' AND ck = 'bar' IF v = '456';

in an alternating fashion on two nodes (one with a "normal" clock, one with the 
clock lagging behind), the updates are applied as expected. When checking the 
time stamps ("SELECT WRITETIME(v) FROM test;"), one can see that the time stamp 
is increased by just a single tick when the statement is executed on the node 
lagging behind.

I think that this problem is strongly related to (or maybe even the same as) 
the one described in CASSANDRA-7801, even though CASSANDRA-7801 was mainly 
concerned with a single-node cluster. However, the fact that this problem 
still exists in current versions of Cassandra makes me suspect that either it 
is a different problem or the original problem was not completely fixed by the 
patch from CASSANDRA-7801.

I found CASSANDRA-9655, which suggests removing the changes introduced with 
CASSANDRA-7801 because they can be problematic under certain circumstances, but 
I am not sure whether that is the right place to discuss the issue I am 
experiencing. If you think it is, feel free to close this issue and update the 
description of CASSANDRA-9655.

In my opinion, the best way to fix this problem would be to ensure that a write 
that is part of an LWT always uses a time stamp that is at least one tick 
greater than the time stamp of the existing data. As the existing data has to 
be read for checking the condition anyway, I do not think this would cause any 
additional overhead. If this is not possible, I suggest looking into whether 
we can somehow detect such a situation and at least report failure (applied = 
False) for the LWT instead of reporting success.
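
To illustrate the proposed rule with purely hypothetical numbers (time stamps 
in microseconds since the Unix epoch):

-- existing cell (from the future-dated INSERT): WRITETIME(v) = 1453726805000000
-- wall-clock time stamp at the moment of the LWT:              1453726800000000
-- time stamp the LWT write should use:
--   max(1453726800000000, 1453726805000000 + 1) = 1453726805000001

This also matches the single-tick increase observed in the alternating LWT 
experiment described above.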

The latter solution would at least fix those cases where code checks the 
success of an LWT before performing any further actions (e.g. because the LWT 
is used to take some kind of lock). Currently, such code will assume that the 
operation was successful (and thus, staying with the example, that it owns the 
lock), while other processes running in parallel will see a different state. It 
is my understanding that LWTs were designed to avoid exactly this situation, 
but at the moment the assumptions most users will make about LWTs do not always 
hold.

Until this issue is solved, I suggest at least updating the CQL documentation 
to clearly state that LWTs / conditional updates are not safe if data has 
previously been INSERTed / UPDATEd / DELETEd using non-LWT operations and there 
is clock skew, or if time stamps in the future have been supplied explicitly. 
This should at least save some users from making wrong assumptions about LWTs 
and only realizing it when their application fails in an unsafe way.



