Hi,

We've done this upgrade in both dev and stage before and did not see
similar issues. After upgrading production today, however, we are seeing a
lot of issues.

The main issue is that the Datastax client quite often does not get the
data back (even though it is the same query every time). I see similar
flakiness by simply running cqlsh; when it does return something, the data
is broken.

We are running a 3 node cluster with RF 3.

I have this table:

CREATE TABLE keyspace.table (
    a text,
    b text,
    c text,
    d list<text>,
    e text,
    f timestamp,
    g list<text>,
    h timestamp,
    PRIMARY KEY (a, b, c)
);
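
For reference, this is roughly how the application reads that table. I'm
assuming the Java driver here purely for illustration (names are the same
anonymised placeholders as above, and the consistency level shown is just
an example, not necessarily what we run with):

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.ConsistencyLevel;
    import com.datastax.driver.core.ResultSet;
    import com.datastax.driver.core.Row;
    import com.datastax.driver.core.Session;
    import com.datastax.driver.core.SimpleStatement;
    import com.datastax.driver.core.Statement;

    public class ReadExample {
        public static void main(String[] args) {
            // Contact point is a placeholder for one of our three nodes.
            try (Cluster cluster = Cluster.builder().addContactPoint("10.0.0.1").build();
                 Session session = cluster.connect("keyspace")) {

                // Same query as the cqlsh one shown below.
                Statement stmt = new SimpleStatement(
                        "SELECT * FROM table WHERE a = ? AND b = ?", "xxx", "xxx")
                        .setConsistencyLevel(ConsistencyLevel.QUORUM);

                ResultSet rs = session.execute(stmt);
                for (Row row : rs) {
                    // One line per clustering key; h is the second timestamp column.
                    System.out.println(row.getString("c") + " -> " + row.getTimestamp("h"));
                }
            }
        }
    }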


Roughly every other time I query (not literally every other time, it seems random) I get:


SELECT * from table where a = 'xxx' and b = 'xxx'

 a   | b   | c   | d    | e    | f                               | g       | h
-----+-----+-----+------+------+---------------------------------+---------+---------------------------------
 xxx | xxx | ccc | null | null | 2089-11-30 23:00:00.000000+0000 | ['fff'] | 2014-12-31 23:00:00.000000+0000
 xxx | xxx | ddd | null | null | 2099-01-01 00:00:00.000000+0000 | ['fff'] | 2016-06-17 13:29:36.000000+0000


Which is the expected output.


But I also get:

 a   | b   | c   | d    | e    | f                               | g       | h
-----+-----+-----+------+------+---------------------------------+---------+---------------------------------
 xxx | xxx | ccc | null | null | null                            | null    | null
 xxx | xxx | ccc | null | null | 2089-11-30 23:00:00.000000+0000 | ['fff'] | null
 xxx | xxx | ccc | null | null | null                            | null    | 2014-12-31 23:00:00.000000+0000
 xxx | xxx | ddd | null | null | null                            | null    | null
 xxx | xxx | ddd | null | null | 2099-01-01 00:00:00.000000+0000 | ['fff'] | null
 xxx | xxx | ddd | null | null | null                            | null    | 2016-06-17 13:29:36.000000+0000


Notice that the same primary key is returned three times, each row carrying
a different part of the data. I believe this is what is currently killing
our production environment.


I'm running upgradesstables as I write this, but it has not finished yet. I
started a repair earlier, but nothing happened. upgradesstables has now
finished on 2 out of 3 nodes, but production is still down :/
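
For completeness, these are more or less the commands I've been running on
each node (keyspace and table names anonymised as above, and I may have
used slightly different arguments):

    nodetool upgradesstables keyspace table
    nodetool repair keyspace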


We also see these in the logs, over and over again:

DEBUG [ReadRepairStage:4] 2016-06-21 15:44:01,119 ReadCallback.java:235 - Digest mismatch:
org.apache.cassandra.service.DigestMismatchException: Mismatch for key DecoratedKey(-1566729966326640413, 336b35356c49537731797a4a5f64627a797236) (b3dcfcbeed6676eae7ff88cc1bd251fb vs 6e7e9225871374d68a7cdb54ae70726d)
        at org.apache.cassandra.service.DigestResolver.resolve(DigestResolver.java:85) ~[apache-cassandra-3.5.0.jar:3.5.0]
        at org.apache.cassandra.service.ReadCallback$AsyncRepairRunner.run(ReadCallback.java:226) ~[apache-cassandra-3.5.0.jar:3.5.0]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [na:1.8.0_72]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [na:1.8.0_72]
        at java.lang.Thread.run(Thread.java:745) [na:1.8.0_72]


Any help is much appreciated.
