Re: Cluster not working after upgrade from 2.1.12 to 3.5.0

2016-06-21 Thread Bryan Cheng
Hi Oskar,

I know this won't help you as quickly as you would like, but please consider
updating the JIRA issue with details of your environment, as it may help
move the investigation along.

Good luck!

On Tue, Jun 21, 2016 at 12:21 PM, Julien Anguenot  wrote:

> You could try to sstabledump that one corrupted table, write some
> (Python) code to get rid of the duplicates by processing that sstabledump
> output (might not be bulletproof depending on the data, I agree), then
> truncate the table and re-insert the rows without duplicates.

Re: Cluster not working after upgrade from 2.1.12 to 3.5.0

2016-06-21 Thread Julien Anguenot
You could try to sstabledump that one corrupted table, write some
(Python) code to get rid of the duplicates by processing that sstabledump
output (might not be bulletproof depending on the data, I agree), then
truncate the table and re-insert the rows without duplicates.
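
A rough sketch of that idea in Python, for what it's worth. It assumes the
JSON layout sstabledump prints for 3.x sstables (a list of partitions, each
with "rows" made of "cells"); the dump path and the last-writer-wins cell
merge are only illustrative, so check the merge rule against your data
before re-inserting anything:

import json
import sys
from collections import OrderedDict

def merge_rows(dump_path):
    """Collapse duplicate (partition key, clustering) rows into one cell map each."""
    with open(dump_path) as f:
        partitions = json.load(f)

    merged = OrderedDict()
    for partition in partitions:
        pk = tuple(partition["partition"]["key"])
        for row in partition.get("rows", []):
            if row.get("type") != "row":
                continue  # skip range tombstone markers and the like
            key = (pk, tuple(row.get("clustering", [])))
            cells = merged.setdefault(key, {})
            for cell in row.get("cells", []):
                # naive merge: the last value seen for a column wins;
                # compare cell timestamps here if you need to be strict
                cells[cell["name"]] = cell.get("value")
    return merged

if __name__ == "__main__":
    # usage: python dedupe_dump.py dump.json
    # where dump.json came from `sstabledump <sstable>-Data.db > dump.json`
    for (pk, clustering), cells in merge_rows(sys.argv[1]).items():
        print(pk, clustering, cells)

The merged rows can then be turned into INSERT statements and written back
after truncating the table.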

On Tue, Jun 21, 2016 at 11:52 AM, Oskar Kjellin  wrote:
> Hmm, no way we can do that in prod :/
>
> Sent from my iPhone

Re: Cluster not working after upgrade from 2.1.12 to 3.5.0

2016-06-21 Thread Julien Anguenot
AFAICT, the issue does not seem to be driver related, as the duplicates
were showing up both in cqlsh and through the Java driver. In addition,
the sstabledump output contained the actual duplicates (see the Jira
issue).
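
If it helps to confirm whether your tables show the same thing on disk, a
small check along these lines will flag any primary key that appears more
than once in a dump (assuming the JSON layout sstabledump prints for 3.x
sstables; the dump path is a placeholder):

import json
import sys
from collections import Counter

def duplicate_keys(dump_path):
    """Count how many row entries each (partition key, clustering) has in a dump."""
    counts = Counter()
    with open(dump_path) as f:
        for partition in json.load(f):
            pk = tuple(partition["partition"]["key"])
            for row in partition.get("rows", []):
                if row.get("type") == "row":
                    counts[(pk, tuple(row.get("clustering", [])))] += 1
    return {key: n for key, n in counts.items() if n > 1}

if __name__ == "__main__":
    for key, n in duplicate_keys(sys.argv[1]).items():
        print(n, "copies of", key)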

On Tue, Jun 21, 2016 at 12:04 PM, Oskar Kjellin  wrote:
> Did you see similar issues when querying using a driver? Because we get no
> results in the driver whatsoever.
>
> Sent from my iPhone

Re: Cluster not working after upgrade from 2.1.12 to 3.5.0

2016-06-21 Thread Oskar Kjellin
Did you see similar issues when querying using a driver? Because we get no
results in the driver whatsoever.
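
For comparison, the same read can also be issued from code at explicit
consistency levels, to see whether the driver really returns nothing or
whether it depends on which replicas answer; a sketch with the DataStax
Python driver (contact point, keyspace and the query values are
placeholders):

from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

cluster = Cluster(["127.0.0.1"])          # placeholder contact point
session = cluster.connect("keyspace")     # placeholder keyspace

query = "SELECT * FROM table WHERE a = 'xxx' AND b = 'xxx'"
for cl in (ConsistencyLevel.ONE, ConsistencyLevel.QUORUM, ConsistencyLevel.ALL):
    rows = list(session.execute(SimpleStatement(query, consistency_level=cl)))
    print(cl, "->", len(rows), "rows")

cluster.shutdown()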

Sent from my iPhone

> On 21 juni 2016, at 18:50, Julien Anguenot  wrote:
> 
> See my comments on the issue: I had to truncate and reinsert data in
> these corrupted tables.
> 
> AFAIK, there is no evidence that UDTs are responsible for this bad behavior.

Re: Cluster not working after upgrade from 2.1.12 to 3.5.0

2016-06-21 Thread Oskar Kjellin
Hmm, no way we can do that in prod :/

Sent from my iPhone

> On 21 juni 2016, at 18:50, Julien Anguenot  wrote:
> 
> See my comments on the issue: I had to truncate and reinsert data in
> these corrupted tables.
> 
> AFAIK, there is no evidence that UDTs are responsible for this bad behavior.

Re: Cluster not working after upgrade from 2.1.12 to 3.5.0

2016-06-21 Thread Julien Anguenot
See my comments on the issue: I had to truncate and reinsert data in
these corrupted tables.

AFAIK, there is no evidence that UDTs are responsible for this bad behavior.
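
A sketch of what that truncate-and-reinsert step can look like with the
DataStax Python driver (not the code actually used here): it assumes the
duplicate rows have already been collapsed into one cell map per primary
key, the contact point and keyspace/table names are placeholders, and only
a few of the columns from Oskar's schema are shown. Values coming out of
sstabledump are strings, so timestamps and collections would still need
converting before a real re-insert:

from cassandra.cluster import Cluster

def reinsert(merged_rows, contact_points=("127.0.0.1",)):
    """Truncate the table and write back one row per primary key.

    merged_rows maps ((a,), (b, c)) primary-key tuples to a dict of the
    cell values that should survive the merge.
    """
    cluster = Cluster(list(contact_points))
    session = cluster.connect("keyspace")
    try:
        session.execute("TRUNCATE table")
        insert = session.prepare(
            "INSERT INTO table (a, b, c, f, g, h) VALUES (?, ?, ?, ?, ?, ?)")
        for ((a,), (b, c)), cells in merged_rows.items():
            session.execute(insert, (a, b, c,
                                     cells.get("f"), cells.get("g"),
                                     cells.get("h")))
    finally:
        cluster.shutdown()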

On Tue, Jun 21, 2016 at 11:45 AM, Oskar Kjellin  wrote:
> Yea I saw that one. We're not using UDT in the affected tables tho.
>
> Did you resolve it?
>
> Sent from my iPhone

Re: Cluster not working after upgrade from 2.1.12 to 3.5.0

2016-06-21 Thread Oskar Kjellin
Yea I saw that one. We're not using UDT in the affected tables tho. 

Did you resolve it?

Sent from my iPhone

> On 21 juni 2016, at 18:27, Julien Anguenot  wrote:
> 
> I have experienced similar duplicate primary-key behavior with a couple
> of tables after upgrading from 2.2.x to 3.0.x.
> 
> See the comments on the Jira issue I opened at the time:
> https://issues.apache.org/jira/browse/CASSANDRA-11887
> 
> -- 
> Julien Anguenot (@anguenot)


Re: Cluster not working after upgrade from 2.1.12 to 3.5.0

2016-06-21 Thread Julien Anguenot
I have experienced similar duplicate primary-key behavior with a couple
of tables after upgrading from 2.2.x to 3.0.x.

See the comments on the Jira issue I opened at the time:
https://issues.apache.org/jira/browse/CASSANDRA-11887


On Tue, Jun 21, 2016 at 10:47 AM, Oskar Kjellin  wrote:
> Hi,
>
> We've done this upgrade in both dev and stage before and we did not see
> similar issues.
> After upgrading production today we have a lot of issues tho.
>
> The main issue is that the DataStax client quite often does not get the data
> (even though it's the same query). I see similar flakiness by simply running
> cqlsh, although when it does return, it returns broken data.
>
> We are running a 3-node cluster with RF 3.
>
> I have this table
>
> CREATE TABLE keyspace.table (
>     a text,
>     b text,
>     c text,
>     d list,
>     e text,
>     f timestamp,
>     g list,
>     h timestamp,
>     PRIMARY KEY (a, b, c)
> )
>
>
> Every other time I query (not exactly every other time, it's random) I get:
>
>
> SELECT * from table where a = 'xxx' and b = 'xxx'
>
>  a   | b   | c   | d    | e    | f                        | g       | h
> -----+-----+-----+------+------+--------------------------+---------+--------------------------
>  xxx | xxx | ccc | null | null | 2089-11-30 23:00:00.00+ | ['fff'] | 2014-12-31 23:00:00.00+
>  xxx | xxx | ddd | null | null | 2099-01-01 00:00:00.00+ | ['fff'] | 2016-06-17 13:29:36.00+
>
>
> Which is the expected output.
>
>
> But I also get:
>
>  a   | b   | c   | d    | e    | f                        | g       | h
> -----+-----+-----+------+------+--------------------------+---------+--------------------------
>  xxx | xxx | ccc | null | null | null                     | null    | null
>  xxx | xxx | ccc | null | null | 2089-11-30 23:00:00.00+ | ['fff'] | null
>  xxx | xxx | ccc | null | null | null                     | null    | 2014-12-31 23:00:00.00+
>  xxx | xxx | ddd | null | null | null                     | null    | null
>  xxx | xxx | ddd | null | null | 2099-01-01 00:00:00.00+ | ['fff'] | null
>  xxx | xxx | ddd | null | null | null                     | null    | 2016-06-17 13:29:36.00+
>
>
> Notice that the same PK is returned 3 times, with different parts of the
> data. I believe this is what's currently killing our production environment.
>
>
> I'm running upgradesstables at the moment, but it hasn't finished yet. I
> started a repair before that, but nothing happened. upgradesstables has now
> finished on 2 out of 3 nodes, but production is still down :/
>
>
> We also see these in the logs, over and over again:
>
> DEBUG [ReadRepairStage:4] 2016-06-21 15:44:01,119 ReadCallback.java:235 - Digest mismatch:
> org.apache.cassandra.service.DigestMismatchException: Mismatch for key DecoratedKey(-1566729966326640413, 336b35356c49537731797a4a5f64627a797236) (b3dcfcbeed6676eae7ff88cc1bd251fb vs 6e7e9225871374d68a7cdb54ae70726d)
>     at org.apache.cassandra.service.DigestResolver.resolve(DigestResolver.java:85) ~[apache-cassandra-3.5.0.jar:3.5.0]
>     at org.apache.cassandra.service.ReadCallback$AsyncRepairRunner.run(ReadCallback.java:226) ~[apache-cassandra-3.5.0.jar:3.5.0]
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [na:1.8.0_72]
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [na:1.8.0_72]
>     at java.lang.Thread.run(Thread.java:745) [na:1.8.0_72]
>
>
> Any help is much appreciated



-- 
Julien Anguenot (@anguenot)