So I figured out the main cause of the problem: the node's seed was set to itself, which is what got it into a weird state. The second part was that I didn't know the default repair is incremental, because I was accidentally looking at the documentation for the wrong version. After running a repair -full, the 3 other nodes now seem to be synced correctly, as they have identical loads. Strangely, the problem node 10.128.0.20 now has 10 GB of load (the others have 6 GB). Since I now know I started it off in a very weird state, I'm going to just decommission it and add it back in from scratch. (To answer the earlier question about the data and cache directories: all working folders were cleared when I originally added it.)
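For anyone who runs into the same thing, here's roughly what the fix looked like on my end (a sketch, not my exact shell history; the IPs and the data/saved_caches paths are the ones from my setup and from Mike's question below, and the seed_provider block is the stock cassandra.yaml format):

# cassandra.yaml on 10.128.0.20: seeds should point at an existing node
# (e.g. 10.128.0.3), never at the node itself
seed_provider:
    - class_name: org.apache.cassandra.locator.SimpleSeedProvider
      parameters:
          - seeds: "10.128.0.3"

# force a full repair; plain 'nodetool repair' is incremental on 2.2 and later
nodetool repair -full

# since the node bootstrapped in a bad state, take it out and re-add it cleanly
nodetool decommission            # run on 10.128.0.20
rm -rf /var/lib/cassandra/data/* /var/lib/cassandra/saved_caches/*
# (clear the commitlog directory too), then fix the seeds line above and
# restart Cassandra so the node bootstraps from scratch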
I feel Cassandra should throw an error and fail to bootstrap / join if the seed node is set to itself.

On Wed, May 25, 2016 at 2:37 AM Mike Yeap <wkk1...@gmail.com> wrote:

> Hi Luke, I've encountered a similar problem before, could you please advise on the following?
>
> 1) when you add 10.128.0.20, what are the seeds defined in cassandra.yaml?
>
> 2) when you add 10.128.0.20, were the data and cache directories in 10.128.0.20 empty?
>
> - /var/lib/cassandra/data
> - /var/lib/cassandra/saved_caches
>
> 3) if you do a compact in 10.128.0.3, what is the size shown in the "Load" column in "nodetool status <keyspace_name>"?
>
> 4) when you do the full repair, did you use "nodetool repair" or "nodetool repair -full"? I'm asking this because Incremental Repair is the default for Cassandra 2.2 and later.
>
> Regards,
> Mike Yeap
>
> On Wed, May 25, 2016 at 8:01 AM, Bryan Cheng <br...@blockcypher.com> wrote:
>
>> Hi Luke,
>>
>> I've never found nodetool status' load to be useful beyond a general indicator.
>>
>> You should expect some small skew, as this will depend on your current compaction status, tombstones, etc. IIRC repair will not provide consistency of intermediate states nor will it remove tombstones, it only guarantees consistency in the final state. This means, in the case of dropped hints or mutations, you will see differences in intermediate states, and therefore storage footprint, even in fully repaired nodes. This includes intermediate UPDATE operations as well.
>>
>> Your one node with sub 1GB sticks out like a sore thumb, though. Where did you originate the nodetool repair from? Remember that repair will only ensure consistency for ranges held by the node you're running it on. While I am not sure if missing ranges are included in this, if you ran nodetool repair only on a machine with partial ownership, you will need to complete repairs across the ring before data will return to full consistency.
>>
>> I would query some older data using consistency = ONE on the affected machine to determine if you are actually missing data. There are a few outstanding bugs in the 2.1.x and older release families that may result in tombstone creation even without deletes, for example CASSANDRA-10547, which impacts updates on collections in pre-2.1.13 Cassandra.
>>
>> You can also try examining the output of nodetool ring, which will give you a breakdown of tokens and their associations within your cluster.
>>
>> --Bryan
>>
>> On Tue, May 24, 2016 at 3:49 PM, kurt Greaves <k...@instaclustr.com> wrote:
>>
>>> Not necessarily, considering RF is 2, so both nodes should have all partitions. Luke, are you sure the repair is succeeding? You don't have other keyspaces/duplicate data/extra data in your cassandra data directory? Also, you could try querying on the node with less data to confirm whether it has the same dataset.
>>>
>>> On 24 May 2016 at 22:03, Bhuvan Rawal <bhu1ra...@gmail.com> wrote:
>>>
>>>> For the other DC, it can be acceptable because each partition resides on one node, so if you have a large partition, it may skew things a bit.
>>>>
>>>> On May 25, 2016 2:41 AM, "Luke Jolly" <l...@getadmiral.com> wrote:
>>>>
>>>>> So I guess the problem may have been with the initial addition of the 10.128.0.20 node, because when I added it in it never synced data, I guess? It was at around 50 MB when it first came up and transitioned to "UN".
>>>>>
>>>>> After it was in, I did the 1->2 replication change and tried repair, but it didn't fix it. From what I can tell, all the data on it is stuff that has been written since it came up. We never delete data ever, so we should have zero tombstones.
>>>>>
>>>>> If I am not mistaken, only two of my nodes actually have all the data, 10.128.0.3 and 10.142.0.14, since they agree on the data amount. 10.142.0.13 is almost a GB lower, and then of course there is 10.128.0.20, which is missing over 5 GB of data. I tried running nodetool repair -local on both DCs and it didn't fix either one.
>>>>>
>>>>> Am I running into a bug of some kind?
>>>>>
>>>>> On Tue, May 24, 2016 at 4:06 PM Bhuvan Rawal <bhu1ra...@gmail.com> wrote:
>>>>>
>>>>>> Hi Luke,
>>>>>>
>>>>>> You mentioned that the replication factor was increased from 1 to 2. In that case, was the node bearing IP 10.128.0.20 carrying around 3GB of data earlier?
>>>>>>
>>>>>> You can run nodetool repair with the option -local to initiate a repair of the local datacenter for gce-us-central1.
>>>>>>
>>>>>> You may also suspect that, if a lot of data was deleted while the node was down, it may be holding a lot of tombstones which do not need to be replicated to the other node. To verify this, you can issue a select count(*) query on the column families (with the amount of data you have it should not be an issue), with tracing on and with consistency local_all, by connecting to either 10.128.0.3 or 10.128.0.20, and store the output in a file. It will give you a fair idea of how many deleted cells the nodes have. I tried searching for a reference on whether tombstones are moved around during repair, but I didn't find evidence of it. However, I see no reason they would be, because if the node didn't have the data then streaming tombstones does not make a lot of sense.
>>>>>>
>>>>>> Regards,
>>>>>> Bhuvan
>>>>>>
>>>>>> On Tue, May 24, 2016 at 11:06 PM, Luke Jolly <l...@getadmiral.com> wrote:
>>>>>>
>>>>>>> Here's my setup:
>>>>>>>
>>>>>>> Datacenter: gce-us-central1
>>>>>>> ===========================
>>>>>>> Status=Up/Down
>>>>>>> |/ State=Normal/Leaving/Joining/Moving
>>>>>>> --  Address      Load       Tokens  Owns (effective)  Host ID                               Rack
>>>>>>> UN  10.128.0.3   6.4 GB     256     100.0%            3317a3de-9113-48e2-9a85-bbf756d7a4a6  default
>>>>>>> UN  10.128.0.20  943.08 MB  256     100.0%            958348cb-8205-4630-8b96-0951bf33f3d3  default
>>>>>>>
>>>>>>> Datacenter: gce-us-east1
>>>>>>> ========================
>>>>>>> Status=Up/Down
>>>>>>> |/ State=Normal/Leaving/Joining/Moving
>>>>>>> --  Address      Load       Tokens  Owns (effective)  Host ID                               Rack
>>>>>>> UN  10.142.0.14  6.4 GB     256     100.0%            c3a5c39d-e1c9-4116-903d-b6d1b23fb652  default
>>>>>>> UN  10.142.0.13  5.55 GB    256     100.0%            d0d9c30e-1506-4b95-be64-3dd4d78f0583  default
>>>>>>>
>>>>>>> And my replication settings are:
>>>>>>>
>>>>>>> {'class': 'NetworkTopologyStrategy', 'aws-us-west': '2', 'gce-us-central1': '2', 'gce-us-east1': '2'}
>>>>>>>
>>>>>>> As you can see, 10.128.0.20 in the gce-us-central1 DC only has a load of 943 MB even though it's supposed to own 100% and should have 6.4 GB. 10.142.0.13 also seems not to have everything, as it only has a load of 5.55 GB.
>>>>>>>
>>>>>>> On Mon, May 23, 2016 at 7:28 PM, kurt Greaves <k...@instaclustr.com> wrote:
>>>>>>>
>>>>>>>> Do you have 1 node in each DC or 2?
>>>>>>>> If you're saying you have 1 node in each DC then a RF of 2 doesn't make sense. Can you clarify what your setup is?
>>>>>>>>
>>>>>>>> On 23 May 2016 at 19:31, Luke Jolly <l...@getadmiral.com> wrote:
>>>>>>>>
>>>>>>>>> I am running 3.0.5 with 2 nodes in two DCs, gce-us-central1 and gce-us-east1. I increased the replication factor of gce-us-central1 from 1 to 2. Then I ran 'nodetool repair -dc gce-us-central1'. The "Owns" for the node switched to 100% as it should, but the Load showed that it didn't actually sync the data. I then ran a full 'nodetool repair' and it still didn't fix it. This scares me, as I thought 'nodetool repair' was a way to assure consistency and that all the nodes were synced, but it doesn't seem to be. Outside of that command, I have no idea how I would assure all the data was synced or how to get the data correctly synced without decommissioning the node and re-adding it.
>>>>>>>>
>>>>>>>> --
>>>>>>>> Kurt Greaves
>>>>>>>> k...@instaclustr.com
>>>>>>>> www.instaclustr.com
>>>
>>> --
>>> Kurt Greaves
>>> k...@instaclustr.com
>>> www.instaclustr.com