Hi,

> CREATE KEYSPACE mykeyspace WITH replication = {'class':
> 'SimpleStrategy', 'replication_factor': '2'}  AND durable_writes = false;


The SimpleStrategy is never recommended for production clusters as it is not
aware of racks or datacenters, which can induce availability issues and
unpredictable latency when those are in use. I would not even use it for
testing purposes; I see no point in most cases.

Even though this should be changed, carefully but as soon as possible imho, it
is probably not related to your main issue at hand.
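
If/when you do change it, a minimal sketch could look like the following,
assuming a single datacenter named 'dc1' (use the datacenter name reported by
'nodetool status' on your cluster, 'dc1' here is only an example):

# Switch the keyspace to NetworkTopologyStrategy, keeping RF = 2
cqlsh -e "ALTER KEYSPACE mykeyspace WITH replication = {'class': 'NetworkTopologyStrategy', 'dc1': '2'};"

# It is usually recommended to run a repair afterwards, so replicas end up
# where the new strategy expects them
nodetool repair -full mykeyspace

Double check the datacenter name and plan this change carefully.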

If nodes are imbalanced, there are 3 main questions that come to my mind:


   1. Are the tokens well distributed among the available nodes?
   2. Is the data correctly balanced on the token ring (i.e. are the 'id'
   values of the 'mykeyspace.data' table well spread between the nodes)?
   3. Are the compaction processes running smoothly on every node?


*Point 1* depends on whether you are using vnodes or not and on the number of
vnodes ('num_tokens' in cassandra.yaml); a quick way to check is shown after
the list below.

   - If not using vnodes, you have to manually set the positions of the
   nodes and move them around when adding more nodes so things remain balanced
   - If using vnodes, make sure to use a high enough number of vnodes so the
   distribution is 'good enough' (more than 32 in most cases; the default is 256,
   which leads to quite balanced rings but brings other issues)
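
A quick, rough way to check both the configured and the effective number of
tokens (the cassandra.yaml path is an assumption and may differ on your
installation):

# Configured number of vnodes (adjust the path to your install)
grep -E '^num_tokens' /etc/cassandra/cassandra.yaml

# Tokens actually owned by one node on the ring (repeat for each IP)
nodetool ring | grep -c '^10.1.1.161'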


> UN  10.1.1.161  398.39 GiB  256          28.9%
> UN  10.1.1.19   765.32 GiB  256          29.9%
> UN  10.1.1.52   574.24 GiB  256          28.2%
> UN  10.1.1.213  817.56 GiB  256          28.2%
> UN  10.1.1.85   638.82 GiB  256          28.2%
> UN  10.1.1.245  408.95 GiB  256          28.7%
> UN  10.1.1.185  574.63 GiB  256          27.9%


You can get the token ownership information by running 'nodetool status
<mykeyspace>'. Adding the keyspace name to the command gives you the real
ownership. Also, RF = 2 means the ownership percentages should add up to 200%,
ideally evenly balanced. I am not sure which command you ran here. Also, as a
piece of general advice, tell us the command you ran and what you expect us to
see in the output.
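
As a rough sanity check (a sketch only; the 'Owns' column position can shift
between versions, so adjust the awk field if needed):

# With RF = 2, the effective ownership across all 'UN' nodes should sum to ~200%
nodetool status mykeyspace | awk '/^UN/ {sum += $6} END {print sum "%"}'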

Still, the tokens seem to be well distributed, and I guess you are using
the default 'num_tokens': 256, so I believe you are not hitting this issue.
But the delta between the data held on each node is up to 2x (400 GB on
some nodes, 800 GB on others).

*Point 2* highly depends on the workload. Are your partitions evenly
distributed among the nodes? It depends on your primary key. Using a UUID
as the partition key is often a good idea, but it depends on your needs as
well, of course. You could look at the distribution on the distinct nodes
with 'nodetool cfstats'.
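
For example, something like this on each node gives a quick view of how
'mykeyspace.data' is spread (the grep patterns follow the usual cfstats output
and may need adjusting for your version):

nodetool cfstats mykeyspace.data -H | grep -E 'Space used \(live\)|Number of partitions|Compacted partition maximum'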

*Point 3*: even if the tokens are perfectly distributed and the primary
key perfectly randomized, a node can have a disk issue or some other
reason for its compactions falling behind. This would lead that node
to hold more data and, in some cases, to not evict tombstones properly,
increasing the disk space used. Other than that, you can have a big SSTable
being compacted on a node, making the size of the node grow quite
suddenly (that's why 20 to 50% of the disk should always be kept free, depending
on the compaction strategy in use and the number of concurrent
compactions). Here, running 'nodetool compactionstats -H' on all the nodes
would probably help you troubleshoot.
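
A small sketch to gather that from every node at once, assuming you can ssh
from one box to the others (running 'nodetool compactionstats -H' locally on
each node works just as well):

for ip in 10.1.1.161 10.1.1.19 10.1.1.52 10.1.1.213 10.1.1.85 10.1.1.245 10.1.1.185; do
  echo "== $ip =="
  # Pending compactions should stay close to 0 on a healthy node
  ssh "$ip" nodetool compactionstats -H | grep 'pending tasks'
done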

*About crontab*


> 08 05   * * *   root    nodetool repair -pr
> 11 11   * * *   root    fstrim -a
> 04 12   * * *   root    nodetool clearsnapshot
> 33 13   * * 2   root    nodetool cleanup
> 35 15   * * *   root    nodetool garbagecollect
> 46 19   * * *   root    nodetool clearsnapshot
> 50 23   * * *   root    nodetool flush
>

I don't understand what you are trying to achieve with some of the commands:

nodetool repair -pr


Repairing the cluster regularly is good in most cases, but as the defaults
change between versions, I would explicitly specify whether the repair is
supposed to be 'incremental' or 'full', and whether it is supposed to be
'sequential' or 'parallel', for example (a sample command is shown below,
after the links). Also, as the dataset grows, some issues will appear with
repairs. Just search for 'cassandra repair' on Google or whichever search
engine you use and you will see that repair is a complex topic. Look for
videos and you will find a lot of information about it, from nice talks
like these 2 from the last summit:

https://www.youtube.com/watch?v=FrF8wQuXXks
https://www.youtube.com/watch?v=1Sz_K8UID6E
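
As an illustration only (flag names vary a bit between versions, so check
'nodetool help repair' on your nodes), an explicit full, sequential,
primary-range repair of the keyspace could look like:

nodetool repair -full -seq -pr mykeyspace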

Also some nice tools exist to help with repairs:

The Reaper (originally made at Spotify, now maintained by The Last Pickle):
https://github.com/thelastpickle/cassandra-reaper
'cassandra_range_repair.py':
https://github.com/BrianGallew/cassandra_range_repair

11 11   * * *   root    fstrim -a


I am not really sure about this one, but as long as 'fstrim' does not create
performance issues while it is running, it seems fine.

04 12   * * *   root    nodetool clearsnapshot


This will automatically erase any snapshot you might want to keep. It would
be better to specify which snapshot you want to remove, by name. Some
snapshots are created and not removed when using a sequential repair, so I
believe clearing specific snapshots is a good idea to save disk space.
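
A minimal sketch, assuming 'my_backup' is a tag you created yourself (it is
only an example name) and that your version supports the '-t' flag (check
'nodetool help clearsnapshot'):

# See which snapshots currently exist and how much space they use
nodetool listsnapshots

# Clear only a specific snapshot by its tag, instead of wiping them all
nodetool clearsnapshot -t my_backup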

33 13   * * 2   root    nodetool cleanup


This is to be run on all the nodes after adding a new node. It will just
remove data from the existing nodes that 'gave' some token ranges to the new
node. To do so it will compact all the SSTables. It doesn't seem to be a
good idea to 'cron' that.

35 15   * * *   root    nodetool garbagecollect


This is also a heavy operation that you should not need on a regular
basis:
http://cassandra.apache.org/doc/latest/tools/nodetool/garbagecollect.html.
What problem are you trying to solve here? Your data uses TTLs and TWCS, so
expired SSTables should be going away without any issue.

46 19   * * *   root    nodetool clearsnapshot


Again? What for?

50 23   * * *   root    nodetool flush


This will force all the tables to be flushed at the same time, no matter their
sizes or any other considerations. It is not to be used unless you are doing
some testing or debugging, or you are on your way to shutting down the node.

C*heers,
-----------------------
Alain Rodriguez - @arodream - al...@thelastpickle.com
France

The Last Pickle - Apache Cassandra Consulting
http://www.thelastpickle.com

2017-05-04 11:38 GMT+01:00 Cogumelos Maravilha <cogumelosmaravi...@sapo.pt>:

> Hi all,
>
> I'm using C* 3.10.
>
> CREATE KEYSPACE mykeyspace WITH replication = {'class':
> 'SimpleStrategy', 'replication_factor': '2'}  AND durable_writes = false;
>
> CREATE TABLE mykeyspace.data (
>     id bigint PRIMARY KEY,
>     kafka text
> ) WITH bloom_filter_fp_chance = 0.5
>     AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
>     AND comment = ''
>     AND compaction = {'class':
> 'org.apache.cassandra.db.compaction.TimeWindowCompactionStrategy',
> 'compaction_window_size': '10', 'compaction_window_unit': 'HOURS',
> 'max_threshold': '32', 'min_threshold': '6'}
>     AND compression = {'chunk_length_in_kb': '64', 'class':
> 'org.apache.cassandra.io.compress.LZ4Compressor'}
>     AND crc_check_chance = 0.0
>     AND dclocal_read_repair_chance = 0.1
>     AND default_time_to_live = 1555200
>     AND gc_grace_seconds = 10800
>     AND max_index_interval = 2048
>     AND memtable_flush_period_in_ms = 0
>     AND min_index_interval = 128
>     AND read_repair_chance = 0.0
>     AND speculative_retry = '99PERCENTILE';
>
> UN  10.1.1.161  398.39 GiB  256          28.9%
> UN  10.1.1.19   765.32 GiB  256          29.9%
> UN  10.1.1.52   574.24 GiB  256          28.2%
> UN  10.1.1.213  817.56 GiB  256          28.2%
> UN  10.1.1.85   638.82 GiB  256          28.2%
> UN  10.1.1.245  408.95 GiB  256          28.7%
> UN  10.1.1.185  574.63 GiB  256          27.9%
>
> At crontab in all nodes (only changes the time):
>
> 08 05   * * *   root    nodetool repair -pr
> 11 11   * * *   root    fstrim -a
> 04 12   * * *   root    nodetool clearsnapshot
> 33 13   * * 2   root    nodetool cleanup
> 35 15   * * *   root    nodetool garbagecollect
> 46 19   * * *   root    nodetool clearsnapshot
> 50 23   * * *   root    nodetool flush
>
> How can I fix this?
>
> Thanks in advance.
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
> For additional commands, e-mail: user-h...@cassandra.apache.org
>
>
