Hi,

Regarding the documentation, I already knew:

- http://thelastpickle.com/blog/2016/07/27/about-deletes-and-tombstones.html
(From myself, how to handle tombstones)
- http://thelastpickle.com/blog/2016/12/08/TWCS-part1.html (From
Alexander, a coworker @TLP - TWCS and expiring tables)

Anyway, fantastic docs. I desperately need to free up disk space.
nodetool repair can do an anticompaction. In my case C* is only used
to insert data that expires with a TTL of 18 days. No updates or
deletes, only some selects using the partition key. gc_grace is set
to 3 hours. Best practices to free up disk space, please?

Thanks in advance.

On 05/05/2017 03:09 PM, Alain RODRIGUEZ wrote:
> Hi,
>
>     but it's so easy to add nodes
>
> Apache Cassandra has some kind of magic pieces ;-). Sometimes it is
> dark magic though :p. Yet adding a node is indeed not harder when
> using NetworkTopologyStrategy, as Jon mentioned above, once the
> initial configuration is done properly.
>
>     Number of keys (estimate): 442779640
>     Number of keys (estimate): 736380940
>     Number of keys (estimate): 451097313
>
> This is indeed possible, and most certainly creating imbalances. But
> also look at the partition sizes when using 'nodetool cfstats':
> combining the key estimates above with 'Compacted partition mean
> bytes' should give you an idea of how imbalanced the disk space used
> is. If you would like more details on the partition size
> distribution, partition size percentiles are available through
> 'nodetool cfhistograms'.
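> For instance, something along these lines on each node (keyspace and
> table names taken from your schema, the grep filter is just an
> illustration):
>
>     nodetool cfstats mykeyspace.data | grep -E 'Number of keys|Compacted partition'
>     nodetool cfhistograms mykeyspace data
>
> The first command shows the key count estimate and the min/mean/max
> compacted partition sizes, the second one the partition size
> percentiles.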
>
> Regarding the global load (CPU, GC, disk IO, etc.), it also depends
> on the workload (i.e. what partitions are being read).
>
>     *Should I use nodetool rebuild?*
>
> No, I see no reason to. This command, 'nodetool rebuild', is meant
> to be used when adding a new datacenter to the cluster. Which, by
> the way, will not happen as long as you are using the
> 'SimpleStrategy', which basically creates one big ring and considers
> all the nodes as being part of it, no matter their placement in the
> network, if I remember correctly.
>
>     The nodetool repair by default in this C* version is incremental
>     and since the repair is run in all nodes in different hours
>
> Incremental repairs are quite new to me. But I have heard they bring
> some issues, often due to anti-compactions inducing a high number of
> SSTables and a growing number of pending compactions. It does not
> look bad in your case, though.
>
> Yet the '-pr' option should not be used when doing incremental
> repairs. This thread mentions it and is probably worth reading:
> https://groups.google.com/forum/#!topic/nosql-databases/peTArLfhXMU.
> Also, I believe it is mentioned in the video about repairs from
> Alexander that I shared in my last mail.
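> If you decide to move back to full repairs instead, something along
> these lines could be a starting point (an illustration only, to
> adapt to your cluster):
>
>     nodetool repair -full -pr mykeyspace
>
> With full repairs, '-pr' is fine as long as every node runs it, so
> that all primary ranges get covered.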
>
>     and I don't want snapshots that's why I'm cleaning twice a day
>     (not sure that with -pr a snapshot is created).
>
> So the option involving snapshots is not '-pr', but '-seq'
> (sequential) or '-par' (parallel). More info:
> https://docs.datastax.com/en/cassandra/3.0/cassandra/tools/toolsRepair.html
>
> If you want to keep using sequential repairs, then you could check
> the snapshots' automatically generated names and aim at deleting
> those specifically, to avoid removing another manually created, and
> possibly important, snapshot.
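> For example, something like this on each node (the tag is a
> placeholder, use whatever 'nodetool listsnapshots' reports):
>
>     nodetool listsnapshots
>     nodetool clearsnapshot -t <snapshot_tag>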
>
>     I'm using garbagecollect to force the cleanup since I'm running
>     out of space.
>
> Oh, that's a whole topic. These blog posts should hopefully be
> helpful:
>
> - http://thelastpickle.com/blog/2016/07/27/about-deletes-and-tombstones.html
> (From myself, how to handle tombstones)
> - http://thelastpickle.com/blog/2016/12/08/TWCS-part1.html (From
> Alexander, a coworker @TLP - TWCS and expiring tables)
>
> Hopefully some information picked from those 2 blog posts will help
> you free some disk space.
>
> It is probably not needed to use 'garbagecollect' as a routine
> operation. Some tuning of the compaction strategy or its options
> (you are currently using the defaults) might be enough to solve the
> issue.
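> As an illustration only, and to be tested before touching
> production: enabling single-SSTable tombstone compactions sometimes
> helps TWCS drop expired data earlier, keeping your current window
> settings:
>
>     ALTER TABLE mykeyspace.data WITH compaction = {
>         'class': 'TimeWindowCompactionStrategy',
>         'compaction_window_size': '10',
>         'compaction_window_unit': 'HOURS',
>         'min_threshold': '6',
>         'max_threshold': '32',
>         'unchecked_tombstone_compaction': 'true'};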
>
> Yet the data is not correctly distributed, and something in the data
> model design is inducing it. The hashed primary key is what is used
> to assign data to a specific node. A variable partition size can
> also lead to hotspots.
>
> As a side note, I strongly believe that understanding internals is
> very important to operate Apache Cassandra correctly; running it by
> trial and error can put you in some undesirable situations. That's
> why I keep mentioning blog posts, talks and documentation that I
> think could help you get to know Apache Cassandra internals and
> processes a bit more.
>
> C*heers,
> -----------------------
> Alain Rodriguez - @arodream - al...@thelastpickle.com
> France
>
> The Last Pickle - Apache Cassandra Consulting
> http://www.thelastpickle.com
>
> 2017-05-04 17:10 GMT+01:00 Jon Haddad <jonathan.had...@gmail.com>:
>
> Adding nodes with NTS is easier, in my opinion. You don't need to
> worry about replica placement, if you do it right.
>
>> On May 4, 2017, at 7:43 AM, Cogumelos Maravilha
>> <cogumelosmaravi...@sapo.pt> wrote:
>>
>> Hi Alain, thanks for your quick reply.
>>
>> Regarding SimpleStrategy, perhaps you are right, but it's so easy
>> to add nodes.
>>
>> I'm using vnodes with the default 256. The information I posted is
>> from a regular 'nodetool status keyspace'.
>>
>> My partition key is a sequential bigint, but 'nodetool cfstats'
>> shows that the number of keys is not balanced (data from 3 nodes):
>>
>> Number of keys (estimate): 442779640
>> Number of keys (estimate): 736380940
>> Number of keys (estimate): 451097313
>>
>> *Should I use nodetool rebuild?*
>>
>> Running:
>>
>> nodetool getendpoints mykeyspace data 9213395123941039285
>> 10.1.1.52
>> 10.1.1.185
>>
>> nodetool getendpoints mykeyspace data 9213395123941039286
>> 10.1.1.161
>> 10.1.1.19
>>
>> All nodes are working hard because my TTL is 18 days and daily data
>> ingestion is around 120,000,000 records:
>>
>> nodetool compactionstats -H
>> pending tasks: 3
>> - mykeyspace.data: 3
>>
>> id                                   compaction type      keyspace    table  completed   total       unit   progress
>> c49599b1-308d-11e7-ba5b-67e232f1bee1 Remove deleted data  mykeyspace  data   133.89 GiB  158.33 GiB  bytes  84.56%
>> c49599b0-308d-11e7-ba5b-67e232f1bee1 Remove deleted data  mykeyspace  data   136.2 GiB   278.96 GiB  bytes  48.83%
>>
>> Active compaction remaining time : 0h00m00s
>>
>> nodetool compactionstats -H
>> pending tasks: 2
>> - mykeyspace.data: 2
>>
>> id                                   compaction type  keyspace    table  completed  total       unit   progress
>> b6e8ce80-30d4-11e7-a2be-9b830f114108 Compaction       mykeyspace  data   4.05 GiB   133.02 GiB  bytes  3.04%
>>
>> Active compaction remaining time : 2h17m34s
>>
>> nodetool repair in this C* version is incremental by default, and
>> the repair runs on all the nodes at different hours. I don't want
>> snapshots, that's why I'm cleaning twice a day (not sure that a
>> snapshot is created with -pr).
>>
>> The cleanup entry was there because the last node was created a few
>> days ago; it has already been removed.
>>
>> I'm using garbagecollect to force the cleanup since I'm running out
>> of space.
>>
>> Regards.
>>
>> On 05/04/2017 12:50 PM, Alain RODRIGUEZ wrote:
>>> Hi,
>>>
>>>     CREATE KEYSPACE mykeyspace WITH replication = {'class':
>>>     'SimpleStrategy', 'replication_factor': '2'} AND
>>>     durable_writes = false;
>>>
>>> The SimpleStrategy is never recommended for production clusters
>>> as it does not recognise racks or datacenters, inducing possible
>>> availability issues and unpredictable latency when using those. I
>>> would not even use it for testing purposes; I see no point in
>>> most cases.
>>>
>>> Even if this should be changed, carefully but as soon as possible
>>> imho, it is probably not related to your main issue at hand.
>>>
>>> If nodes are imbalanced, there are 3 main questions that come to
>>> my mind:
>>>
>>> 1. Are the tokens well distributed among the available nodes?
>>> 2. Is the data correctly balanced on the token ring (i.e. are the
>>>    'id' values of the 'mykeyspace.data' table well spread between
>>>    the nodes)?
>>> 3. Are the compaction processes running smoothly on every node?
>>>
>>> *Point 1* depends on whether you are using vnodes or not, and on
>>> the number of vnodes ('num_tokens' in cassandra.yaml).
>>>
>>> * If not using vnodes, you have to manually set the positions of
>>>   the nodes and move them around when adding more nodes, so
>>>   things remain balanced.
>>> * If using vnodes, make sure to use a high enough number of
>>>   vnodes so distribution is 'good enough' (more than 32 in most
>>>   cases; the default is 256, which leads to quite balanced rings
>>>   but brings other issues).
>>>
>>>     UN 10.1.1.161 398.39 GiB 256 28.9%
>>>     UN 10.1.1.19  765.32 GiB 256 29.9%
>>>     UN 10.1.1.52  574.24 GiB 256 28.2%
>>>     UN 10.1.1.213 817.56 GiB 256 28.2%
>>>     UN 10.1.1.85  638.82 GiB 256 28.2%
>>>     UN 10.1.1.245 408.95 GiB 256 28.7%
>>>     UN 10.1.1.185 574.63 GiB 256 27.9%
>>>
>>> You can get the token ownership information by running 'nodetool
>>> status <mykeyspace>'. Adding the keyspace name to the command
>>> gives you the real ownership. Also, RF = 2 means the total
>>> ownership should be 200%, ideally evenly balanced. I am not sure
>>> about the command you ran here. As a piece of general advice,
>>> give us the command you ran and what you expect us to see in the
>>> output.
>>>
>>> Still, the tokens seem to be well distributed, and I guess you
>>> are using the default 'num_tokens': 256. So I believe you are not
>>> having this issue. But the delta between the data held on each
>>> node is up to 2x (400 GB on some nodes, 800 GB on some others).
>>>
>>> *Point 2* highly depends on the workload. Are your partitions
>>> evenly distributed among the nodes? It depends on your primary
>>> key. Using a UUID as the partition key is often a good idea, but
>>> it depends on your needs as well, of course. You could look at
>>> the distribution on the distinct nodes through 'nodetool
>>> cfstats'.
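>>> As an illustration, a sequential bigint partition key is still
>>> hashed (with Murmur3Partitioner), so consecutive ids land on
>>> unrelated tokens and nodes. Something like this shows the token a
>>> given id maps to (id value taken from your earlier getendpoints
>>> example):
>>>
>>>     SELECT id, token(id) FROM mykeyspace.data WHERE id = 9213395123941039285;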
>>>
>>> *Point 3*: even if the tokens are perfectly distributed and the
>>> primary key perfectly randomized, a node can have a disk issue or
>>> any other reason for compactions to fall behind. This would lead
>>> this node to hold more data and, in some cases, not evict
>>> tombstones properly, increasing the disk space used. Other than
>>> that, a big SSTable being compacted on a node can make the node's
>>> size grow quite suddenly (that's why 20 to 50% of the disk should
>>> always be kept free, depending on the compaction strategy in use
>>> and the number of concurrent compactions). Here, running
>>> 'nodetool compactionstats -H' on all the nodes would probably
>>> help you troubleshoot.
>>>
>>> *About crontab*
>>>
>>>     08 05 * * * root nodetool repair -pr
>>>     11 11 * * * root fstrim -a
>>>     04 12 * * * root nodetool clearsnapshot
>>>     33 13 * * 2 root nodetool cleanup
>>>     35 15 * * * root nodetool garbagecollect
>>>     46 19 * * * root nodetool clearsnapshot
>>>     50 23 * * * root nodetool flush
>>>
>>> I don't understand what you are trying to achieve with some of
>>> these commands:
>>>
>>>     nodetool repair -pr
>>>
>>> Repairing the cluster regularly is good in most cases, but as
>>> defaults change between versions, I would specify whether the
>>> repair is supposed to be 'incremental' or 'full', and whether it
>>> should be 'sequential' or 'parallel', for example. Also, as the
>>> dataset grows, some issues will appear with repairs. Just search
>>> for 'cassandra repairs' on Google or any search engine you are
>>> using and you will see that repair is a complex topic. Look for
>>> videos and you will find a lot of information about it from nice
>>> talks like these 2 from the last summit:
>>>
>>> https://www.youtube.com/watch?v=FrF8wQuXXks
>>> https://www.youtube.com/watch?v=1Sz_K8UID6E
>>>
>>> Also some nice tools exist to help with repairs:
>>>
>>> - The Reaper (originally made at Spotify, now maintained by The
>>>   Last Pickle): https://github.com/thelastpickle/cassandra-reaper
>>> - 'cassandra_range_repair.py':
>>>   https://github.com/BrianGallew/cassandra_range_repair
>>>
>>>     11 11 * * * root fstrim -a
>>>
>>> I am not really sure about this one, but as long as 'fstrim' does
>>> not create performance issues while it is running, it seems fine.
>>>
>>>     04 12 * * * root nodetool clearsnapshot
>>>
>>> This will automatically erase any snapshot you might want to
>>> keep. It might be good to specify which snapshot you want to
>>> remove, by name. Some snapshots will be created and not removed
>>> when using a sequential repair, so I believe clearing specific
>>> snapshots is a good idea to save disk space.
>>>
>>>     33 13 * * 2 root nodetool cleanup
>>>
>>> This is to be run on all the nodes after adding a new node. It
>>> just removes data from the existing nodes that 'gave' some token
>>> ranges to the new node. To do so, it compacts all the SSTables.
>>> It doesn't seem to be a good idea to 'cron' that.
>>>
>>>     35 15 * * * root nodetool garbagecollect
>>>
>>> This is also a heavy operation that you should not need on a
>>> regular basis:
>>> http://cassandra.apache.org/doc/latest/tools/nodetool/garbagecollect.html.
>>> What problem are you trying to solve here? Your data uses TTLs
>>> and TWCS, so expired SSTables should be going away without any
>>> issue.
>>>
>>>     46 19 * * * root nodetool clearsnapshot
>>>
>>> Again? What for?
>>>
>>>     50 23 * * * root nodetool flush
>>>
>>> This will force all memtables to be flushed at the same time, no
>>> matter their sizes or any other consideration. It is not to be
>>> used unless you are doing some testing, debugging, or are on your
>>> way to shutting down the node.
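>>> Put together, a leaner crontab could look something like this
>>> (just a sketch: times, repair options and the snapshot tag are
>>> illustrative, and cleanup/garbagecollect would be run manually,
>>> only when actually needed):
>>>
>>>     08 05 * * * root nodetool repair -full -pr mykeyspace
>>>     11 11 * * * root fstrim -a
>>>     04 12 * * * root nodetool clearsnapshot -t <repair_snapshot_tag>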
>>>
>>> C*heers,
>>> -----------------------
>>> Alain Rodriguez - @arodream - al...@thelastpickle.com
>>> France
>>>
>>> The Last Pickle - Apache Cassandra Consulting
>>> http://www.thelastpickle.com
>>>
>>> 2017-05-04 11:38 GMT+01:00 Cogumelos Maravilha
>>> <cogumelosmaravi...@sapo.pt>:
>>>
>>> Hi all,
>>>
>>> I'm using C* 3.10.
>>>
>>> CREATE KEYSPACE mykeyspace WITH replication = {'class':
>>> 'SimpleStrategy', 'replication_factor': '2'} AND
>>> durable_writes = false;
>>>
>>> CREATE TABLE mykeyspace.data (
>>>     id bigint PRIMARY KEY,
>>>     kafka text
>>> ) WITH bloom_filter_fp_chance = 0.5
>>>     AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
>>>     AND comment = ''
>>>     AND compaction = {'class':
>>> 'org.apache.cassandra.db.compaction.TimeWindowCompactionStrategy',
>>> 'compaction_window_size': '10', 'compaction_window_unit': 'HOURS',
>>> 'max_threshold': '32', 'min_threshold': '6'}
>>>     AND compression = {'chunk_length_in_kb': '64', 'class':
>>> 'org.apache.cassandra.io.compress.LZ4Compressor'}
>>>     AND crc_check_chance = 0.0
>>>     AND dclocal_read_repair_chance = 0.1
>>>     AND default_time_to_live = 1555200
>>>     AND gc_grace_seconds = 10800
>>>     AND max_index_interval = 2048
>>>     AND memtable_flush_period_in_ms = 0
>>>     AND min_index_interval = 128
>>>     AND read_repair_chance = 0.0
>>>     AND speculative_retry = '99PERCENTILE';
>>>
>>> UN 10.1.1.161 398.39 GiB 256 28.9%
>>> UN 10.1.1.19  765.32 GiB 256 29.9%
>>> UN 10.1.1.52  574.24 GiB 256 28.2%
>>> UN 10.1.1.213 817.56 GiB 256 28.2%
>>> UN 10.1.1.85  638.82 GiB 256 28.2%
>>> UN 10.1.1.245 408.95 GiB 256 28.7%
>>> UN 10.1.1.185 574.63 GiB 256 27.9%
>>>
>>> At crontab on all nodes (only the time changes):
>>>
>>> 08 05 * * * root nodetool repair -pr
>>> 11 11 * * * root fstrim -a
>>> 04 12 * * * root nodetool clearsnapshot
>>> 33 13 * * 2 root nodetool cleanup
>>> 35 15 * * * root nodetool garbagecollect
>>> 46 19 * * * root nodetool clearsnapshot
>>> 50 23 * * * root nodetool flush
>>>
>>> How can I fix this?
>>>
>>> Thanks in advance.
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
>>> For additional commands, e-mail: user-h...@cassandra.apache.org