Hi,

Regarding the documentation, I already knew:

- http://thelastpickle.com/blog/2016/07/27/about-deletes-and-tombstones.html
(From myself, how to handle tombstones)
- http://thelastpickle.com/blog/2016/12/08/TWCS-part1.html (From
Alexander, a coworker @TLP - TWCS and expiring tables)

Anyway, fantastic docs. I desperately need to free up disk space.
nodetool repair can do an anticompaction. In my case C* is only used
to insert data that expires with a TTL of 18 days. No updates or
deletes, only some selects using the partition key. gc_grace is set
to 3 hours. Best practices to free up disk space, please?

Thanks in advance.

On 05/05/2017 03:09 PM, Alain RODRIGUEZ wrote:
> Hi,
>
>     but it's so easy to add nodes
>
> Apache Cassandra has some kind of magic pieces ;-). Sometimes it is
> dark magic though :p. Yet adding a node is indeed not harder when
> using NetworkTopologyStrategy, as Jon mentioned above, once the
> initial configuration is done properly.
>
>     Number of keys (estimate): 442779640
>     Number of keys (estimate): 736380940
>     Number of keys (estimate): 451097313
>
> This is indeed possible, and most certainly creating imbalances. But
> also look at the partition sizes when using 'nodetool cfstats':
> combining the key estimates above with 'Compacted partition mean
> bytes' should give you an idea of how imbalanced the disk space used
> is. If you would like more details on the partition size
> distribution, partition size percentiles are available through
> 'nodetool cfhistograms'.
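> For instance, something along these lines on each node (keyspace and
> table names taken from your schema, the grep filter is just an
> illustration):
>
>     nodetool cfstats mykeyspace.data | grep -E 'Number of keys|Compacted partition'
>     nodetool cfhistograms mykeyspace data
>
> The first command shows the key count estimate and the min/mean/max
> compacted partition sizes, the second one the partition size
> percentiles.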
>
> Regarding the global load (CPU, GC, disk IO, etc.), it also depends
> on the workload (i.e. what partitions are being read).
>
>     *Should I use nodetool rebuild?*
>
> No, I see no reason to. This command, 'nodetool rebuild', is meant
> to be used when adding a new datacenter to the cluster. Which, by
> the way, will not happen as long as you are using the
> 'SimpleStrategy', which basically creates one big ring and considers
> all the nodes as being part of it, no matter their placement in the
> network, if I remember correctly.
>
>     The nodetool repair by default in this C* version is incremental
>     and since the repair is run in all nodes in different hours
>
> Incremental repairs are quite new to me. But I have heard they bring
> some issues, often due to anti-compactions inducing a high number of
> SSTables and a growing number of pending compactions. It does not
> look bad in your case, though.
>
> Yet the '-pr' option should not be used when doing incremental
> repairs. This thread mentions it and is probably worth reading:
> https://groups.google.com/forum/#!topic/nosql-databases/peTArLfhXMU.
> Also, I believe it is mentioned in the video about repairs from
> Alexander that I shared in my last mail.
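> If you decide to move back to full repairs instead, something along
> these lines could be a starting point (an illustration only, to
> adapt to your cluster):
>
>     nodetool repair -full -pr mykeyspace
>
> With full repairs, '-pr' is fine as long as every node runs it, so
> that all primary ranges get covered.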
>
>     and I don't want snapshots that's why I'm cleaning twice a day
>     (not sure that with -pr a snapshot is created).
>
> So the option involving snapshots is not '-pr', but '-seq'
> (sequential) or '-par' (parallel). More info:
> https://docs.datastax.com/en/cassandra/3.0/cassandra/tools/toolsRepair.html
>
> If you want to keep using sequential repairs, then you could check
> the snapshots' automatically generated names and aim at deleting
> those specifically, to avoid removing another manually created, and
> possibly important, snapshot.
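> For example, something like this on each node (the tag is a
> placeholder, use whatever 'nodetool listsnapshots' reports):
>
>     nodetool listsnapshots
>     nodetool clearsnapshot -t <snapshot_tag>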
>
>     I'm using garbagecollect to force the cleanup since I'm running
>     out of space.
>
> Oh, that's a whole topic. These blog posts should hopefully be
> helpful:
>
> - http://thelastpickle.com/blog/2016/07/27/about-deletes-and-tombstones.html
> (From myself, how to handle tombstones)
> - http://thelastpickle.com/blog/2016/12/08/TWCS-part1.html (From
> Alexander, a coworker @TLP - TWCS and expiring tables)
>
> Hopefully some information picked from those 2 blog posts will help
> you free some disk space.
>
> It is probably not needed to use 'garbagecollect' as a routine
> operation. Some tuning of the compaction strategy or its options
> (you are currently using the defaults) might be enough to solve the
> issue.
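> As an illustration only, and to be tested before touching
> production: enabling single-SSTable tombstone compactions sometimes
> helps TWCS drop expired data earlier, keeping your current window
> settings:
>
>     ALTER TABLE mykeyspace.data WITH compaction = {
>         'class': 'TimeWindowCompactionStrategy',
>         'compaction_window_size': '10',
>         'compaction_window_unit': 'HOURS',
>         'min_threshold': '6',
>         'max_threshold': '32',
>         'unchecked_tombstone_compaction': 'true'};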
>
> Yet the data is not correctly distributed, and something in the data
> model design is inducing it. The hashed primary key is what is used
> to assign data to a specific node. A variable partition size can
> also lead to hotspots.
>
> As a side note, I strongly believe that understanding internals is
> very important to operate Apache Cassandra correctly; running it by
> trial and error can put you in some undesirable situations. That's
> why I keep mentioning blog posts, talks and documentation that I
> think could help you get to know Apache Cassandra internals and
> processes a bit more.
>
> C*heers,
> -----------------------
> Alain Rodriguez - @arodream - al...@thelastpickle.com
> France
>
> The Last Pickle - Apache Cassandra Consulting
> http://www.thelastpickle.com
>
> 2017-05-04 17:10 GMT+01:00 Jon Haddad <jonathan.had...@gmail.com>:
>
> Adding nodes with NTS is easier, in my opinion. You don't need to
> worry about replica placement, if you do it right.
>
>> On May 4, 2017, at 7:43 AM, Cogumelos Maravilha
>> <cogumelosmaravi...@sapo.pt> wrote:
>>
>> Hi Alain, thanks for your quick reply.
>>
>> Regarding SimpleStrategy, perhaps you are right, but it's so easy
>> to add nodes.
>>
>> I'm using vnodes with the default 256. The information I posted is
>> from a regular 'nodetool status keyspace'.
>>
>> My partition key is a sequential bigint, but 'nodetool cfstats'
>> shows that the number of keys is not balanced (data from 3 nodes):
>>
>> Number of keys (estimate): 442779640
>> Number of keys (estimate): 736380940
>> Number of keys (estimate): 451097313
>>
>> *Should I use nodetool rebuild?*
>>
>> Running:
>>
>> nodetool getendpoints mykeyspace data 9213395123941039285
>> 10.1.1.52
>> 10.1.1.185
>>
>> nodetool getendpoints mykeyspace data 9213395123941039286
>> 10.1.1.161
>> 10.1.1.19
>>
>> All nodes are working hard because my TTL is 18 days and daily data
>> ingestion is around 120,000,000 records:
>>
>> nodetool compactionstats -H
>> pending tasks: 3
>> - mykeyspace.data: 3
>>
>> id                                   compaction type      keyspace    table  completed   total       unit   progress
>> c49599b1-308d-11e7-ba5b-67e232f1bee1 Remove deleted data  mykeyspace  data   133.89 GiB  158.33 GiB  bytes  84.56%
>> c49599b0-308d-11e7-ba5b-67e232f1bee1 Remove deleted data  mykeyspace  data   136.2 GiB   278.96 GiB  bytes  48.83%
>>
>> Active compaction remaining time : 0h00m00s
>>
>> nodetool compactionstats -H
>> pending tasks: 2
>> - mykeyspace.data: 2
>>
>> id                                   compaction type  keyspace    table  completed  total       unit   progress
>> b6e8ce80-30d4-11e7-a2be-9b830f114108 Compaction       mykeyspace  data   4.05 GiB   133.02 GiB  bytes  3.04%
>>
>> Active compaction remaining time : 2h17m34s
>>
>> nodetool repair in this C* version is incremental by default, and
>> the repair runs on all the nodes at different hours. I don't want
>> snapshots, that's why I'm cleaning twice a day (not sure that a
>> snapshot is created with -pr).
>>
>> The cleanup entry was there because the last node was created a few
>> days ago; it has already been removed.
>>
>> I'm using garbagecollect to force the cleanup since I'm running out
>> of space.
>>
>> Regards.
>>
>> On 05/04/2017 12:50 PM, Alain RODRIGUEZ wrote:
>>> Hi,
>>>
>>>     CREATE KEYSPACE mykeyspace WITH replication = {'class':
>>>     'SimpleStrategy', 'replication_factor': '2'} AND
>>>     durable_writes = false;
>>>
>>> The SimpleStrategy is never recommended for production clusters
>>> as it does not recognise racks or datacenters, inducing possible
>>> availability issues and unpredictable latency when using those. I
>>> would not even use it for testing purposes; I see no point in
>>> most cases.
>>>
>>> Even if this should be changed, carefully but as soon as possible
>>> imho, it is probably not related to your main issue at hand.
>>>
>>> If nodes are imbalanced, there are 3 main questions that come to
>>> my mind:
>>>
>>> 1. Are the tokens well distributed among the available nodes?
>>> 2. Is the data correctly balanced on the token ring (i.e. are the
>>>    'id' values of the 'mykeyspace.data' table well spread between
>>>    the nodes)?
>>> 3. Are the compaction processes running smoothly on every node?
>>>
>>> *Point 1* depends on whether you are using vnodes or not, and on
>>> the number of vnodes ('num_tokens' in cassandra.yaml).
>>>
>>> * If not using vnodes, you have to manually set the positions of
>>>   the nodes and move them around when adding more nodes, so
>>>   things remain balanced.
>>> * If using vnodes, make sure to use a high enough number of
>>>   vnodes so distribution is 'good enough' (more than 32 in most
>>>   cases; the default is 256, which leads to quite balanced rings
>>>   but brings other issues).
>>>
>>>     UN 10.1.1.161 398.39 GiB 256 28.9%
>>>     UN 10.1.1.19  765.32 GiB 256 29.9%
>>>     UN 10.1.1.52  574.24 GiB 256 28.2%
>>>     UN 10.1.1.213 817.56 GiB 256 28.2%
>>>     UN 10.1.1.85  638.82 GiB 256 28.2%
>>>     UN 10.1.1.245 408.95 GiB 256 28.7%
>>>     UN 10.1.1.185 574.63 GiB 256 27.9%
>>>
>>> You can get the token ownership information by running 'nodetool
>>> status <mykeyspace>'. Adding the keyspace name to the command
>>> gives you the real ownership. Also, RF = 2 means the total
>>> ownership should be 200%, ideally evenly balanced. I am not sure
>>> about the command you ran here. As a piece of general advice,
>>> give us the command you ran and what you expect us to see in the
>>> output.
>>>
>>> Still, the tokens seem to be well distributed, and I guess you
>>> are using the default 'num_tokens': 256. So I believe you are not
>>> having this issue. But the delta between the data held on each
>>> node is up to 2x (400 GB on some nodes, 800 GB on some others).
>>>
>>> *Point 2* highly depends on the workload. Are your partitions
>>> evenly distributed among the nodes? It depends on your primary
>>> key. Using a UUID as the partition key is often a good idea, but
>>> it depends on your needs as well, of course. You could look at
>>> the distribution on the distinct nodes through 'nodetool
>>> cfstats'.
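>>> As an illustration, a sequential bigint partition key is still
>>> hashed (with Murmur3Partitioner), so consecutive ids land on
>>> unrelated tokens and nodes. Something like this shows the token a
>>> given id maps to (id value taken from your earlier getendpoints
>>> example):
>>>
>>>     SELECT id, token(id) FROM mykeyspace.data WHERE id = 9213395123941039285;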
>>>
>>> *Point 3*: even if the tokens are perfectly distributed and the
>>> primary key perfectly randomized, a node can have a disk issue or
>>> any other reason for compactions to fall behind. This would lead
>>> this node to hold more data and, in some cases, not evict
>>> tombstones properly, increasing the disk space used. Other than
>>> that, a big SSTable being compacted on a node can make the node's
>>> size grow quite suddenly (that's why 20 to 50% of the disk should
>>> always be kept free, depending on the compaction strategy in use
>>> and the number of concurrent compactions). Here, running
>>> 'nodetool compactionstats -H' on all the nodes would probably
>>> help you troubleshoot.
>>>
>>> *About crontab*
>>>
>>>     08 05 * * * root nodetool repair -pr
>>>     11 11 * * * root fstrim -a
>>>     04 12 * * * root nodetool clearsnapshot
>>>     33 13 * * 2 root nodetool cleanup
>>>     35 15 * * * root nodetool garbagecollect
>>>     46 19 * * * root nodetool clearsnapshot
>>>     50 23 * * * root nodetool flush
>>>
>>> I don't understand what you are trying to achieve with some of
>>> these commands:
>>>
>>>     nodetool repair -pr
>>>
>>> Repairing the cluster regularly is good in most cases, but as
>>> defaults change between versions, I would specify whether the
>>> repair is supposed to be 'incremental' or 'full', and whether it
>>> should be 'sequential' or 'parallel', for example. Also, as the
>>> dataset grows, some issues will appear with repairs. Just search
>>> for 'cassandra repairs' on Google or any search engine you are
>>> using and you will see that repair is a complex topic. Look for
>>> videos and you will find a lot of information about it from nice
>>> talks like these 2 from the last summit:
>>>
>>> https://www.youtube.com/watch?v=FrF8wQuXXks
>>> https://www.youtube.com/watch?v=1Sz_K8UID6E
>>>
>>> Also some nice tools exist to help with repairs:
>>>
>>> - The Reaper (originally made at Spotify, now maintained by The
>>>   Last Pickle): https://github.com/thelastpickle/cassandra-reaper
>>> - 'cassandra_range_repair.py':
>>>   https://github.com/BrianGallew/cassandra_range_repair
>>>
>>>     11 11 * * * root fstrim -a
>>>
>>> I am not really sure about this one, but as long as 'fstrim' does
>>> not create performance issues while it is running, it seems fine.
>>>
>>>     04 12 * * * root nodetool clearsnapshot
>>>
>>> This will automatically erase any snapshot you might want to
>>> keep. It might be good to specify which snapshot you want to
>>> remove, by name. Some snapshots will be created and not removed
>>> when using a sequential repair, so I believe clearing specific
>>> snapshots is a good idea to save disk space.
>>>
>>>     33 13 * * 2 root nodetool cleanup
>>>
>>> This is to be run on all the nodes after adding a new node. It
>>> just removes data from the existing nodes that 'gave' some token
>>> ranges to the new node. To do so, it compacts all the SSTables.
>>> It doesn't seem to be a good idea to 'cron' that.
>>>
>>>     35 15 * * * root nodetool garbagecollect
>>>
>>> This is also a heavy operation that you should not need on a
>>> regular basis:
>>> http://cassandra.apache.org/doc/latest/tools/nodetool/garbagecollect.html.
>>> What problem are you trying to solve here? Your data uses TTLs
>>> and TWCS, so expired SSTables should be going away without any
>>> issue.
>>>
>>>     46 19 * * * root nodetool clearsnapshot
>>>
>>> Again? What for?
>>>
>>>     50 23 * * * root nodetool flush
>>>
>>> This will force all memtables to be flushed at the same time, no
>>> matter their sizes or any other consideration. It is not to be
>>> used unless you are doing some testing, debugging, or are on your
>>> way to shutting down the node.
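>>> Put together, a leaner crontab could look something like this
>>> (just a sketch: times, repair options and the snapshot tag are
>>> illustrative, and cleanup/garbagecollect would be run manually,
>>> only when actually needed):
>>>
>>>     08 05 * * * root nodetool repair -full -pr mykeyspace
>>>     11 11 * * * root fstrim -a
>>>     04 12 * * * root nodetool clearsnapshot -t <repair_snapshot_tag>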
>>>
>>> C*heers,
>>> -----------------------
>>> Alain Rodriguez - @arodream - al...@thelastpickle.com
>>> France
>>>
>>> The Last Pickle - Apache Cassandra Consulting
>>> http://www.thelastpickle.com
>>>
>>> 2017-05-04 11:38 GMT+01:00 Cogumelos Maravilha
>>> <cogumelosmaravi...@sapo.pt>:
>>>
>>> Hi all,
>>>
>>> I'm using C* 3.10.
>>>
>>> CREATE KEYSPACE mykeyspace WITH replication = {'class':
>>> 'SimpleStrategy', 'replication_factor': '2'} AND
>>> durable_writes = false;
>>>
>>> CREATE TABLE mykeyspace.data (
>>>     id bigint PRIMARY KEY,
>>>     kafka text
>>> ) WITH bloom_filter_fp_chance = 0.5
>>>     AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
>>>     AND comment = ''
>>>     AND compaction = {'class':
>>> 'org.apache.cassandra.db.compaction.TimeWindowCompactionStrategy',
>>> 'compaction_window_size': '10', 'compaction_window_unit': 'HOURS',
>>> 'max_threshold': '32', 'min_threshold': '6'}
>>>     AND compression = {'chunk_length_in_kb': '64', 'class':
>>> 'org.apache.cassandra.io.compress.LZ4Compressor'}
>>>     AND crc_check_chance = 0.0
>>>     AND dclocal_read_repair_chance = 0.1
>>>     AND default_time_to_live = 1555200
>>>     AND gc_grace_seconds = 10800
>>>     AND max_index_interval = 2048
>>>     AND memtable_flush_period_in_ms = 0
>>>     AND min_index_interval = 128
>>>     AND read_repair_chance = 0.0
>>>     AND speculative_retry = '99PERCENTILE';
>>>
>>> UN 10.1.1.161 398.39 GiB 256 28.9%
>>> UN 10.1.1.19  765.32 GiB 256 29.9%
>>> UN 10.1.1.52  574.24 GiB 256 28.2%
>>> UN 10.1.1.213 817.56 GiB 256 28.2%
>>> UN 10.1.1.85  638.82 GiB 256 28.2%
>>> UN 10.1.1.245 408.95 GiB 256 28.7%
>>> UN 10.1.1.185 574.63 GiB 256 27.9%
>>>
>>> At crontab on all nodes (only the time changes):
>>>
>>> 08 05 * * * root nodetool repair -pr
>>> 11 11 * * * root fstrim -a
>>> 04 12 * * * root nodetool clearsnapshot
>>> 33 13 * * 2 root nodetool cleanup
>>> 35 15 * * * root nodetool garbagecollect
>>> 46 19 * * * root nodetool clearsnapshot
>>> 50 23 * * * root nodetool flush
>>>
>>> How can I fix this?
>>>
>>> Thanks in advance.
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
>>> For additional commands, e-mail: user-h...@cassandra.apache.org