Hi Jiayong,

You will need to reduce num_tokens on all existing nodes in the cluster in order to "fix" repair. Only adding new DCs with a lower num_tokens value is not going to solve the problem for you. In practice, there are two ways to reduce it on all existing nodes: either bootstrap a new DC and then completely decommission the old one, or repeatedly decommission and re-join existing nodes with a lower num_tokens value, as far as free disk space and CPU load allow, repeating this across the nodes in the same DC until every node has the desired num_tokens. In the end it's a trade-off between money (a lot of new servers) and time (possibly many months for a large cluster). By not addressing the issue, you are risking serious data consistency issues, reduced availability and fault tolerance, and, worse, data loss caused by disk failures (uncorrectable errors and random data corruption, not necessarily failures of the entire disk) on all replica nodes. Those issues can negatively affect your core business, and sometimes the cost to the business can be very high. It's better to do it while the size of the cluster is still manageable, because it will only get harder as the cluster grows along with the size of the data.
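
If you go down the decommission-and-rejoin route, the per-node cycle is roughly the sketch below. The data directory paths, service commands and target num_tokens value are assumptions you will need to adapt to your environment, and it's worth rehearsing the whole cycle on a test cluster first:

    # on the node being re-tokenised
    nodetool decommission            # streams its data out to the remaining nodes
    sudo service cassandra stop      # or however you manage the service
    # wipe the old state so the node bootstraps back in as a brand new node
    rm -rf /var/lib/cassandra/data/* /var/lib/cassandra/commitlog/* /var/lib/cassandra/saved_caches/*
    # edit cassandra.yaml: set e.g. num_tokens: 16
    sudo service cassandra start     # bootstraps with the new token count
    nodetool status                  # wait until the node shows UN before moving on

Do one node at a time, and make sure the rest of the DC has enough free disk space to absorb the decommissioned node's data before each cycle.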

To give you an idea of how entropy destroys your data availability and can even cause data loss, imagine a cluster of 3 nodes with RF=3 (i.e. each node has 100% of the data) and an application reading and writing at CL=QUORUM (or LOCAL_QUORUM). The cluster is supposed to have strong data consistency and very good availability. Now, imagine that after years of operation, for whatever reason, the application starts to read some "cold data" that has never been read since it was written many years ago. You may soon discover that the cold SSTables on one replica are still readable but thoroughly corrupted by errors from a malfunctioning disk, and another replica is giving you UNCs (uncorrectable errors) at random locations in its copy of the data, which means the CL=QUORUM queries will fail. Yet that's not the worst of it. If repair was never run on this cluster, the third and only remaining replica may hold an outdated, incorrect copy of the data, or it may not have the data at all because it was down when the data was written and the hints were lost too (timed out, or manually removed by you). At that point you are looking at real, unrecoverable data loss in the cluster, and you will be left relying entirely on the existence and correctness of a backup. You have been periodically backing up the data, right? You have tested the restore procedure recently, right? You are confident that the data in the backup was taken from the only correct replica before the UNCs occurred, not from the corrupted or outdated replicas, right? Good luck answering yes to all three of the above.

In short, fix your cluster, and start running repairs periodically as soon as you can. Repair doesn't just fix data inconsistencies, it also defends against silent data corruption and UNCs.
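
To put "periodically" into concrete terms: the usual rule of thumb is to repair every node's primary ranges within gc_grace_seconds (10 days by default), which Reaper can schedule for you once repair is workable again. As a minimal manual sketch (the keyspace name is just a placeholder):

    # run on each node in turn, covering the whole ring within gc_grace_seconds
    nodetool repair -pr --full my_keyspace

With -pr each node repairs only its primary token ranges, so every range is repaired exactly once per round across the cluster.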


Cheers,

Bowen

On 16/08/2021 18:20, Jiayong Sun wrote:
Hi Bowen,

> "how many tables are being written frequently" - there are normally about less then 10 tables being written concurrently. Regarding the "num_token", unfortunately we can change it now since it would require rebuilding all rings/nodes of the cluster. Actually we used to add a ring with num_token: 4 in this cluster but later remove it due to some other issue. We have started using num_token: 16 as standard for any new clusters.

Thanks,
Jiayong Sun
On Monday, August 16, 2021, 02:46:24 AM PDT, Bowen Song <bo...@bso.ng> wrote:


Hello Jiayong,


> There is only one major table taking 90% of writes.

In your case, what matters isn't the 90% of writes, but the remaining 10%. If the remaining 10% of writes are spread over 100 different tables, your Cassandra node will end up flushing over 600 times per minute. Therefore, "how many tables are being written frequently" is still a valid question, even if the vast majority of the writes go to just one table.


> Currently we don't set the "commitlog_total_space_in_mb" value, so it should be using the default of 8192 MB (8 GB). What do you suggest for this parameter's value?

For the sake of saving time spent on troubleshooting, increase commitlog_total_space_in_mb as much as your free disk space allows, and definitely to no less than 32GB (that's only ~100 minutes' worth of commit log). If your free disk space is very large, setting it to ~230GB (~12 hours' worth of commit log) is a good starting point.
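
For reference, those figures come straight from your own numbers: ~10 segments of 32 MB per minute is roughly 320 MB of commit log per minute, so 32 GB lasts about 100 minutes and ~230 GB about 12 hours. A sketch of the corresponding cassandra.yaml settings, carrying over the 1 GB segment size and 16 MB mutation cap discussed earlier in this thread (treat the exact numbers as assumptions to validate on one node first):

    commitlog_segment_size_in_mb: 1024   # larger segments, far fewer recycles
    commitlog_total_space_in_mb: 235520  # ~230 GB, ~12 hours at ~320 MB/min
    max_mutation_size_in_kb: 16384       # keep the old 16 MB cap instead of half of 1 GB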


> Since our Reaper repair is not working effectively right now, we still need to rely on the hints to replay writes across nodes and DCs. Do you have any suggestion for how old hints could be safely removed without impacting data consistency? I think this question may depend on many factors, but I was wondering if there is any kind of rule of thumb?

Unfortunately, there is no safe way to remove not-yet-delivered hints without impacting data consistency if you don't run repair periodically. Cassandra removes delivered hints automatically, so if you are removing hints manually, you are only ever removing not-yet-delivered ones. My recommendation is to reduce num_tokens on all nodes in order to "fix" the repair. This, of course, will require a lot of planning, and the actual process will involve moving data between servers quite a few times (or a large number of new servers, depending on your plan).


Cheers,

Bowen

On 16/08/2021 05:48, Jiayong Sun wrote:
Hi Bowen,

There is only one major table taking 90% of writes.
I will try to increase the "commitlog_segment_size_in_mb" value to 1 GB and set "max_mutation_size_in_kb" to 16 MB. Currently we don't set the "commitlog_total_space_in_mb" value, so it should be using the default of 8192 MB (8 GB). What do you suggest for this parameter's value?
By the way we don't back up the commit log.
Since our Reaper repair is not working effectively right now, we still need to rely on the hints to replay writes across nodes and DCs. Do you have any suggestion for how old hints could be safely removed without impacting data consistency? I think this question may depend on many factors, but I was wondering if there is any kind of rule of thumb?

Thanks,
Jiayong Sun
On Sunday, August 15, 2021, 05:58:11 AM PDT, Bowen Song <bo...@bso.ng> wrote:


Hi Jiayong,


Based on this statement:

> We see the commit logs switched about 10 times per minute

I'd also like to know, roughly speaking, how many tables in Cassandra are being written frequently? I'm asking because the commit log segments are being created (and recycled) very frequently (~every 6 seconds), and I suspect that a lot of tables are involved in each commit log segment, which leads to many SSTable flushes.

You could try to drain a node, then remove all commit logs from it and increase the "commitlog_segment_size_in_mb" value to something much larger (say, 1GB), also increasing commitlog_total_space_in_mb accordingly on this node, and see if this helps improve the situation. Note that you may also want to manually set "max_mutation_size_in_kb" to 16MB (the default value is half of the commit log segment size) to prevent unexpectedly large mutations from being accepted on this node and then failing on other nodes. Please also note that this may interfere with backup tools which back up the commit log segments.
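
As a sketch of that sequence on a single node (the service command and commit log path are assumptions for your environment):

    nodetool drain                        # flush memtables; the node stops accepting writes
    sudo service cassandra stop
    rm -f /var/lib/cassandra/commitlog/*  # safe only after a clean drain
    # edit cassandra.yaml: larger commitlog_segment_size_in_mb,
    #   commitlog_total_space_in_mb raised accordingly, max_mutation_size_in_kb: 16384
    sudo service cassandra start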

In addition to that, if you periodically purge the hints, you are probably better off just disabling hinted handoff and making sure you always run repair within gc_grace_seconds.
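
Should you go down that route (and only once repairs are reliably completing within gc_grace_seconds), a sketch of the two ways to disable hinted handoff:

    # at runtime, per node (reverts on restart)
    nodetool disablehandoff

    # or permanently, in cassandra.yaml
    hinted_handoff_enabled: false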


Cheers,

Bowen

On 14/08/2021 03:33, Jiayong Sun wrote:
Hi Bowen,

Thanks for digging into source code so deep.
Here are answers to your questions:

  * Does your application change the table schema frequently?
    Specifically: alter table, drop table and drop keyspace. - No,
    neither admins nor apps frequently alter/drop/create table
    schemas at runtime.
  * Do you have the memtable_flush_period_in_ms table property set to
    non-zero on any table at all? - No, all tables use the default
    "memtable_flush_period_in_ms = 0".
  * Is the timing of frequent small SSTable flushes coincident with
    streaming activities? - The repair job is paused and we don't see
    streaming occurring in system.log.
  * What's your commitlog_segment_size_in_mb and
    commitlog_total_space_in_mb in cassandra.yaml and what's your free
    space size on the disk where commit log is located? -
    "commitlog_segment_size_in_mb: 32". "commitlog_total_space_in_mb"
    is not set. The commit logs have a separate disk and there's no
    chance it'd fill up. The data disk is about 50% used.
  * How fast do you fill up a commit log segment? I.e.: how fast are
    you writing to Cassandra? - We see the commit logs switched about
    10 times per minute, and lots of hints accumulated on disk and
    replayed constantly. This could be due to many nodes being
    unresponsive because of this ongoing issue.
  * Anything invoking the "nodetool flush" or "nodetool drain"
    command? - No, we don't issue these commands unless restarting a
    node manually or automatically through some monitoring mechanism,
    which does not happen frequently.

I wonder if this could be related to the huge amount of hint replay. The cluster is stressed by heavy writes through Spark jobs and there are some hot-spot partitions. Huge amounts of hints (e.g. >50GB per node) are not uncommon in this cluster, especially since this issue has been occurring and causing many nodes to lose gossip. We had to set up a daily cron job to clear the older hints from disk, but we're not sure if this would hurt data consistency among nodes and DCs.

Thoughts?

Thanks,
Jiayong Sun
On Friday, August 13, 2021, 03:39:44 PM PDT, Bowen Song <bo...@bso.ng> wrote:


Hi Jiayong,


I'm sorry to hear that. I did not know many nodes were/are experiencing the same issue. A bit of digging in the source code indicates the log below comes from the ColumnFamilyStore.logFlush() method.

    DEBUG [NativePoolCleaner] <timestamp> ColumnFamilyStore.java:932 -
    Enqueuing flush of sstable_activity: 0.408KiB (0%) on-heap,
    0.154KiB (0%) off-heap

The ColumnFamilyStore.logFlush() method is a private method, and the only place referencing it is the ColumnFamilyStore.switchMemtable() method in the same file, which in turn is referenced in two places - ColumnFamilyStore.reload() and ColumnFamilyStore.switchMemtableIfCurrent(). Ignoring secondary indexes and MVs, the former is only called during node start-up and on table schema changes, so it's unlikely to be our suspect (unless you are frequently changing the schema or restarting the node). The latter is referenced in two methods: ColumnFamilyStore.forceFlush() and ColumnFamilyStore.FlushLargestColumnFamily.run(). The FlushLargestColumnFamily.run() method is only called by the MemtableCleanerThread, and we have pretty much ruled that out in the previous conversations. The forceFlush() method is invoked if the table property memtable_flush_period_in_ms is set, when Cassandra is preparing to send/receive files via streaming, when an old commit log segment is recycled, on "nodetool drain", and, again, on schema changes (drop keyspace/table).

So, my questions would be:

  * Does your application change the table schema frequently?
    Specifically: alter table, drop table and drop keyspace.
  * Do you have the memtable_flush_period_in_ms table property set to
    non-zero on any table at all?
  * Is the timing of frequent small SSTable flushes coincident with
    streaming activities?
  * What's your commitlog_segment_size_in_mb and
    commitlog_total_space_in_mb in cassandra.yaml and what's your free
    space size on the disk where commit log is located?
  * How fast do you fill up a commit log segment? I.e.: how fast are
    you writing to Cassandra?
  * Anything invoking the "nodetool flush" or "nodetool drain" command?

I hope the above questions will help you find the root cause.

Cheers,
Bowen

On 13/08/2021 22:47, Jiayong Sun wrote:
Hi Bowen,

There are many nodes having this issue, and some of them have it repeatedly. Replacing a node by wiping out everything and streaming in a healthy set of SSTables would work, but if we don't know the root cause the node could end up in bad shape again. Yes, we know the Reaper repair taking weeks to run is not good, which is most likely due to the multiple DCs with large rings. We are planning to upgrade to a newer version of Reaper to see if that helps. We do have debug.log turned on but didn't catch anything helpful other than the constant enqueuing/flushing/deleting of memtables and SSTables (I listed a few example messages at the beginning of this email thread).
Thanks for all your thoughts; I really appreciate it.

Thanks,
Jiayong Sun

On Friday, August 13, 2021, 01:36:21 PM PDT, Bowen Song <bo...@bso.ng> wrote:


Hi Jiayong,


That doesn't really match the situation described in the SO question. I suspected it was related to repairing a table with MV and large partitions, but based on the information you've given, I was clearly wrong.

Partitions of a few hundred MB are not exactly unusual; I don't see how that alone could lead to frequent SSTable flushing. A repair session taking weeks to complete is a bit worrying in terms of performance and maintainability, but again, it should not cause this issue.

Since we don't know the cause, I can see two possible solutions - either replace the "broken" node, or dig into the logs (remember to turn on the debug logs) and try to identify the root cause. I personally would recommend replacing the problematic node as a quick win.


Cheers,

Bowen

On 13/08/2021 20:31, Jiayong Sun wrote:
Hi Bowen,

We do have a Reaper repair job scheduled periodically, and it can take days or even weeks to complete one round of repair due to the large number of rings/nodes. However, we have paused the repair since we are facing this issue.
We do not use MVs in this cluster.
There is a major table taking 95% of the disk storage and workload, but its partition size is around 30 MB. There are a couple of small tables with max partition sizes over several hundred MB, but their total data size is just a few GB.

Any thoughts?

Thanks,
Jiayong

On Friday, August 13, 2021, 03:32:45 AM PDT, Bowen Song <bo...@bso.ng> wrote:


Hi Jiayong,


Sorry, I didn't make it clear in my previous email. My comment on the RAID0 setup was only about RAID0 vs JBOD, and not in relation to the SSTable flushing issue. The part of my previous email after the "On the frequent SSTable flush issue" line is the part related to the SSTable flushing issue, and those two questions at the end of it remain valid:

  * Did you run repair?
  * Do you use materialized views?

and, if I may, I'd also like to add another question:

  * Do you have large (> 100 MB) partitions?

Those are the 3 things mentioned in the SO question. I'm trying to find the connections between the issue you are experiencing and the issue described in the SO question.


Cheers,

Bowen


On 13/08/2021 01:36, Jiayong Sun wrote:
Hello Bowen,

Thanks for your response.
Yes, we are aware of the trade-offs between RAID0 and individual JBOD disks, but all of our clusters use this RAID0 configuration through Azure, and only on this cluster do we see this issue, so it's hard to attribute the root cause to the disks. This looks more workload-related, and we are seeking feedback here on any other parameters in the yaml that we could tune for this.

Thanks again,
Jiayong Sun

On Thursday, August 12, 2021, 04:55:51 AM PDT, Bowen Song <bo...@bso.ng> wrote:


Hello Jiayong,


Using multiple disks in a RAID0 for the Cassandra data directory is not recommended. You will get better fault tolerance, and often better performance too, with multiple data directories, one on each disk.

If you stick with RAID0, it's not 4 disks but 1 from Cassandra's point of view, because any read or write operation will have to touch all 4 member disks. Therefore, 4 flush writers doesn't make much sense.
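
For illustration, a JBOD layout in cassandra.yaml would look something like the sketch below (the mount points are assumptions for your VMs); with this layout, losing one disk only takes out the data on that disk rather than the whole striped array:

    data_file_directories:
        - /mnt/disk1/cassandra/data
        - /mnt/disk2/cassandra/data
        - /mnt/disk3/cassandra/data
        - /mnt/disk4/cassandra/data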

On the frequent SSTable flush issue, a quick internet search leads me to:

    * an old bug in Cassandra 2.1 - CASSANDRA-8409
      <https://issues.apache.org/jira/browse/CASSANDRA-8409>, which
      shouldn't affect 3.x at all

    * a StackOverflow question
      <https://stackoverflow.com/questions/61030392/cassandra-node-jvm-hang-during-node-repair-a-table-with-materialized-view>,
      which may be related

Did you run repair? Do you use materialized views?


Regards,

Bowen


On 11/08/2021 15:58, Jiayong Sun wrote:
Hi Erick,

The nodes have 4 SSDs (1 TB each, but we only use 2.4 TB of the space; current disk usage is about 50%) in RAID0. Based on the number of disks, we increased memtable_flush_writers to 4 instead of the default of 2.

For the following we set:
- max heap size - 31GB
- memtable_heap_space_in_mb (use default)
- memtable_offheap_space_in_mb  (use default)

In the logs, we also noticed the system.sstable_activity table has hundreds of MB or GB of data and is constantly flushing:

    DEBUG [NativePoolCleaner] <timestamp> ColumnFamilyStore.java:932 - Enqueuing flush of sstable_activity: 0.293KiB (0%) on-heap, 0.107KiB (0%) off-heap
    DEBUG [NonPeriodicTasks:1] <timestamp> SSTable.java:105 - Deleting sstable: /app/cassandra/data/system/sstable_activity-5a1ff267ace03f128563cfae6103c65e/md-103645-big
    DEBUG [NativePoolCleaner] <timestamp> ColumnFamilyStore.java:1322 - Flushing largest CFS(Keyspace='system', ColumnFamily='sstable_activity') to free up room. Used total: 0.06/1.00, live: 0.00/0.00, flushing: 0.02/0.29, this: 0.00/0.00

Thanks,
Jiayong Sun
On Wednesday, August 11, 2021, 12:06:27 AM PDT, Erick Ramirez <erick.rami...@datastax.com> wrote:


4 flush writers isn't bad since the default is 2. It doesn't make a difference if you have fast disks (like NVMe SSDs) because only 1 thread gets used.

But if flushes are slow, the work gets distributed to 4 flush writers so you end up with smaller flush sizes although it's difficult to tell how tiny the SSTables would be without analysing the logs and overall performance of your cluster.

Was there a specific reason you decided to bump it up to 4? I'm just trying to get a sense of why you did it since it might provide some clues. Out of curiosity, what do you have set for the following?
- max heap size
- memtable_heap_space_in_mb
- memtable_offheap_space_in_mb
