It happened again today and I had a bit more time to probe. It seems
all non-periodic tasks execute on a single thread, so if that thread were
to get stuck, work would simply pile up until the node runs out of memory.
I did a series of stack dumps and they always looked something like this:
It did look like there were repairs running at the time. The
LiveSSTableCount for the entire node is about 2200 tables; for the keyspace
that was being repaired it's just 150.
We run Cassandra 3.11.6, so we should be unaffected by CASSANDRA-14096.
We use http://cassandra-reaper.io/ for the repairs
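
To make the failure mode I suspect a bit more concrete, here is a minimal
sketch in plain java.util.concurrent. It is not Cassandra's actual executor
code; the class and method names are made up for illustration. It only shows
the general pattern: one worker thread, an unbounded queue, and a task that
never returns, so everything submitted afterwards just sits on the heap.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Hypothetical illustration, not taken from the Cassandra code base.
public class SingleThreadBacklog {

    // Blocks the only worker with one task, submits `extra` more tasks,
    // and returns how many of them are waiting in the queue.
    static int backlogAfter(int extra) {
        // One worker thread and an unbounded queue: nothing is ever rejected.
        ThreadPoolExecutor executor = new ThreadPoolExecutor(
                1, 1, 0L, TimeUnit.MILLISECONDS, new LinkedBlockingQueue<>());
        CountDownLatch started = new CountDownLatch(1);
        CountDownLatch release = new CountDownLatch(1);
        try {
            // The "stuck" task: it occupies the worker until released.
            executor.execute(() -> {
                started.countDown();
                try {
                    release.await();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            started.await(); // make sure the worker is busy before we continue

            // Everything submitted from here on can only queue up.
            for (int i = 0; i < extra; i++) {
                executor.execute(() -> { });
            }
            int queued = executor.getQueue().size();

            // Unblock the worker so the JVM can shut down cleanly.
            release.countDown();
            executor.shutdown();
            executor.awaitTermination(5, TimeUnit.SECONDS);
            return queued;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("tasks queued behind the stuck one: " + backlogAfter(1000));
    }
}
```

In the sketch the queue is drained once the blocking task is released. On a
real node where the thread never comes back, the queue would keep growing,
which would be consistent with what I see in the heap.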
Oh, I just saw on ASF Slack that you were already discussing it earlier
today with driftx in the #cassandra channel. Cheers!
>
I don't have specific experience relating to InstanceTidier but when I saw
this, I immediately thought of repairs blowing up the heap. 40K instances
indicates to me that you have thousands of SSTables -- are they tiny (like
1MB or less)? Otherwise, are they dense nodes (~1TB or more)?
How do you