Hello -

I have a 48-node C* cluster spread across 4 AWS regions with RF=3. A few
months ago I noticed that disk usage on some nodes was growing steadily. At
first I worked around the problem by destroying and rebuilding the affected
nodes, but the problem keeps coming back.

I did some more investigation recently, and this is what I found:
- I narrowed the problem down to a CF that uses TWCS, simply by looking at
disk space usage
- in each region, 3 nodes have this problem of growing disk space, which
matches the replication factor
- on each node, I tracked the problem down to a particular SSTable using
`sstableexpiredblockers` (the commands I ran are sketched after this list)
- in that SSTable, using `sstabledump`, I found a row that, unlike all the
other rows, has no TTL; it appears to be from someone else on the team
testing something and forgetting to include a TTL
- every other row shows "expired: true" except this one, hence my suspicion
- when I query for that row's partition key, I get no results
- I tried deleting the row anyway, but that didn't seem to change anything
- I also tried `nodetool scrub`, but that didn't help either
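
For reference, this is roughly how I found the SSTable (keyspace, table,
and file names below are placeholders; `my_ks`/`events` and the SSTable
generation number stand in for the real ones):

    # List SSTables that are blocking fully expired TWCS SSTables
    # from being dropped
    sstableexpiredblockers my_ks events

    # Dump the blocking SSTable it pointed at (actual path/generation
    # differs per node)
    sstabledump /var/lib/cassandra/data/my_ks/events-*/mc-4242-big-Data.db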
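
And the follow-up attempts, again with placeholder names and a made-up
partition key value:

    # The partition key I saw in sstabledump returns nothing via CQL
    cqlsh -e "SELECT * FROM my_ks.events WHERE id = 'abc123';"

    # Tried deleting it anyway; no visible change in disk usage
    cqlsh -e "DELETE FROM my_ks.events WHERE id = 'abc123';"

    # Scrubbed the table on the affected node; no change either
    nodetool scrub my_ks events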

Would this rogue row without a TTL explain the problem? If so, why? If not,
does anyone have other ideas? And why does the row show up in `sstabledump`
but not when I query for it?

I appreciate any help or suggestions!

- Mike
