I don't have specific experience with InstanceTidier, but when I saw this I immediately thought of repairs blowing up the heap. 40K instances suggests to me that you have thousands of SSTables. Are they tiny (1MB or less)? If not, are these dense nodes (~1TB or more)?
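If it helps to answer the SSTable question, here is a rough sketch (not an official tool) that counts the `*-Data.db` files under a data directory and reports how many are at or below a size threshold. The function name `summarize_sstables` and the 1MB default are my own choices for illustration, and it assumes you want snapshot and backup copies excluded.

```python
from pathlib import Path


def summarize_sstables(data_dir, tiny_bytes=1024 * 1024):
    """Count SSTable data files under data_dir and how many are 'tiny'.

    Hypothetical helper: each SSTable has one *-Data.db component, so
    counting those files approximates the number of SSTables on disk.
    """
    root = Path(data_dir)
    sizes = []
    for path in root.rglob("*-Data.db"):
        # Skip snapshot/backup copies, which are not live SSTables.
        parts = path.relative_to(root).parts
        if "snapshots" in parts or "backups" in parts:
            continue
        sizes.append(path.stat().st_size)
    return {
        "count": len(sizes),
        "tiny": sum(1 for size in sizes if size <= tiny_bytes),
        "total_bytes": sum(sizes),
    }
```

You would call it with your own data directory, for example `summarize_sstables("/var/lib/cassandra/data")` if you use the common package default (an assumption on my part, check `data_file_directories` in your cassandra.yaml). A large `count` with most of them in `tiny` would point to lots of small SSTables, while a very large `total_bytes` would point to a dense node.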
How do you run repairs? I'm wondering whether multiple repairs are running in parallel, for example a cron job kicking in while the previous repair is still running. You didn't specify your C* version, but my guess is that it's pre-3.11.5. FWIW, the repair issue I'm referring to is CASSANDRA-14096 [1].

[1] https://issues.apache.org/jira/browse/CASSANDRA-14096