Karl Mueller created CASSANDRA-7507:
---------------------------------------
Summary: OOM creates unreliable state - die instantly better
Key: CASSANDRA-7507
URL: https://issues.apache.org/jira/browse/CASSANDRA-7507
Project: Cassandra
Issue Type: New Feature
Reporter: Karl Mueller
Priority: Minor
I had a Cassandra node run out of memory (OOM). My heap normally had enough
headroom; the cause was either a bug or an unfortunate short-term spike in
memory utilization. This resulted in the following message:
WARN [StorageServiceShutdownHook] 2014-06-30 09:38:38,251 StorageProxy.java
(line 1713) Some hints were not written before shutdown. This is not supposed
to happen. You should (a) run repair, and (b) file a bug report
There are no other messages of relevance besides the OOM error about 90 minutes
earlier.
My (limited) understanding of the JVM and Cassandra is that when the process
goes OOM, it will attempt to signal Cassandra to shut down "cleanly". The
problem, in my view, is that in an OOM situation nothing is guaranteed anymore.
I believe it's impossible to reliably "cleanly shut down" at this point, and
therefore it's wrong to even try.
Yes, ideally things could be written out, flushed to disk, memory messages
written, other nodes notified, etc. But is there any reason to believe that any
of those steps could happen, or would happen? Couldn't bad data be written to
disk at this point rather than good data? Couldn't some network messages be
delivered, but not others?
I think Cassandra should have the option (and possibly the default) to
hard-kill itself immediately when the OOM condition occurs, rather than rely on
the Java-based clean shutdown process. Cassandra already handles recovery from
unclean shutdown, and it's not a big deal. My node, for example, stayed in a
sort-of-alive state for 90 minutes, during which who knows what it was or
wasn't doing.
I don't know enough about the JVM and its options to know the best exact
implementation of "die instantly on OOM", but it should be possible either with
some flags or with a C library (which doesn't need to allocate Java memory that
it may not be able to get!).
Short version: when OOM is raised, kill -9 all C* processes in that instance,
without needing more Java memory.
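
To make the request concrete, here is a minimal sketch of one way to "die
instantly" from inside the JVM. It is not Cassandra's code: the class name
HaltOnOom, the exit code 100, and the cause-chain depth limit are all
hypothetical. It relies on two standard JDK calls,
Thread.setDefaultUncaughtExceptionHandler and Runtime.halt. Unlike
System.exit, Runtime.halt terminates the JVM without running shutdown hooks,
which is the "don't even try to shut down cleanly" behaviour described above.
Note the limitation: this only sees an OutOfMemoryError that escapes a thread
uncaught, not one that is caught and swallowed elsewhere. HotSpot also has a
flag, -XX:OnOutOfMemoryError="kill -9 %p", that runs an external command on the
first OOM, although starting that command may itself need memory.

```java
/**
 * Sketch only: halt the JVM as soon as an OutOfMemoryError escapes any thread.
 * All names and constants here are hypothetical, chosen for illustration.
 */
final class HaltOnOom {
    /** Hypothetical exit status used to mark "halted because of OOM". */
    static final int OOM_EXIT_CODE = 100;

    /** Bound on cause-chain traversal, so a cyclic chain cannot loop forever. */
    static final int MAX_CAUSE_DEPTH = 32;

    /** Returns true if t, or one of its causes, is an OutOfMemoryError. */
    static boolean isOom(Throwable t) {
        Throwable current = t;
        for (int depth = 0; current != null && depth < MAX_CAUSE_DEPTH; depth++) {
            if (current instanceof OutOfMemoryError) {
                return true;
            }
            current = current.getCause();
        }
        return false;
    }

    /** Installs the handler once at startup, while memory is still available. */
    static void install() {
        Thread.setDefaultUncaughtExceptionHandler((thread, error) -> {
            if (isOom(error)) {
                // halt() skips shutdown hooks and finalizers: no flushing,
                // no hint writing, no attempt at a "clean" shutdown.
                Runtime.getRuntime().halt(OOM_EXIT_CODE);
            }
        });
    }

    public static void main(String[] args) {
        install();
        // Demonstrates the check only; nothing is thrown, so the JVM is not halted.
        System.out.println(isOom(new RuntimeException(new OutOfMemoryError("simulated"))));
    }
}
```

The handler is created in install(), before any memory pressure, so the OOM
path itself only performs an instanceof check and the halt call.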