[
https://issues.apache.org/jira/browse/CASSANDRA-7507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14162837#comment-14162837
]
Aleksey Yeschenko commented on CASSANDRA-7507:
----------------------------------------------
The Python part should go into our CI system, probably - I don't want to
complicate our build script further.
+1 to the Java part, with some (subjective) nits:
- I'd rename JVMStabilityInspector#inspectThrowable() to
OOMInspector#maybeExit(), or something more obvious, like that. I get that the
intent is to make it future-proof, but I don't see any other exceptions that we
would consider unrecoverable, TBH. When and if that happens, we can generalize.
- For the same reason, for now we should remove the isUnstable variable and
just do if (t instanceof OutOfMemoryError), + make the logged message mention
OOM, and not be generic, until and unless we add another exception to the list
of unrecoverables.
- should probably use logger.fatal() instead of logger.error()
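
A minimal sketch of what the nits above could look like when combined. This
is not the attached patch: the exit status, the use of Runtime.halt(), the
boolean return value, the swappable exitAction field and the use of
java.util.logging (with SEVERE standing in for a fatal level) are all
assumptions made for illustration.

```java
import java.util.function.IntConsumer;
import java.util.logging.Level;
import java.util.logging.Logger;

/** Hypothetical sketch of the narrower, OOM-only inspector suggested above. */
public final class OOMInspector {
    private static final Logger logger = Logger.getLogger(OOMInspector.class.getName());

    // Assumption: the exit action is swappable so the sketch can be exercised
    // without actually terminating the JVM. halt() skips shutdown hooks, which
    // matches the "do not attempt a clean shutdown" request in the ticket.
    public static volatile IntConsumer exitAction = status -> Runtime.getRuntime().halt(status);

    private OOMInspector() {}

    /**
     * Terminates the process if t is an OutOfMemoryError; otherwise does nothing.
     * Returns true if the exit action was invoked (only observable when the
     * action has been replaced with something that returns).
     */
    public static boolean maybeExit(Throwable t) {
        if (t instanceof OutOfMemoryError) {
            // The message names OOM explicitly rather than a generic "unstable JVM".
            logger.log(Level.SEVERE, "OutOfMemoryError encountered; exiting immediately", t);
            exitAction.accept(100); // 100 is an arbitrary status chosen for this sketch
            return true;
        }
        return false;
    }
}
```

Call sites would then do something like catch (Throwable t) {
OOMInspector.maybeExit(t); ... } wherever exceptions are currently swallowed.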
> OOM creates unreliable state - die instantly better
> ---------------------------------------------------
>
> Key: CASSANDRA-7507
> URL: https://issues.apache.org/jira/browse/CASSANDRA-7507
> Project: Cassandra
> Issue Type: New Feature
> Reporter: Karl Mueller
> Assignee: Joshua McKenzie
> Priority: Minor
> Fix For: 2.1.1
>
> Attachments: 7507_v1.txt, 7507_v2.txt, 7507_v3_build.txt,
> 7507_v3_java.txt, exceptionHandlingResults.txt, findSwallowedExceptions.py
>
>
> I had a cassandra node run OOM. My heap had enough headroom, there was just
> something which either was a bug or some unfortunate amount of short-term
> memory utilization. This resulted in the following error:
> WARN [StorageServiceShutdownHook] 2014-06-30 09:38:38,251 StorageProxy.java
> (line 1713) Some hints were not written before shutdown. This is not
> supposed to happen. You should (a) run repair, and (b) file a bug report
> There are no other messages of relevance besides the OOM error about 90
> minutes earlier.
> My (limited) understanding of the JVM and Cassandra says that when it goes
> OOM, it will attempt to signal cassandra to shut down "cleanly". The problem,
> in my view, is that with an OOM situation, nothing is guaranteed anymore. I
> believe it's impossible to reliably "cleanly shut down" at this point, and
> therefore it's wrong to even try.
> Yes, ideally things could be written out, flushed to disk, memory messages
> written, other nodes notified, etc. but why is there any reason to believe
> any of those steps could happen? Would happen? Couldn't bad data be written
> at this point to disk rather than good data? Some network messages delivered,
> but not others?
> I think Cassandra should have the option (possibly enabled by default) to
> kill itself hard and immediately when the OOM condition occurs, and not
> rely on the java-based clean shutdown process. Cassandra already handles
> recovery from unclean shutdown, and it's not a big deal. My node, for
> example, stayed in a sort-of alive state for 90 minutes where who knows what
> it was doing or not doing.
> I don't know enough about the JVM and options for it to know the best exact
> implementation of "die instantly on OOM", but it should be something that's
> possible either with some flags or a C library (which doesn't rely on java
> memory to do something which it may not be able to get!)
> Short version: a kill -9 of all C* processes in that instance without needing
> more java memory, when OOM is raised