[jira] [Updated] (CASSANDRA-7507) OOM creates unreliable state - die instantly better

Joshua McKenzie (JIRA) Mon, 21 Jul 2014 12:31:27 -0700

     [ 
https://issues.apache.org/jira/browse/CASSANDRA-7507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Joshua McKenzie updated CASSANDRA-7507:
---------------------------------------

    Attachment: exceptionHandlingResults.txt
                findSwallowedExceptions.py

Some quick and dirty parsing of the code base gives me:
{code:title=Parse_Results}
Total Exceptions caught and rethrown as something other than RuntimeException: 69
Total Exceptions caught and rethrown as RuntimeException: 76
Total Exceptions Swallowed: 89
Total Exceptions analyzed: 234
{code}

While this is incredibly rough and doesn't handle a bunch of edge cases in the 
exception-handling blocks, it should be helpful as a general metric of how 
pervasive these blanket catches are within the code base, and of how much work 
we're proposing here.

Attaching findSwallowedExceptions.py and exceptionHandlingResults.txt, which is 
the output from the script.
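
For anyone reading without the attachment, the kind of classification behind 
these counts can be sketched roughly as below. This is not the attached 
findSwallowedExceptions.py; the function names (catch_bodies, classify, 
summarize), the regexes, and the three bucket names are assumptions made for 
illustration. It counts every catch block it finds, and it will be fooled by 
braces inside string literals or comments, among other edge cases.

```python
import re

# Hypothetical pattern: matches "catch (SomeType e) {" including multi-catch
# ("A | B e") and an optional "final" modifier.
CATCH_RE = re.compile(r"catch\s*\(\s*(?:final\s+)?([\w.|\s]+?)\s+\w+\s*\)\s*\{")


def catch_bodies(source):
    """Yield (exception_type, body) for each catch block via naive brace matching."""
    for match in CATCH_RE.finditer(source):
        depth = 1
        pos = match.end()
        # Walk forward until the brace opened by the catch block is closed.
        while pos < len(source) and depth > 0:
            if source[pos] == "{":
                depth += 1
            elif source[pos] == "}":
                depth -= 1
            pos += 1
        yield match.group(1).strip(), source[match.end():pos - 1]


def classify(body):
    """Put a catch body into one of three rough buckets."""
    if re.search(r"\bthrow\s+new\s+RuntimeException\b", body):
        return "rethrown_as_runtime"
    if re.search(r"\bthrow\b", body):
        return "rethrown_as_other"
    # No throw at all: the exception is logged, ignored, or otherwise swallowed.
    return "swallowed"


def summarize(source):
    """Count catch blocks per bucket for one Java source string."""
    counts = {"rethrown_as_other": 0, "rethrown_as_runtime": 0, "swallowed": 0}
    for _exc_type, body in catch_bodies(source):
        counts[classify(body)] += 1
    return counts
```

Calling summarize() on the text of each .java file and summing the returned 
dictionaries would give totals in the same three categories as the results 
above, although the numbers need not match the attached output.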

While I agree that this is something we need to look into, I'm not sure 2.1.1 
and this ticket are the right place to make changes this extensive.  3.0, and a 
separate ticket to go through the code base and clean up blanket exception 
catch blocks so that they don't swallow exceptions and are very specific about 
the logic surrounding rethrows, would work for me.  WDYT [~jbellis]?

> OOM creates unreliable state - die instantly better
> ---------------------------------------------------
>
>                 Key: CASSANDRA-7507
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-7507
>             Project: Cassandra
>          Issue Type: New Feature
>            Reporter: Karl Mueller
>            Assignee: Joshua McKenzie
>            Priority: Minor
>             Fix For: 2.1.1
>
>         Attachments: 7507_v1.txt, 7507_v2.txt, exceptionHandlingResults.txt, 
> findSwallowedExceptions.py
>
>
> I had a Cassandra node run OOM. My heap had enough headroom; there was just 
> something which was either a bug or some unfortunate amount of short-term 
> memory utilization. This resulted in the following error:
>  WARN [StorageServiceShutdownHook] 2014-06-30 09:38:38,251 StorageProxy.java 
> (line 1713) Some hints were not written before shutdown.  This is not 
> supposed to happen.  You should (a) run repair, and (b) file a bug report
> There are no other messages of relevance besides the OOM error about 90 
> minutes earlier.
> My (limited) understanding of the JVM and Cassandra says that when it goes 
> OOM, it will attempt to signal Cassandra to shut down "cleanly". The problem, 
> in my view, is that with an OOM situation, nothing is guaranteed anymore. I 
> believe it's impossible to reliably "cleanly shut down" at this point, and 
> therefore it's wrong to even try. 
> Yes, ideally things could be written out, flushed to disk, memory messages 
> written, other nodes notified, etc. but why is there any reason to believe 
> any of those steps could happen? Would happen? Couldn't bad data be written 
> at this point to disk rather than good data? Some network messages delivered, 
> but not others?
> I think Cassandra should have the option (and possibly the default) to kill 
> itself immediately, in a hard way, when the OOM condition happens, and not 
> rely on the Java-based clean shutdown process. Cassandra already handles 
> recovery from unclean shutdown, and it's not a big deal. My node, for 
> example, stayed in a sort-of alive state for 90 minutes, where who knows what 
> it was doing or not doing.
> I don't know enough about the JVM and options for it to know the best exact 
> implementation of "die instantly on OOM", but it should be something that's 
> possible either with some flags or a C library (which doesn't rely on java 
> memory to do something which it may not be able to get!)
> Short version: a kill -9 of all C* processes in that instance without needing 
> more java memory, when OOM is raised



--
This message was sent by Atlassian JIRA
(v6.2#6252)
