Full disk can result in being marked down
-----------------------------------------

                 Key: CASSANDRA-809
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-809
             Project: Cassandra
          Issue Type: Bug
    Affects Versions: 0.5, 0.6, 0.7
            Reporter: Ryan King


We had a node fill up the disk under one of two data directories. The result 
was that the node stopped making progress. The problem appears to be this (I'll 
update with more details as we find them):

When new tasks are put onto most queues in Cassandra, if there isn't a thread 
in the pool to handle the task immediately, the task is run in the caller's 
thread (org.apache.cassandra.concurrent.DebuggableThreadPoolExecutor:69 sets 
the caller-runs policy). The queue in question here is the one that manages 
flushes, and tasks are enqueued onto it from various places in our code (and 
therefore likely from multiple threads). Assuming that the full disk meant 
that no flushing thread could make progress (it appears that way), eventually 
any thread that calls the flush code would become stalled.
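
The caller-runs behaviour described above can be reproduced with a plain 
java.util.concurrent.ThreadPoolExecutor. This is only a sketch: the pool 
size, queue size, class name and the latch standing in for a full disk are 
all assumptions made for illustration, not Cassandra's actual configuration.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class CallerRunsDemo {
    /** Returns the name of the thread that ended up running the overflow task. */
    public static String threadThatRanOverflowTask() {
        // Hypothetical sizing: one worker, one queue slot, caller-runs on overflow.
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                1, 1, 0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<Runnable>(1),
                new ThreadPoolExecutor.CallerRunsPolicy());
        CountDownLatch diskFreed = new CountDownLatch(1);
        // Stand-in for a flush that cannot finish while the disk is full.
        Runnable stuckFlush = () -> {
            try {
                diskFreed.await();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        };
        String[] ranOn = new String[1];
        try {
            pool.execute(stuckFlush); // occupies the only worker thread
            pool.execute(stuckFlush); // fills the only queue slot
            // No worker and no queue slot left: the policy runs this in the caller.
            pool.execute(() -> ranOn[0] = Thread.currentThread().getName());
        } finally {
            diskFreed.countDown(); // release the stuck tasks so the pool can exit
            pool.shutdown();
        }
        return ranOn[0];
    }

    public static void main(String[] args) {
        System.out.println("overflow task ran on: " + threadThatRanOverflowTask());
    }
}
```

If the third task were itself a flush blocked on the full disk, the 
submitting thread would block with it, which matches the stall we saw.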

Assuming our analysis is right (and we're still looking into it) we need to 
make a change. Here's a proposal so far:

SHORT TERM:
* Change the ThreadPoolExecutor policy to not be caller-runs. This will let 
other threads make progress in the event that one pool is stalled.
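
One possible shape for the short-term change, as a sketch only: the proposal 
does not say which policy should replace caller-runs, so the choice of 
ThreadPoolExecutor.AbortPolicy here is an assumption, as are the class name, 
the tryEnqueue helper and the pool sizing. With AbortPolicy a saturated pool 
rejects the task instead of running it in the submitting thread.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class AbortPolicyDemo {
    /** Tries to enqueue a task; returns false instead of running it in the caller. */
    public static boolean tryEnqueue(ThreadPoolExecutor pool, Runnable task) {
        try {
            pool.execute(task);
            return true;
        } catch (RejectedExecutionException e) {
            return false; // pool saturated: the caller decides what to do next
        }
    }

    /** Submits three stuck tasks to a pool with one worker and one queue slot. */
    public static boolean[] demo() {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                1, 1, 0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<Runnable>(1),
                new ThreadPoolExecutor.AbortPolicy());
        CountDownLatch diskFreed = new CountDownLatch(1);
        // Stand-in for a flush that cannot finish while the disk is full.
        Runnable stuckFlush = () -> {
            try {
                diskFreed.await();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        };
        try {
            return new boolean[] {
                tryEnqueue(pool, stuckFlush), // taken by the worker
                tryEnqueue(pool, stuckFlush), // held in the queue
                tryEnqueue(pool, stuckFlush)  // rejected, caller is not blocked
            };
        } finally {
            diskFreed.countDown(); // release the stuck tasks so the pool can exit
            pool.shutdown();
        }
    }
}
```

The trade-off is that a rejected flush has to be retried or reported by the 
caller, so this keeps other threads moving but does not fix the full disk.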

LONG TERM:
* It appears that there are n threads for n data directories that we flush to, 
but they're not dedicated to a data directory. We should have a thread per data 
directory and have that thread dedicated to that directory.
* Perhaps we could use the failure detector on disks?
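
The thread-per-data-directory idea could look roughly like the following. 
This is a minimal sketch, not existing Cassandra code: the class 
PerDirectoryFlushers, its methods and the directory names are hypothetical, 
and each directory simply gets its own single-thread executor.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

public class PerDirectoryFlushers {
    // One dedicated single-thread executor per data directory.
    private final Map<String, ExecutorService> flushers = new LinkedHashMap<>();

    public PerDirectoryFlushers(List<String> dataDirectories) {
        for (String dir : dataDirectories) {
            flushers.put(dir, Executors.newSingleThreadExecutor());
        }
    }

    /** Queues a flush on the thread dedicated to the given directory. */
    public Future<?> submitFlush(String dir, Runnable flush) {
        ExecutorService flusher = flushers.get(dir);
        if (flusher == null) {
            throw new IllegalArgumentException("unknown data directory: " + dir);
        }
        return flusher.submit(flush);
    }

    public void shutdownNow() {
        for (ExecutorService flusher : flushers.values()) {
            flusher.shutdownNow();
        }
    }

    /** True if a flush to one directory completes while another is stalled. */
    public static boolean healthyDirectoryStillFlushes() {
        PerDirectoryFlushers flushers =
                new PerDirectoryFlushers(Arrays.asList("/data1", "/data2"));
        CountDownLatch diskFreed = new CountDownLatch(1);
        try {
            // Stand-in for a flush stuck on a full disk under /data1.
            Future<?> stalled = flushers.submitFlush("/data1", () -> {
                try {
                    diskFreed.await();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            Future<?> healthy = flushers.submitFlush("/data2", () -> { });
            healthy.get(5, TimeUnit.SECONDS);
            return healthy.isDone() && !stalled.isDone();
        } catch (Exception e) {
            return false;
        } finally {
            diskFreed.countDown();
            flushers.shutdownNow();
        }
    }
}
```

With this layout a full disk only stalls the thread that owns that 
directory; flushes routed to the other directory keep completing.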


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.