memtable flushing can block writes due to queue size limitations even though
overall write throughput is below capacity
-----------------------------------------------------------------------------------------------------------------------
Key: CASSANDRA-1955
URL: https://issues.apache.org/jira/browse/CASSANDRA-1955
Project: Cassandra
Issue Type: Improvement
Reporter: Peter Schuller
It seems that someone ran into this (see cassandra-user thread "Question re:
the use of multiple ColumnFamilies").
If my interpretation is correct, in ColumnFamilyStore the queue size is set to the
concurrency (the number of CPU cores) for the flushSorter, and to
memtable_flush_writers for the flushWriter.
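For illustration only, here is a minimal sketch (not the actual Cassandra code; names
and sizes are mine) of the kind of bounded executor involved: a pool whose work queue
is only as deep as its thread count, with a rejection handler that blocks the
submitter when the queue is full. Once (threads + queue slots) flushes are in flight,
the next submission blocks the write path.

    import java.util.concurrent.*;

    public class BoundedFlushExecutorSketch {
        // Hypothetical stand-in for the flushSorter: queue depth == thread count.
        static ExecutorService makeBoundedExecutor(int threads) {
            return new ThreadPoolExecutor(
                    threads, threads,
                    0L, TimeUnit.MILLISECONDS,
                    new ArrayBlockingQueue<Runnable>(threads), // queue sized to concurrency
                    // When the queue is full, block the caller (the write path)
                    // instead of throwing RejectedExecutionException.
                    new RejectedExecutionHandler() {
                        public void rejectedExecution(Runnable r, ThreadPoolExecutor e) {
                            try {
                                e.getQueue().put(r);
                            } catch (InterruptedException ie) {
                                Thread.currentThread().interrupt();
                                throw new RejectedExecutionException(ie);
                            }
                        }
                    });
        }

        public static void main(String[] args) {
            ExecutorService flushSorter =
                    makeBoundedExecutor(Runtime.getRuntime().availableProcessors());
            // Each task stands in for one memtable flush; if many CFs happen to
            // flush at roughly the same time, later submissions block the writer.
            for (int i = 0; i < 32; i++) {
                final int n = i;
                flushSorter.submit(new Runnable() {
                    public void run() {
                        try { Thread.sleep(100); } catch (InterruptedException e) { }
                        System.out.println("flushed memtable " + n);
                    }
                });
            }
            flushSorter.shutdown();
        }
    }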
While the choice of concurrency for the two executors makes perfect sense, the
queue sizing does not. As a user, I would expect, and did expect, that for a
given CF whose memtable is independently tuned (w.r.t. flushing thresholds etc.),
writes to that CF would not block until there is at least one other memtable
*for that CF* already waiting to be flushed.
With the current behavior, if I am not misinterpreting it, whether or not writes
will inappropriately block depends not just on the overall write throughput, but
also on the incidental timing of memtable flushes across multiple column
families.
The simplest way to mitigate (but not fix) this is probably to set the queue
size to the number of column families whenever that is higher than the number
of CPU cores. But that is only a mitigation, because nothing prevents e.g. a
large number of memtable flushes for a small column family under temporary
write load from blocking a large (possibly more important) memtable flush for
another CF. Such a shared-but-larger queue would also not prevent heap usage
spikes when a single CF with very large memtable thresholds is rapidly written
to while the queue is sized for lots of CFs that are in practice not used. In
other words, this mitigation would effectively negate the backpressure
mechanism in some cases and likely lead to more people having OOM issues when
saturating a CF with writes.
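For reference, that mitigation amounts to roughly the following sizing change
(hypothetical variable name; the real setup lives in ColumnFamilyStore):

    // Mitigation (see option (a) below): one queue slot per CF, but never fewer
    // slots than there are sorter threads (== cores).
    int sorterQueueSize = Math.max(Runtime.getRuntime().availableProcessors(),
                                   numberOfColumnFamilies);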
A more involved change is to give each CF its own queue through which flushes go
prior to being submitted to the flushSorter, which would guarantee that at least
one memtable can always be in a pending flush state for a given CF. The global
queue could then effectively have size 1 hard-coded, since it is no longer
really used as a queue.
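As a rough sketch of the per-CF queue idea (class and method names are mine, not a
patch; it assumes the shared sorter blocks submitters when its own size-1 queue is
full, as in the earlier sketch): each CF owns a single-slot queue in front of the
shared sorter, so a writer blocks only once *that* CF already has a memtable waiting.

    import java.util.concurrent.*;

    public class PerCfFlushQueueSketch {
        // At most one memtable can wait here, independent of other CFs.
        private final BlockingQueue<Runnable> pendingFlush =
                new ArrayBlockingQueue<Runnable>(1);

        public PerCfFlushQueueSketch(final Executor sharedSorter) {
            // Forwarder: moves this CF's pending flush into the shared sorter.
            // Any waiting for global queue space happens here, not on the write path.
            Thread forwarder = new Thread(new Runnable() {
                public void run() {
                    try {
                        while (true) {
                            sharedSorter.execute(pendingFlush.take());
                        }
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                }
            });
            forwarder.setDaemon(true);
            forwarder.start();
        }

        // Called from the write path when this CF's memtable fills up; blocks
        // only if this CF already has one memtable queued for flushing.
        public void submitFlush(Runnable flush) throws InterruptedException {
            pendingFlush.put(flush);
        }
    }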
The flushWriter is unaffected, since it is a separate concern that is supposed to
be I/O bound. Its current behavior would not be perfect if there were a huge
discrepancy between the flush thresholds of different memtables, but in practice
a change there does not seem high priority.
So, I propose either:
(a) changing the flushSorter queue size to be max(num cores, num cfs)
(b) creating a per-cf queue
I'll volunteer to work on it as a nice bite-sized change, assuming there is
agreement on what needs to be done. Given the concerns with (a), I think (b) is
the right solution unless it turns out to cause major complexity. Worth noting
is that these paths are not performance sensitive given the low frequency of
memtable flushes, so an extra queueing step should not be an issue.