Hey everyone,

We're stress testing writes for a few counter CFs and noticed that on one
node the ReplicateOnWriteStage thread pool backed up to the point where it
started blocking tasks. This cluster is six nodes, RF=3, running 1.2.9.
All CFs use LCS with 160 MB sstables. All writes were at CL.ONE.

A few questions:

   1. What causes a RoW (replicate on write) task to be blocked? The queue
   maxes out at 4128, which seems to be 32 * (128 + 1), where 32 is the
   number of concurrent_writes we have configured.

   2. Given this is a counter CF, can those dropped RoWs be repaired with
   "nodetool repair"? From my understanding of how counter writes work, until
   we run that repair, if we're not reading at CL.ALL or using
   read_repair_chance = 1, we will get some incorrect reads, but a repair
   will fix things. Is that right?

   3. The CPU on the node where the number of blocked tasks started
   increasing was pegged, but I/O was not saturated. There were compactions
   running on those column families as well. Is there a setting we could
   consider altering that might prevent that backup, or is the answer
   likely, "increase the number of nodes to get more throughput"?
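
For reference, here is the arithmetic behind questions 1 and 2 as a small
sketch. The queue formula is only our guess from the observed number, not
something we confirmed in the Cassandra source, and the function names are
made up for illustration. The second function is the usual R + W > RF
overlap rule, which may not capture every detail of how counters behave.

```python
def guessed_queue_cap(threads: int, slots_per_thread: int = 128) -> int:
    """Our guess at the observed cap: threads * (128 + 1).

    Assumption: inferred from the 4128 we saw, not from the source code.
    """
    return threads * (slots_per_thread + 1)


def read_overlaps_write(rf: int, write_replicas: int, read_replicas: int) -> bool:
    """Standard overlap rule: a read sees the latest write if R + W > RF."""
    return read_replicas + write_replicas > rf


# 32 threads gives the cap we observed.
print(guessed_queue_cap(32))

# RF=3, writes at CL.ONE (W=1): only reads at CL.ALL (R=3) overlap.
print(read_overlaps_write(rf=3, write_replicas=1, read_replicas=3))
print(read_overlaps_write(rf=3, write_replicas=1, read_replicas=1))
```

With our settings, the first call gives 4128, and only the CL.ALL read
satisfies the overlap rule, which is why we expect some incorrect reads
until a repair runs.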


Thanks in advance for any insights!

Andrew
