I've been seeing some undesirable behavior that looks like a bug and
causes a large amount of wasted work.

I have a CF where all columns have a TTL, are generally all inserted within
a very short period of time (less than a second), and are never overwritten
or explicitly deleted.  Eventually one node runs a compaction and removes
rows containing only tombstones older than gc_grace_seconds, which is
expected.
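
For reference, the write pattern looks roughly like this (a minimal sketch
using pycassa over Thrift; the keyspace, CF, and column names are just
placeholders, not my actual schema):

    import pycassa

    pool = pycassa.ConnectionPool('my_keyspace')   # placeholder keyspace
    cf = pycassa.ColumnFamily(pool, 'my_cf')       # placeholder CF

    # All columns for a row are written in one quick burst, each with the
    # same TTL, and are never overwritten or deleted afterwards.
    row_key = 'some-row-key'
    columns = dict(('col%d' % i, 'value') for i in range(100))
    cf.insert(row_key, columns, ttl=86400)         # e.g. a 1-day TTL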

The problem comes up when a repair is run.  During the repair, the other
nodes that haven't run a compaction and still hold the tombstoned rows
"fix" the inconsistency and stream those rows (which contain only a
tombstone more than gc_grace_seconds old) back to the node that had
compacted them away.  This happens over and over and wastes a lot of time,
storage, and bandwidth repairing rows that are intentionally missing.

I think the issue stems from the interaction between compaction of TTL'd
rows and repair.  Compacting away a TTL'd row is a node-local event, so the
tombstoned rows eventually disappear from the one node doing the compaction
and then get "repaired" back from replicas later.  I suppose the same thing
could happen for rows that are explicitly deleted.
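
To put rough numbers on the timeline I'm describing (back-of-the-envelope
only; the TTL is illustrative and gc_grace_seconds is the default):

    # When a single TTL'd column becomes purgeable (all times in seconds):
    insert_time      = 0
    ttl              = 86400     # e.g. a 1-day TTL
    gc_grace_seconds = 864000    # 10 days, the default

    becomes_tombstone = insert_time + ttl                      # column expires
    purgeable_at      = becomes_tombstone + gc_grace_seconds   # compaction may drop it

    # After purgeable_at, whichever node compacts first drops the row, while
    # replicas that haven't compacted yet still hold the tombstone, so the
    # next repair streams it straight back.
    print becomes_tombstone, purgeable_at    # 86400 950400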

Is this a feature or a bug?  How can I avoid repairing rows that were
correctly removed via compaction on one node but not on its replicas,
simply because compactions run independently on each node?  Every repair
ends up streaming tens of gigabytes of "missing" rows to and from replicas.

Cassandra 1.1.5 with the size-tiered compaction strategy and RF=3

-Bryan
