[
https://issues.apache.org/jira/browse/CASSANDRA-12991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15750908#comment-15750908
]
Benjamin Roth edited comment on CASSANDRA-12991 at 12/15/16 9:43 AM:
---------------------------------------------------------------------
I created a little script to calculate some possible scenarios:
https://gist.github.com/brstgt/447533208f6afa25a77c9a963b114f99
Output:
{quote}
* Calculates the likeliness of a race condition leading to unnecessary repairs
* @see https://issues.apache.org/jira/browse/CASSANDRA-12991
*
* This assumes that all writes are equally spread over all token ranges
* and there is one subrange repair executed for each existing token range
3 Nodes, 256 Tokens / Node, 1ms Mutation Latency, 1ms Validation Latency, 1000 req/s
Total Ranges: 768
Likeliness for RC per range: 0.39%
Unnecessary range repairs per repair: 3.00

3 Nodes, 256 Tokens / Node, 10ms Mutation Latency, 1ms Validation Latency, 1000 req/s
Total Ranges: 768
Likeliness for RC per range: 1.56%
Unnecessary range repairs per repair: 12.00

8 Nodes, 256 Tokens / Node, 10ms Mutation Latency, 1ms Validation Latency, 5000 req/s
Total Ranges: 2048
Likeliness for RC per range: 2.93%
Unnecessary range repairs per repair: 60.00

8 Nodes, 256 Tokens / Node, 20ms Mutation Latency, 1ms Validation Latency, 5000 req/s
Total Ranges: 2048
Likeliness for RC per range: 5.37%
Unnecessary range repairs per repair: 110.00
{quote}
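The exact formula in the gist may differ, but the figures above can be reproduced with a very simple model: writes are spread evenly over all token ranges, and a range counts as "racy" if a request hits it within a window of (mutation latency + 2 * validation latency) around the validation start. A minimal reconstruction in Java (class and method names are illustrative, this is not the actual script):
{code:java}
/**
 * Hypothetical reconstruction of the race-condition estimate, assuming a
 * per-request race window of (mutation latency + 2 * validation latency)
 * and writes spread evenly over all token ranges.
 */
public class RepairRaceEstimate
{
    static void estimate(int nodes, int tokensPerNode, double mutationLatencyMs,
                         double validationLatencyMs, double reqPerSec)
    {
        int totalRanges = nodes * tokensPerNode;
        double reqPerRangePerSec = reqPerSec / totalRanges;
        double raceWindowSec = (mutationLatencyMs + 2 * validationLatencyMs) / 1000.0;
        // Expected requests hitting this range inside the race window; for small
        // values this approximates the race probability per range.
        double raceProbabilityPerRange = reqPerRangePerSec * raceWindowSec;

        System.out.printf("%d Nodes, %d Tokens / Node, %.0fms Mutation Latency, %.0fms Validation Latency, %.0f req/s%n",
                          nodes, tokensPerNode, mutationLatencyMs, validationLatencyMs, reqPerSec);
        System.out.printf("Total Ranges: %d%n", totalRanges);
        System.out.printf("Likeliness for RC per range: %.2f%%%n", raceProbabilityPerRange * 100);
        System.out.printf("Unnecessary range repairs per repair: %.2f%n", raceProbabilityPerRange * totalRanges);
    }

    public static void main(String[] args)
    {
        estimate(3, 256, 1, 1, 1000);
        estimate(3, 256, 10, 1, 1000);
        estimate(8, 256, 10, 1, 5000);
        estimate(8, 256, 20, 1, 5000);
    }
}
{code}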
You may ask why I entered latencies like 10ms or 20ms, which seem quite high. They are indeed high for regular tables on a cluster that is not overloaded. Under those conditions, latency is dominated by network latency, so 1ms seems quite fair to me.
As soon as you use MVs and the cluster tends towards overload, higher latencies are not unrealistic.
You have to take into account that an MV operation does a read before write, so its latency can vary considerably. For MVs the latency is no longer dominated (only) by network latency but by MV lock acquisition and the read before write. Both factors can introduce MUCH higher latencies, depending on concurrent operations on the MV, the number of SSTables, the compaction strategy, basically everything that affects read performance.
If your cluster is overloaded, these effects have an even bigger impact.
I have observed MANY situations on our production system where writes timed out during streaming because of lock contention and/or read-before-write (RBW) impacts. These situations mainly pop up during repair sessions, when streams cause bulk mutation applies (see the StreamReceiverTask path for MVs). The impact is even higher due to CASSANDRA-12888. Parallel repairs, as e.g. Reaper runs them, make the situation even more unpredictable and increase the "drift" between nodes, e.g. Node A is overloaded but Node B is not because Node A receives a stream from a different repair while Node B does not.
This is a vicious circle driven by several factors:
- Streams put pressure on nodes, especially with large(r) partitions
- hints tend to queue up
- hint delivery puts more pressure
- retransmission of failed hint delivery puts even more pressure
- latencies go up
- stream validations drift
- more (unnecessary) streams
- goto 0
This calculation example is just hypothetical. It *may* happen as calculated, but it totally depends on the data model, cluster dimensions, cluster load, write activity, distribution of writes and repair execution. I don't claim that fixing this issue will remove all MV performance problems, but it may help to remove one impediment in the vicious circle described above.
My proposal is NOT to control flushes. That is far too complicated and won't help: a flush, whenever it happens and whatever range it flushes, may or may not contain a mutation that _should_ be there. What helps is to cut off all data retrospectively at a synchronized, fixed timestamp when executing the validation. You can define a grace period (GP): when validation starts at time VS on the repair coordinator, you expect all mutations created before VS - GP to have arrived no later than VS. That can be done at SSTable scanner level by filtering out all events (cells, tombstones) with timestamps after VS - GP during validation compaction. Something like the opposite of purging tombstones after GCGS.
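To illustrate the idea, here is a minimal sketch with simplified, hypothetical types (Event and ValidationCutoffFilter are illustrative names, not Cassandra's real scanner or validation API): during validation compaction, every cell or tombstone written after the agreed cutoff VS - GP is simply dropped, so all replicas build their merkle trees from the same logical snapshot.
{code:java}
import java.util.List;
import java.util.stream.Collectors;

/**
 * Hypothetical sketch of the proposed cutoff filter (simplified types, not
 * Cassandra's real SSTable scanner API). During validation compaction it
 * drops every event whose write timestamp is later than the agreed cutoff
 * (validation start VS minus grace period GP).
 */
public class ValidationCutoffFilter
{
    // Simplified stand-in for a cell or (range) tombstone marker
    record Event(String key, long writeTimestampMicros, boolean isTombstone) {}

    private final long cutoffMicros; // VS - GP, agreed on by the repair coordinator

    public ValidationCutoffFilter(long validationStartMicros, long gracePeriodMicros)
    {
        this.cutoffMicros = validationStartMicros - gracePeriodMicros;
    }

    /** Keep only events that were written at or before the cutoff. */
    public List<Event> filterForValidation(List<Event> scannedEvents)
    {
        return scannedEvents.stream()
                            .filter(e -> e.writeTimestampMicros() <= cutoffMicros)
                            .collect(Collectors.toList());
    }
}
{code}
The sketch glosses over range tombstone boundary markers, which (as noted in the issue description below) would have to be turned into bounds or dropped entirely, depending on which side of the cutoff their open and close timestamps fall.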
> Inter-node race condition in validation compaction
> --------------------------------------------------
>
> Key: CASSANDRA-12991
> URL: https://issues.apache.org/jira/browse/CASSANDRA-12991
> Project: Cassandra
> Issue Type: Improvement
> Reporter: Benjamin Roth
> Priority: Minor
>
> Problem:
> When a validation compaction is triggered by a repair, it may happen that,
> due to in-flight mutations, the merkle trees differ even though the data is
> consistent.
> Example:
> t = 10000:
> Repair starts, triggers validations
> Node A starts validation
> t = 10001:
> Mutation arrives at Node A
> t = 10002:
> Mutation arrives at Node B
> t = 10003:
> Node B starts validation
> Hashes of nodes A+B will differ, but the data is consistent from a view
> (think of it like a snapshot) at t = 10000.
> Impact:
> Unnecessary streaming happens. This may not have a big impact on low-traffic
> CFs or partitions, but on high-traffic CFs and maybe very big partitions,
> this may have a bigger impact and is a waste of resources.
> Possible solution:
> Build hashes based upon a snapshot timestamp.
> This requires SSTables created after that timestamp to be filtered when doing
> a validation compaction:
> - Cells with timestamp > snapshot time have to be removed
> - Tombstone range markers have to be handled
> - Bounds have to be removed if delete timestamp > snapshot time
> - Boundary markers have to be either changed to a bound or completely
> removed, depending on whether the start and/or the end is affected
> Probably this is a known behaviour. Have there been any discussions about
> this in the past? I did not find a matching issue, so I created this one.
> I am happy about any feedback whatsoever.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)