[ 
https://issues.apache.org/jira/browse/CASSANDRA-15812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sam Tunnicliffe updated CASSANDRA-15812:
----------------------------------------
    Test and Documentation Plan: new unit tests added, existing dtests modified 
 
                         Status: Patch Available  (was: In Progress)

I've pushed a branch 
[here|https://github.com/beobal/cassandra/tree/15812-trunk] with a fix for 
this, along with a couple of minor follow-up commits.

The main fix is to switch the work queue in {{ValidationExecutor}} to a 
{{LinkedBlockingQueue}}, rather than a {{SynchronousQueue}}. When using the 
latter, the executor will spawn new threads until the max pool size is reached, 
but then block the caller until capacity becomes available. Using {{LBQ}} will 
allow additional tasks to be queued but also requires {{corePoolSize}} to be 
set appropriately as once that threshold is reached, new threads are only 
created if the work queue is full. To that end, {{corePoolSize}} is defaulted 
to whatever the value of {{concurrent_validations}} is. In turn, this defaults 
to the value of {{concurrent_compactors}}, but can be overridden. To guard 
against accidentally configuring this way too high (which some existing 
clusters may do as previously {{concurrent_validations}} had limited 
effect), it's capped to the value of {{concurrent_compactors}}. This safety 
check can be disabled via a system property at startup, or JMX on a running 
instance.
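
To illustrate the queueing behaviour described above, here is a minimal standalone sketch using a plain {{java.util.concurrent.ThreadPoolExecutor}}. It is not the code in the branch: the class name, both method names, the {{allowUnsafe}} flag and the treatment of a non-positive value as "unset" are all assumptions made for the example.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

class QueueingPoolSketch {
    // Hypothetical helper: default to concurrent_compactors when unset (assumed to mean <= 0),
    // and cap at concurrent_compactors unless the safety check is disabled.
    static int effectiveValidations(int concurrentValidations, int concurrentCompactors, boolean allowUnsafe) {
        if (concurrentValidations <= 0)
            return concurrentCompactors;
        return allowUnsafe ? concurrentValidations
                           : Math.min(concurrentValidations, concurrentCompactors);
    }

    // Submits `tasks` jobs that all wait on a latch to a pool with corePoolSize == maxPoolSize
    // and an unbounded LinkedBlockingQueue. Returns { pool size, queued task count }.
    static int[] submitAll(int threads, int tasks) {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
            threads, threads, 60, TimeUnit.SECONDS, new LinkedBlockingQueue<>());
        CountDownLatch release = new CountDownLatch(1);
        try {
            for (int i = 0; i < tasks; i++) {
                // execute() returns immediately: either a new core thread is started
                // or the task is placed on the queue, so the caller is never blocked.
                pool.execute(() -> {
                    try {
                        release.await();
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });
            }
            return new int[] { pool.getPoolSize(), pool.getQueue().size() };
        } finally {
            release.countDown(); // let the waiting tasks finish
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        int threads = effectiveValidations(16, 4, false);
        int[] result = submitAll(threads, 10);
        System.out.println("threads=" + result[0] + " queued=" + result[1]);
    }
}
```

Because {{corePoolSize}} equals the maximum here, the pool grows to the configured number of threads before anything is queued, and further submissions queue rather than block the submitting thread.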

The previous behaviour, use of a {{SynchronousQueue}} and {{corePoolSize}} of 
1, is maintained if required. A new yaml option 
{{validation_pool_full_strategy}} controls this, with options {{queue}} & 
{{block}}.
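
As a rough sketch of how the two strategies map onto executor settings (again not the branch's code; the enum, the factory method and the blocking rejection handler are hypothetical, and the handler is just one common way to make a submitter wait on a {{SynchronousQueue}}):

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.RejectedExecutionHandler;
import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

class ValidationPoolSketch {
    // Hypothetical mirror of the two yaml values.
    enum PoolFullStrategy { queue, block }

    static ThreadPoolExecutor create(PoolFullStrategy strategy, int maxThreads) {
        if (strategy == PoolFullStrategy.block) {
            // Previous behaviour: corePoolSize of 1 and a SynchronousQueue. Once maxThreads
            // workers are busy the task is rejected, and this handler makes the submitter
            // wait until a worker is free to take it.
            RejectedExecutionHandler blockCaller = (task, pool) -> {
                if (pool.isShutdown())
                    throw new RejectedExecutionException("pool is shut down");
                try {
                    pool.getQueue().put(task);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    throw new RejectedExecutionException(e);
                }
            };
            return new ThreadPoolExecutor(1, maxThreads, 60, TimeUnit.SECONDS,
                                          new SynchronousQueue<>(), blockCaller);
        }
        // New behaviour: unbounded queue, so corePoolSize must equal the desired parallelism.
        return new ThreadPoolExecutor(maxThreads, maxThreads, 60, TimeUnit.SECONDS,
                                      new LinkedBlockingQueue<>());
    }

    public static void main(String[] args) {
        ThreadPoolExecutor pool = create(PoolFullStrategy.queue, 4);
        System.out.println("core=" + pool.getCorePoolSize() + " max=" + pool.getMaximumPoolSize());
        pool.shutdown();
    }
}
```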

This branch also makes a similar change to the repair command pool in 
{{ActiveRepairService}}. When {{repair_pool_full_strategy}} is set to 
{{queue}}, a {{LinkedBlockingQueue}} is used for the work queue, but 
{{corePoolSize}} is always set to 1. As the work queue is unbounded, no 
additional threads will be created, giving effectively single-threaded behaviour.
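
The single-threaded effect is easy to reproduce outside Cassandra. The following is a minimal sketch with a plain {{ThreadPoolExecutor}}; the class and method names are made up for the example and nothing here is taken from {{ActiveRepairService}}.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

class SingleThreadedPoolSketch {
    // Submits `tasks` jobs that all wait on a latch, then reports how many worker
    // threads the pool actually created.
    static int poolSizeAfter(int corePoolSize, int maxPoolSize, int tasks) {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
            corePoolSize, maxPoolSize, 60, TimeUnit.SECONDS, new LinkedBlockingQueue<>());
        CountDownLatch release = new CountDownLatch(1);
        try {
            for (int i = 0; i < tasks; i++) {
                pool.execute(() -> {
                    try {
                        release.await();
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });
            }
            // Threads beyond corePoolSize are only started when the queue rejects a task,
            // which an unbounded LinkedBlockingQueue never does.
            return pool.getPoolSize();
        } finally {
            release.countDown();
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println("core=1, max=8: " + poolSizeAfter(1, 8, 20) + " thread(s)");
        System.out.println("core=8, max=8: " + poolSizeAfter(8, 8, 20) + " thread(s)");
    }
}
```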

The last thing is to also fix the timeout for {{PREPARE}} messages, which was 
shortened from 1 hour to {{rpc_timeout}} in CASSANDRA-9292, but it seems it was 
inadvertently reset when CASSANDRA-13397 was merged.

||branch||utests||in-jvm dtests||dtests_with_vnodes||dtests_no_vnodes||
|[15812-trunk|https://github.com/beobal/cassandra/tree/15812-trunk]|[jdk8|https://circleci.com/gh/beobal/cassandra/1426], [jdk11|https://circleci.com/gh/beobal/cassandra/1430]|[jdk8|https://circleci.com/gh/beobal/cassandra/1427], [jdk11|https://circleci.com/gh/beobal/cassandra/1425]|[jdk8|https://circleci.com/gh/beobal/cassandra/1431], [jdk11|https://circleci.com/gh/beobal/cassandra/1428]|[jdk8|https://circleci.com/gh/beobal/cassandra/1432], [jdk11|https://circleci.com/gh/beobal/cassandra/1429]|

I've looked at the dtest failures and the failing pytests appear to be flaky 
on trunk and/or being addressed by specific JIRAs. The exception is 
{{repair_tests.repair_test.py::TestRepair::test_dead_sync_initiator}}: I'm 
unable to get a failure from that locally, but I haven't really dug into it yet. 
The one failing in-jvm dtest also seems to have had a few failures on trunk 
recently, so I think that's unrelated.


> Submitting Validation requests can block ANTI_ENTROPY stage 
> ------------------------------------------------------------
>
>                 Key: CASSANDRA-15812
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-15812
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Consistency/Repair
>            Reporter: Sam Tunnicliffe
>            Assignee: Sam Tunnicliffe
>            Priority: Normal
>             Fix For: 4.0-alpha
>
>
>  RepairMessages are handled on Stage.ANTI_ENTROPY, which has a thread pool 
> with core/max capacity of one, ie. we can only process one message at a time. 
>  
> Scheduling validation compactions may however block the stage completely, by 
> blocking on CompactionManager's ValidationExecutor while submitting a new 
> validation compaction, in cases where there are already more validations 
> running than can be executed in parallel.



