[ https://issues.apache.org/jira/browse/CASSANDRA-15812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sam Tunnicliffe updated CASSANDRA-15812:
----------------------------------------
Test and Documentation Plan: new unit tests added, existing dtests modified
Status: Patch Available (was: In Progress)
I've pushed a branch
[here|https://github.com/beobal/cassandra/tree/15812-trunk] with a fix for
this, along with a couple of minor follow-up commits.
The main fix is to switch the work queue in {{ValidationExecutor}} from a
{{SynchronousQueue}} to a {{LinkedBlockingQueue}}. With the former, the
executor spawns new threads until the max pool size is reached, but then
blocks the caller until capacity becomes available. Using an {{LBQ}} allows
additional tasks to be queued, but also requires {{corePoolSize}} to be set
appropriately: once that threshold is reached, new threads are only created
if the work queue is full, and an unbounded queue never is. To that end,
{{corePoolSize}} defaults to the value of {{concurrent_validations}}, which
in turn defaults to {{concurrent_compactors}} but can be overridden. To guard
against accidentally setting this far too high (which some existing clusters
may do, as previously {{concurrent_validations}} had limited effect), it is
capped at the value of {{concurrent_compactors}}. This safety check can be
disabled via a system property at startup, or via JMX on a running instance.
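To make the distinction concrete, here's a minimal standalone sketch (not the
patch itself; the caller-blocking rejection handler below is an assumed
stand-in for the equivalent behaviour in Cassandra's executors):
{code:java}
import java.util.concurrent.*;

// Standalone sketch of the two queue strategies. The blocking rejection
// handler is a hypothetical stand-in for the way Cassandra's executors can
// block the submitting thread, which is what stalls the ANTI_ENTROPY stage.
public class QueueSemanticsSketch
{
    public static void main(String[] args)
    {
        // SynchronousQueue: direct hand-off. Threads are spawned up to
        // maxPoolSize; once all are busy, submission is rejected and the
        // handler blocks the caller until a worker frees up.
        ThreadPoolExecutor blocking = new ThreadPoolExecutor(
            1, 2, 60, TimeUnit.SECONDS, new SynchronousQueue<>(), BLOCK_CALLER);

        // LinkedBlockingQueue: unbounded, so it is never "full" and the pool
        // never grows beyond corePoolSize. corePoolSize must therefore be set
        // to the desired parallelism (concurrent_validations) up front.
        ThreadPoolExecutor queueing = new ThreadPoolExecutor(
            2, 2, 60, TimeUnit.SECONDS, new LinkedBlockingQueue<>());

        Runnable slow = () -> sleep(500);
        for (int i = 0; i < 4; i++)
            queueing.execute(slow);   // returns immediately, excess tasks queue
        for (int i = 0; i < 4; i++)
            blocking.execute(slow);   // 3rd and 4th calls block the caller

        blocking.shutdown();
        queueing.shutdown();
    }

    // Hypothetical equivalent of a caller-blocks rejection policy: put() on
    // a SynchronousQueue blocks until a worker thread takes the task.
    static final RejectedExecutionHandler BLOCK_CALLER = (task, pool) -> {
        try { pool.getQueue().put(task); }
        catch (InterruptedException e) { throw new RejectedExecutionException(e); }
    };

    static void sleep(long millis)
    {
        try { Thread.sleep(millis); } catch (InterruptedException ignored) {}
    }
}
{code}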
The previous behaviour (a {{SynchronousQueue}} with {{corePoolSize}} of 1)
can still be selected if required. A new yaml option,
{{validation_pool_full_strategy}}, controls this, with options {{queue}} and
{{block}}.
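As a rough sketch (the names and structure here are illustrative assumptions,
not the actual patch code), the option maps onto executor construction along
these lines:
{code:java}
import java.util.concurrent.*;

// Illustrative sketch only, not the patch: how validation_pool_full_strategy
// could translate into queue choice and corePoolSize.
public class ValidationPoolSketch
{
    enum PoolFullStrategy { QUEUE, BLOCK }

    static ThreadPoolExecutor buildValidationExecutor(PoolFullStrategy strategy,
                                                      int concurrentValidations)
    {
        if (strategy == PoolFullStrategy.QUEUE)
            // New behaviour: eager core threads, excess work queues up.
            return new ThreadPoolExecutor(concurrentValidations, concurrentValidations,
                                          60, TimeUnit.SECONDS,
                                          new LinkedBlockingQueue<>());

        // block: previous behaviour; one core thread, direct hand-off,
        // callers block once maxPoolSize threads are busy.
        return new ThreadPoolExecutor(1, concurrentValidations,
                                      60, TimeUnit.SECONDS,
                                      new SynchronousQueue<>());
    }
}
{code}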
This branch also makes a similar change to the repair command pool in
{{ActiveRepairService}}. When {{repair_pool_full_strategy}} is set to
{{queue}}, a {{LinkedBlockingQueue}} is used for the work queue, but
{{corePoolSize}} remains 1. As the work queue is unbounded, no additional
threads will be created, giving effectively single-threaded behaviour.
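A self-contained demo of why that is (illustrative, not Cassandra code): with
an unbounded queue and a {{corePoolSize}} of 1, the pool never grows past one
thread, whatever the max size.
{code:java}
import java.util.concurrent.*;

// corePoolSize=1 plus an unbounded LinkedBlockingQueue yields effectively
// single-threaded execution: the queue is never full, so a second thread is
// never started, despite maxPoolSize=8.
public class SingleThreadDemo
{
    public static void main(String[] args) throws InterruptedException
    {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
            1, 8, 60, TimeUnit.SECONDS, new LinkedBlockingQueue<>());

        for (int i = 0; i < 16; i++)
            pool.execute(() -> {
                try { Thread.sleep(100); } catch (InterruptedException ignored) {}
            });

        Thread.sleep(200);
        System.out.println("pool size: " + pool.getPoolSize()); // prints 1
        pool.shutdown();
    }
}
{code}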
The last change is to also fix the timeout for {{PREPARE}} messages, which was
shortened from 1 hour to {{rpc_timeout}} in CASSANDRA-9292, but it seems it was
inadvertently reset when CASSANDRA-13397 was merged.
||branch||utests||in-jvm dtests||dtests_with_vnodes||dtests_no_vnodes||
|[15812-trunk|https://github.com/beobal/cassandra/tree/15812-trunk]|[jdk8|https://circleci.com/gh/beobal/cassandra/1426], [jdk11|https://circleci.com/gh/beobal/cassandra/1430]|[jdk8|https://circleci.com/gh/beobal/cassandra/1427], [jdk11|https://circleci.com/gh/beobal/cassandra/1425]|[jdk8|https://circleci.com/gh/beobal/cassandra/1431], [jdk11|https://circleci.com/gh/beobal/cassandra/1428]|[jdk8|https://circleci.com/gh/beobal/cassandra/1432], [jdk11|https://circleci.com/gh/beobal/cassandra/1429]|
I've looked at the dtest failures; the failing pytests appear to be flaky on
trunk and/or are being addressed by specific JIRAs. The exception is
{{repair_tests.repair_test.py::TestRepair::test_dead_sync_initiator}}; I'm
unable to reproduce the failure locally, but I haven't really dug into it yet.
The one in-jvm dtest failure also seems to have occurred a few times on trunk
recently, so I think it's unrelated.
> Submitting Validation requests can block ANTI_ENTROPY stage
> ------------------------------------------------------------
>
> Key: CASSANDRA-15812
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15812
> Project: Cassandra
> Issue Type: Bug
> Components: Consistency/Repair
> Reporter: Sam Tunnicliffe
> Assignee: Sam Tunnicliffe
> Priority: Normal
> Fix For: 4.0-alpha
>
>
> RepairMessages are handled on Stage.ANTI_ENTROPY, which has a thread pool
> with core/max capacity of one, i.e. we can only process one message at a time.
>
> Scheduling validation compactions may however block the stage completely, by
> blocking on CompactionManager's ValidationExecutor while submitting a new
> validation compaction, in cases where there are already more validations
> running than can be executed in parallel.