Benedict Elliott Smith created CASSANDRA-20069:
--------------------------------------------------
Summary: Improve Accord Work Queueing (and misc perf improvements)
Key: CASSANDRA-20069
URL: https://issues.apache.org/jira/browse/CASSANDRA-20069
Project: Cassandra
Issue Type: Improvement
Reporter: Benedict Elliott Smith
Work can now be cancelled, and commands serving remote requests now cancel
queued work if they time out. This prevents runaway work growth, as queued work
no longer outlives its useful lifespan.
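As a rough illustration of the cancellation behaviour described above, the following is a minimal sketch, not the Accord implementation. All class and method names (CancellableWork, WorkQueueSketch, submit, drain) are hypothetical. It assumes the request owner flips a cancellation flag on timeout and that the worker simply skips cancelled entries when it dequeues them.

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical names throughout; this is not the Accord API.
final class CancellableWork {
    private final Runnable body;
    private final AtomicBoolean cancelled = new AtomicBoolean(false);

    CancellableWork(Runnable body) { this.body = body; }

    // Called by the request owner, e.g. when the remote request times out.
    void cancel() { cancelled.set(true); }

    boolean isCancelled() { return cancelled.get(); }

    void run() { body.run(); }
}

final class WorkQueueSketch {
    private final Queue<CancellableWork> queue = new ArrayDeque<>();

    // Enqueue work and hand back a handle the caller can cancel later.
    synchronized CancellableWork submit(Runnable body) {
        CancellableWork work = new CancellableWork(body);
        queue.add(work);
        return work;
    }

    // Runs everything that is still wanted; returns the number of tasks executed.
    int drain() {
        int executed = 0;
        while (true) {
            CancellableWork next;
            synchronized (this) { next = queue.poll(); }
            if (next == null) return executed;
            if (next.isCancelled()) continue; // skip work nobody is waiting for
            next.run();
            ++executed;
        }
    }
}
```

In this sketch a cancelled entry costs only a dequeue and a flag check, which is the property that keeps the backlog from growing with work that has already timed out.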
CommandStores, threads and caches are now only loosely coupled, so the
following can be tuned independently:
* the total number of threads
* the number of work queue/cache units we distribute the threads amongst
* the number of CommandStores per table/shard (which are distributed between
queue/cache units, the threads of which will execute CommandStore work)
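To make the three independent knobs concrete, here is a minimal sketch of one possible mapping. The class name TopologySketch, the round-robin assignment, and the requirement that every unit has at least one thread are all assumptions made for illustration; the ticket does not say how stores or threads are actually assigned to queue/cache units.

```java
// Hypothetical sketch: three independently chosen sizes and a simple
// round-robin mapping of stores and threads onto queue/cache units.
final class TopologySketch {
    final int threads; // total number of worker threads
    final int units;   // number of queue/cache units
    final int stores;  // number of CommandStores

    TopologySketch(int threads, int units, int stores) {
        if (units < 1 || stores < 1)
            throw new IllegalArgumentException("units and stores must be positive");
        if (threads < units) // assumption: each unit needs at least one thread
            throw new IllegalArgumentException("need at least one thread per unit");
        this.threads = threads;
        this.units = units;
        this.stores = stores;
    }

    // Which unit queues (and caches state for) the given store.
    int unitForStore(int storeId) { return storeId % units; }

    // Which unit the given thread services.
    int unitForThread(int threadId) { return threadId % units; }
}
```

For example, with 8 threads, 2 units and 6 stores, each unit would be serviced by 4 threads and would own 3 stores under this mapping.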
Mutual exclusivity is managed separately for the queue/cache unit and each
CommandStore, and the locks are held only for as long as necessary so we can
have multiple threads servicing the same CommandStore(s). Given this threading
model, it is now possible for Accord threads to perform all of the
loading/saving work, reducing the queueing delay and context switching costs;
this is also configurable. The default configuration is now to do this work
on the Accord work pool (as is already the case with Paxos for most IO). Accord
state reads can be scheduled and completed from any thread, so that we do not
incur multiple queue delays when preparing work for a command store. There are
further improvements that can be made here to permit the event loop to answer
in-cache queries, and under a future async-io model to directly submit read
requests.
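The separate, briefly held locks described above might look roughly like the sketch below. This is an illustration under stated assumptions rather than the actual Accord code: QueueUnitSketch, StoreSketch and TaskSketch are hypothetical names, and the placement of the loading step outside both locks is an assumption about where such work could go.

```java
import java.util.ArrayDeque;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical stand-in for a CommandStore with its own lock.
final class StoreSketch {
    final ReentrantLock lock = new ReentrantLock();
    int applied; // guarded by lock
}

// Hypothetical unit of work targeting one store.
final class TaskSketch {
    final StoreSketch store;
    TaskSketch(StoreSketch store) { this.store = store; }
}

// Hypothetical queue/cache unit shared by several worker threads.
final class QueueUnitSketch {
    private final ReentrantLock lock = new ReentrantLock();
    private final ArrayDeque<TaskSketch> queue = new ArrayDeque<>();

    void submit(TaskSketch task) {
        lock.lock();
        try { queue.add(task); } finally { lock.unlock(); }
    }

    // Runs at most one task; returns false if the queue was empty.
    boolean runOne() {
        TaskSketch task;
        lock.lock(); // unit lock held only while dequeuing
        try { task = queue.poll(); } finally { lock.unlock(); }
        if (task == null) return false;

        // Assumption: any loading/saving would happen here, holding neither lock.

        task.store.lock.lock(); // store lock held only while mutating store state
        try { task.store.applied++; } finally { task.store.lock.unlock(); }
        return true;
    }
}
```

Because neither lock is held across the whole task, several threads calling runOne() on the same unit can overlap everything except the short dequeue and apply steps.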
Misc perf improvements:
* Write directly to Memtable for cache evictions, using putIfAbsent immediately
* Use LCS on CommandsForKey table for faster reads
* Flatten UUID fields into TableId to reduce indirection for comparisons
* Introduce asymmetric comparisons to BTreeMap for faster schema lookups
* Read TimestampsForKey directly to avoid parsing CQL
* Save summary information in RedundantBefore to short-circuit executions
* Send Stable message only as necessary on Execute, to reduce load on replicas
* Ensure journal entries are immediately visible to replay without handing
over to another thread, so that the normal path avoids context switch and
listener overheads
* Journal periodic mode should fsync only as necessary
* Use OpOrder to guard Journal Segment read access (avoiding taking individual
references, which can be costly)
* AccordCache can “shrink” (serialise) entries instead of evicting, to
increase effective capacity (evicting any already-shrunk entries that are
encountered)
* EphemeralRead cache entries are evicted only on timeout of the remote
request, and are not otherwise persisted
* Work is scheduled by first arrival time, not by read completion time; work
that is slow to read jumps the queue once data is in memory to serve it, to
reduce latency variability
* Some operations may now partially execute without waiting for all state to
be brought into memory
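The arrival-time scheduling item in the list above can be sketched as follows. This is a minimal illustration with hypothetical names (ArrivalOrderedQueue, Item, arrive, readCompleted, pollNext); it assumes each piece of work is stamped with a sequence number on arrival and only becomes runnable once its read completes.

```java
import java.util.Comparator;
import java.util.PriorityQueue;

// Hypothetical sketch: runnable work is ordered by when it first arrived,
// not by when the state it needs finished loading.
final class ArrivalOrderedQueue {
    static final class Item {
        final long arrival; // sequence number assigned on arrival
        final String name;
        Item(long arrival, String name) { this.arrival = arrival; this.name = name; }
    }

    private long nextArrival;
    private final PriorityQueue<Item> ready =
        new PriorityQueue<>(Comparator.comparingLong((Item i) -> i.arrival));

    // Record arrival; the item is not runnable until its read completes.
    Item arrive(String name) { return new Item(nextArrival++, name); }

    // The read for this item has finished, so it may now be scheduled.
    void readCompleted(Item item) { ready.add(item); }

    // Returns the name of the earliest-arrived runnable item, or null if none.
    String pollNext() {
        Item next = ready.poll();
        return next == null ? null : next.name;
    }
}
```

If an item that arrived first finishes its read after a later arrival, it is still polled ahead of the later one once both are ready, which is the "jumps the queue" behaviour the list describes.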