[
https://issues.apache.org/jira/browse/CASSANDRA-15922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Michael Semb Wever updated CASSANDRA-15922:
-------------------------------------------
Attachment: Screen Shot 2020-07-06 at 13.26.10.png
> High CAS failures in NativeAllocator.Region.allocate(..)
> ---------------------------------------------------------
>
> Key: CASSANDRA-15922
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15922
> Project: Cassandra
> Issue Type: Bug
> Components: Local/Memtable
> Reporter: Michael Semb Wever
> Assignee: Michael Semb Wever
> Priority: Normal
> Fix For: 4.0, 3.0.x, 3.11.x
>
> Attachments: NativeAllocatorRegionTest.java, Screen Shot 2020-07-05
> at 13.16.10.png, Screen Shot 2020-07-05 at 13.26.17.png, Screen Shot
> 2020-07-05 at 13.35.55.png, Screen Shot 2020-07-05 at 13.37.01.png, Screen
> Shot 2020-07-05 at 13.48.16.png, Screen Shot 2020-07-06 at 11.35.35.png,
> Screen Shot 2020-07-06 at 11.36.44.png, Screen Shot 2020-07-06 at
> 13.26.10.png, profile_pbdpc23zafsrh_20200702.svg
>
>
> h4. Problem
> The method {{NativeAllocator.Region.allocate(..)}} uses an {{AtomicInteger}}
> for the current offset in the region. Each allocation depends on a successful
> {{.compareAndSet(..)}} call.
> In highly contended environments CAS failures can be frequent, starving
> writes in a running Cassandra node.
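> For reference, the contended loop looks roughly like the following (a
> paraphrased sketch of {{NativeAllocator.Region.allocate(..)}}; field names
> are approximate, not copied verbatim):
> {code}
> int allocate(int size)
> {
>     while (true)
>     {
>         int oldOffset = nextFreeOffset.get();
>         if (oldOffset + size > capacity)
>             return -1; // region is full
>         // many writer threads race here; only one CAS can win per iteration,
>         // and every loser loops straight back and retries immediately
>         if (nextFreeOffset.compareAndSet(oldOffset, oldOffset + size))
>             return oldOffset;
>         // we raced and lost alloc, try again
>     }
> }
> {code}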
> h4. Example
> Up to 33% of CPU time has been observed stuck in the
> {{NativeAllocator.Region.allocate(..)}} loop (due to the CAS failures) during
> a heavy Spark analytics write load.
> These nodes (40 CPU cores and 256GB RAM) have the relevant settings
> - {{memtable_allocation_type: offheap_objects}}
> - {{memtable_offheap_space_in_mb: 5120}}
> - {{concurrent_writes: 160}}
> Numerous flamegraphs demonstrate the problem. See attached
> [^profile_pbdpc23zafsrh_20200702.svg].
> h4. Suggestion: ThreadLocal Regions
> One possible solution is to have a separate Region per thread.
> Code-wise this is relatively easy to do, for example replacing
> NativeAllocator:59
> {code}private final AtomicReference<Region> currentRegion = new
> AtomicReference<>();{code}
> with
> {code}private final ThreadLocal<AtomicReference<Region>> currentRegion = new
> ThreadLocal<>() {...};{code}
> But this approach substantially changes the allocation behaviour, with more
> Regions than {{concurrent_writes}} in use at any one time. For example, with
> {{concurrent_writes: 160}} that is 160+ Regions, each of 1MB.
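> A minimal sketch of the thread-local variant (the initialiser shown here is
> illustrative only; a real change would also need to cover region recycling
> and cleanup when writer threads die):
> {code}
> // each writer thread gets its own Region reference, so its CAS never races
> private final ThreadLocal<AtomicReference<Region>> currentRegion =
>         ThreadLocal.withInitial(AtomicReference::new);
> {code}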
> h4. Suggestion: Simple Contention Management Algorithm (Constant Backoff)
> Another possible solution is to introduce a contention management algorithm
> that a) reduces CAS failures in high contention environments, b) doesn't
> impact normal environments, and c) keeps the allocation strategy of using one
> region at a time.
> The research paper [arXiv:1305.5800|https://arxiv.org/abs/1305.5800]
> describes this CAS contention problem and demonstrates a number of algorithms
> to apply. The simplest of these is the Constant Backoff CAS Algorithm.
> Applying the Constant Backoff CAS Algorithm involves adding one line of code
> to {{NativeAllocator.Region.allocate(..)}} that parks the thread for one (or
> some constant number of) nanoseconds after a CAS failure occurs.
> That is...
> {code}
> // we raced and lost alloc, try again
> LockSupport.parkNanos(1);
> {code}
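> In context, the call slots into the loop sketched under Problem above,
> immediately after the failed {{compareAndSet(..)}} (the only new import is
> {{java.util.concurrent.locks.LockSupport}}):
> {code}
> if (nextFreeOffset.compareAndSet(oldOffset, oldOffset + size))
>     return oldOffset;
> // we raced and lost alloc, back off briefly before trying again
> LockSupport.parkNanos(1);
> {code}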
> h4. Constant Backoff CAS Algorithm Experiments
> Using the code attached in [^NativeAllocatorRegionTest.java], the concurrency
> and CAS failures of {{NativeAllocator.Region.allocate(..)}} can be
> demonstrated. The class can be run standalone; its {{Region}} class is copied
> from {{NativeAllocator.Region}}, with a {{casFailures}} field added. The
> following two screenshots show data collected from this class on a 6 CPU (12
> core) MBP, running the {{NativeAllocatorRegionTest.testRegionCAS}} method.
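> For readers without the attachment handy, the shape of the experiment is
> roughly the following standalone sketch (class name, sizes and thread count
> here are illustrative; the attached [^NativeAllocatorRegionTest.java] is the
> authoritative version):
> {code}
> import java.util.concurrent.atomic.AtomicInteger;
> import java.util.concurrent.locks.LockSupport;
>
> public class RegionCasSketch
> {
>     static final AtomicInteger nextFreeOffset = new AtomicInteger();
>     static final AtomicInteger casFailures = new AtomicInteger();
>     static final int CAPACITY = 1 << 20; // one 1MB region
>     static final int SIZE = 8;           // fixed allocation size
>     static final long PARK_NANOS = 1;    // 0 disables the backoff
>
>     public static void main(String[] args) throws InterruptedException
>     {
>         Thread[] writers = new Thread[12];
>         long start = System.nanoTime();
>         for (int i = 0; i < writers.length; i++)
>         {
>             writers[i] = new Thread(() -> {
>                 while (true)
>                 {
>                     int old = nextFreeOffset.get();
>                     if (old + SIZE > CAPACITY)
>                         return;                    // region is full
>                     if (nextFreeOffset.compareAndSet(old, old + SIZE))
>                         continue;                  // allocation succeeded
>                     casFailures.incrementAndGet(); // we raced and lost
>                     if (PARK_NANOS > 0)
>                         LockSupport.parkNanos(PARK_NANOS);
>                 }
>             });
>             writers[i].start();
>         }
>         for (Thread t : writers)
>             t.join();
>         System.out.println("millis=" + (System.nanoTime() - start) / 1000000
>                            + " casFailures=" + casFailures.get());
>     }
> }
> {code}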
> This attached screenshot shows the number of CAS failures during the life of
> a Region (over ~215 million allocations), using different thread counts and
> park times. It illustrates the reduction in CAS failures from zero park time,
> through orders of magnitude, up to 10000000ns (10ms). The biggest improvement
> is from no algorithm to a park time of 1ns, where CAS failures are ~two
> orders of magnitude lower. From a park time of 10μs and higher there is also
> a significant drop at low contention rates.
> !Screen Shot 2020-07-05 at 13.16.10.png|width=500px!
> This attached screenshot shows the time it takes to fill a Region (~215
> million allocations), using different thread counts and park times. The
> biggest improvement is from no algorithm to a park time of 1ns, where
> performance is one order of magnitude faster. From a park time of 100μs and
> higher there is an even further significant drop, especially at low
> contention rates.
> !Screen Shot 2020-07-05 at 13.26.17.png|width=500px!
> Repeating the test run shows reliably similar results: [^Screen Shot
> 2020-07-05 at 13.37.01.png] and [^Screen Shot 2020-07-05 at 13.35.55.png].
> h4. Region Per Thread Experiments
> Implementing Region-per-thread (see the
> {{NativeAllocatorRegionTest.testRegionThreadLocal}} method), we can expect
> zero CAS failures over the life of a Region. For performance, we see times to
> fill up the Region that are two orders of magnitude lower (~420ms).
> !Screen Shot 2020-07-05 at 13.48.16.png|width=200px!
> h4. Costs
> Region-per-thread is an unrealistic solution as it introduces many new
> problems, from increased memory use to memory leaks and GC issues. It is
> better tackled as part of a TPC implementation.
> The backoff approach is simple and elegant, and seems to improve throughput
> in all situations. It does introduce context switches which may impact
> throughput in some high-throughput scenarios, so this should be tested
> further.