[ 
https://issues.apache.org/jira/browse/CASSANDRA-15922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Semb Wever updated CASSANDRA-15922:
-------------------------------------------
    Status: Ready to Commit  (was: Review In Progress)

> High CAS failures in NativeAllocator.Region.allocate(..) 
> ---------------------------------------------------------
>
>                 Key: CASSANDRA-15922
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-15922
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Local/Memtable
>            Reporter: Michael Semb Wever
>            Assignee: Michael Semb Wever
>            Priority: Normal
>             Fix For: 4.0, 3.0.x, 3.11.x
>
>         Attachments: NativeAllocatorRegion2Test.java, 
> NativeAllocatorRegionTest.java, Screen Shot 2020-07-05 at 13.16.10.png, 
> Screen Shot 2020-07-05 at 13.26.17.png, Screen Shot 2020-07-05 at 
> 13.35.55.png, Screen Shot 2020-07-05 at 13.37.01.png, Screen Shot 2020-07-05 
> at 13.48.16.png, Screen Shot 2020-07-06 at 11.35.35.png, Screen Shot 
> 2020-07-06 at 11.36.44.png, Screen Shot 2020-07-06 at 13.26.10.png, 
> profile_pbdpc23zafsrh_20200702.svg
>
>
> h4. Problem
> The method {{NativeAllocator.Region.allocate(..)}} uses an {{AtomicInteger}} 
> for the current offset in the region. Each allocation depends on a 
> {{.compareAndSet(..)}} call.
> In highly contended environments CAS failures can be frequent enough to starve 
> writes on a running Cassandra node.
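> For context, the contended loop is roughly of the following shape (a simplified 
> sketch only; names approximate the actual {{NativeAllocator.Region}} code):
> {code}
>     long allocate(int size)
>     {
>         while (true)
>         {
>             int oldOffset = nextFreeOffset.get();
>             if (oldOffset + size > capacity)
>                 return -1; // region exhausted, caller must install a new Region
> 
>             // many threads race here; under contention most CASes fail and the loop spins
>             if (nextFreeOffset.compareAndSet(oldOffset, oldOffset + size))
>                 return peer + oldOffset; // peer = base address of the region's off-heap buffer
>             // we raced and lost alloc, try again
>         }
>     }
> {code}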
> h4. Example
> Up to 33% of CPU time has been witnessed stuck in the 
> {{NativeAllocator.Region.allocate(..)}} loop (due to CAS failures) during a 
> heavy Spark analytics write load.
> These nodes (40 CPU cores, 256GB RAM) have the relevant settings:
>  - {{memtable_allocation_type: offheap_objects}}
>  - {{memtable_offheap_space_in_mb: 5120}}
>  - {{concurrent_writes: 160}}
> Numerous flamegraphs demonstrate the problem. See the attached 
> [^profile_pbdpc23zafsrh_20200702.svg].
> h4. Suggestion: ThreadLocal Regions
> One possible solution is to have separate Regions per thread. 
> Code-wise this is relatively easy to do, for example replacing 
> NativeAllocator:59 
> {code}private final AtomicReference<Region> currentRegion = new 
> AtomicReference<>();{code}
> with
> {code}private final ThreadLocal<AtomicReference<Region>> currentRegion = new 
> ThreadLocal<>() {...};{code}
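> As a minimal sketch (illustration only, not a proposed patch), the elided 
> initializer above could be written with {{ThreadLocal.withInitial}}:
> {code}
>     // Hypothetical per-thread variant: every writer thread gets its own Region holder,
>     // so allocations never race across threads, but up to concurrent_writes Regions
>     // (1MB each) can be live at any one time.
>     private final ThreadLocal<AtomicReference<Region>> currentRegion =
>             ThreadLocal.withInitial(AtomicReference::new);
> {code}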
> But this approach substantially changes the allocation behaviour, with more 
> than {{concurrent_writes}} Regions in use at any one time. For example, with 
> {{concurrent_writes: 160}} that's 160+ regions, each 1MB. 
> h4. Suggestion: Simple Contention Management Algorithm (Constant Backoff)
> Another possible solution is to introduce a contention management algorithm 
> that a) reduces CAS failures in high contention environments, b) doesn't 
> impact normal environments, and c) keeps the allocation strategy of using one 
> region at a time.
> The research paper [arXiv:1305.5800|https://arxiv.org/abs/1305.5800] 
> describes this CAS contention problem and demonstrates a number of algorithms 
> to address it. The simplest of these is the Constant Backoff CAS Algorithm.
> Applying the Constant Backoff CAS Algorithm involves adding one line of code 
> to {{NativeAllocator.Region.allocate(..)}} that parks for one nanosecond (or 
> some other constant number of nanoseconds) after a CAS failure occurs.
> That is...
> {code}
>     // we raced and lost alloc, try again
>     LockSupport.parkNanos(1);
> {code}
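> For clarity, the loop with the backoff applied would look roughly like this 
> (simplified sketch, mirroring the loop shown under Problem; the only change is 
> the {{LockSupport.parkNanos}} call, from {{java.util.concurrent.locks}}):
> {code}
>     long allocate(int size)
>     {
>         while (true)
>         {
>             int oldOffset = nextFreeOffset.get();
>             if (oldOffset + size > capacity)
>                 return -1;
> 
>             if (nextFreeOffset.compareAndSet(oldOffset, oldOffset + size))
>                 return peer + oldOffset;
> 
>             // we raced and lost alloc, back off briefly before trying again
>             LockSupport.parkNanos(1);
>         }
>     }
> {code}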
> h4. Constant Backoff CAS Algorithm Experiments
> Using the code in the attached [^NativeAllocatorRegionTest.java], which can be 
> run standalone, the concurrency and CAS failures of 
> {{NativeAllocator.Region.allocate(..)}} can be demonstrated. Its {{Region}} 
> class is copied from {{NativeAllocator.Region}}, with a {{casFailures}} field 
> added. The following two screenshots show data collected from this class on a 
> 6-core (12 logical cores) MBP, running the 
> {{NativeAllocatorRegionTest.testRegionCAS}} method.
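> (Roughly, the instrumentation amounts to the following, shown here only as an 
> assumed sketch of the attached test:)
> {code}
>     // hypothetical sketch: the copied Region gains a counter that is bumped
>     // whenever the allocate loop loses the CAS race
>     final AtomicLong casFailures = new AtomicLong();
> 
>     // inside the allocate loop, on a failed compareAndSet:
>     casFailures.incrementAndGet();
> {code}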
> The first attached screenshot shows the number of CAS failures during the life 
> of a Region (over ~215 million allocations), using different thread counts and 
> park times. It illustrates the reduction in CAS failures from zero park time, 
> through orders of magnitude, up to 10000000ns (10ms). The biggest improvement 
> is from no algorithm to a park time of 1ns, where CAS failures are roughly two 
> orders of magnitude lower. From a park time of 10μs and higher there is also a 
> significant drop at low contention rates.
>  !Screen Shot 2020-07-05 at 13.16.10.png|width=500px! 
> The second attached screenshot shows the time it takes to fill a Region (~215 
> million allocations), using different thread counts and park times. The biggest 
> improvement is from no algorithm to a park time of 1ns, where performance is 
> one order of magnitude faster. From a park time of 100μs and higher there is an 
> even further significant drop, especially at low contention rates.
>  !Screen Shot 2020-07-05 at 13.26.17.png|width=500px! 
> Repeating the test run reliably shows similar results: see [^Screen Shot 
> 2020-07-05 at 13.37.01.png] and [^Screen Shot 2020-07-05 at 13.35.55.png].
> h4. Region Per Thread Experiments
> Implementing Region-per-thread (see the 
> {{NativeAllocatorRegionTest.testRegionThreadLocal}} method), we can expect zero 
> CAS failures over the life of a Region. For performance, we see times to fill 
> up the Region that are two orders of magnitude lower (~420ms).
>  !Screen Shot 2020-07-05 at 13.48.16.png|width=200px! 
> h4. Costs
> Region-per-thread is an unrealistic solution, as it introduces many new issues 
> and problems, from increased memory use to memory leaks and GC issues. It is 
> better tackled as part of a TPC implementation.
> The backoff approach is simple and elegant, and seems to improve throughput in 
> all situations. It does introduce context switches, which may impact throughput 
> in some busy scenarios, so this should be tested further.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
