Michael Semb Wever created CASSANDRA-15922:
----------------------------------------------

             Summary: High CAS failures in NativeAllocator.Region.allocate(..) 
                 Key: CASSANDRA-15922
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-15922
             Project: Cassandra
          Issue Type: Bug
          Components: Local/Memtable
            Reporter: Michael Semb Wever
            Assignee: Michael Semb Wever
         Attachments: NativeAllocatorRegionTest.java, Screen Shot 2020-07-05 at 
13.16.10.png, Screen Shot 2020-07-05 at 13.26.17.png, Screen Shot 2020-07-05 at 
13.35.55.png, Screen Shot 2020-07-05 at 13.37.01.png, Screen Shot 2020-07-05 at 
13.48.16.png, profile_pbdpc23zafsrh_20200702.svg

h6. Problem

The method {{NativeAllocator.Region.allocate(..)}} uses an {{AtomicInteger}} 
for the current offset in the region. Each allocation depends on a 
{{.compareAndSet(..)}} call.

In highly contended environments the CAS failure rate can be high, starving 
writes in a running Cassandra node.

h6. Example

33% of CPU time was spent spinning in the {{NativeAllocator.Region.allocate(..)}} 
loop, due to CAS failures, on nodes under heavy Spark analytics write load.

These nodes have 40 CPU cores and 256GB RAM, with the relevant settings
 - {{memtable_allocation_type: offheap_objects}}
 - {{memtable_offheap_space_in_mb: 5120}}
 - {{concurrent_writes: 160}}

Numerous flamegraphs demonstrate the problem. See attached.

h6. Suggestion: ThreadLocal Regions

One possible solution is to have separate Regions per thread. 
Code-wise this is relatively easy to do, for example replacing 
NativeAllocator:59 
{code}private final AtomicReference<Region> currentRegion = new 
AtomicReference<>();{code}
with
{code}private final ThreadLocal<AtomicReference<Region>> currentRegion = new 
ThreadLocal<>() {...};{code}

But this approach substantially changes the allocation behaviour, with more 
than {{concurrent_writes}} Regions potentially in use at any one time. 
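
For illustration, a minimal standalone sketch of the per-thread approach. The 
class and method names outside {{ThreadLocal}} and {{AtomicReference}} are 
assumptions made for the sketch, not the actual change:
{code}
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicReference;

class ThreadLocalRegionSketch
{
    // stand-in for NativeAllocator.Region, reduced to what this sketch needs
    static class Region
    {
        final int capacity;
        Region(int capacity) { this.capacity = capacity; }
    }

    // every writer thread lazily gets its own current Region reference,
    // so the offset CAS inside Region.allocate(..) is never contended
    private final ThreadLocal<AtomicReference<Region>> currentRegion =
            ThreadLocal.withInitial(AtomicReference::new);

    // all Regions must still be tracked globally so they can be freed later
    private final ConcurrentLinkedQueue<Region> regions = new ConcurrentLinkedQueue<>();

    Region getOrCreateRegion(int regionSize)
    {
        AtomicReference<Region> ref = currentRegion.get();
        Region region = ref.get();
        if (region == null)
        {
            // first allocation on this thread: create and register a new Region
            region = new Region(regionSize);
            ref.set(region);
            regions.add(region);
        }
        return region;
    }
}
{code}
Each writer thread ends up holding its own live Region, which is where the 
extra memory overhead described above comes from.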

h6. Suggestion: Simple Contention Management Algorithm (Constant Backoff)

Another possible solution is to introduce a contention management algorithm 
that a) reduces CAS failures in high contention environments, b) doesn't impact 
normal environments, and c) keeps the allocation strategy of using one region 
at a time.

The research paper [arXiv:1305.5800|https://arxiv.org/abs/1305.5800] describes 
this CAS contention problem and demonstrates a number of algorithms to address 
it. The simplest of these is the Constant Backoff CAS Algorithm.

Applying the Constant Backoff CAS Algorithm involves adding one line of code to 
{{NativeAllocator.Region.allocate(..)}} to park for one nanosecond (or some 
other constant number of nanoseconds) after a CAS failure occurs.
That is...
{code}
                // we raced and lost alloc, try again
                LockSupport.parkNanos(1);
{code}
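
For context, a minimal standalone sketch of where that line lands, modeled on 
{{NativeAllocator.Region.allocate(..)}}; the {{casFailures}} counter is the 
extra field used by the attached test, not upstream code:
{code}
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.locks.LockSupport;

class Region
{
    private final long peer;       // start address of the off-heap region
    private final int capacity;    // region size in bytes
    private final AtomicInteger nextFreeOffset = new AtomicInteger(0);
    private final AtomicInteger allocCount = new AtomicInteger();
    final AtomicLong casFailures = new AtomicLong();   // added for the experiments below

    Region(long peer, int capacity)
    {
        this.peer = peer;
        this.capacity = capacity;
    }

    long allocate(int size)
    {
        while (true)
        {
            int oldOffset = nextFreeOffset.get();

            if (oldOffset + size > capacity)
                return -1;   // region is full; the caller swaps in a new Region

            if (nextFreeOffset.compareAndSet(oldOffset, oldOffset + size))
            {
                // we got the alloc
                allocCount.incrementAndGet();
                return peer + oldOffset;
            }

            // we raced and lost alloc, back off briefly before trying again
            casFailures.incrementAndGet();
            LockSupport.parkNanos(1);
        }
    }
}
{code}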

h6. Constant Backoff CAS Algorithm Experiments

Using the code attached in {{NativeAllocatorRegionTest.java}}, the concurrency 
and CAS failures of {{NativeAllocator.Region.allocate(..)}} can be demonstrated. 

In the attached {{NativeAllocatorRegionTest}} class, which can be run 
standalone, the {{Region}} class is copied from {{NativeAllocator.Region}} 
with an additional {{casFailures}} field. The following two screenshots are 
from data collected with this class on a 6 CPU (12 core) MBP, running the 
{{NativeAllocatorRegionTest.testRegionCAS}} method.
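
A rough driver in the spirit of {{testRegionCAS}}, reusing the {{Region}} 
sketch above; the thread count, allocation size, and region size here are 
illustrative rather than the attached test's values, and the attached test 
additionally varies the park time (fixed at 1ns in the sketch above):
{code}
import java.util.ArrayList;
import java.util.List;

public class RegionCasDriver
{
    public static void main(String[] args) throws InterruptedException
    {
        final int threads = 12;            // illustrative: one per logical core
        final int allocationSize = 16;     // illustrative small allocation
        final Region region = new Region(0L, 1 << 30);   // Region sketch from above

        long start = System.nanoTime();
        List<Thread> workers = new ArrayList<>();
        for (int i = 0; i < threads; i++)
        {
            Thread t = new Thread(() -> {
                // hammer the shared Region until it is full
                while (region.allocate(allocationSize) >= 0) { /* keep allocating */ }
            });
            workers.add(t);
            t.start();
        }
        for (Thread t : workers)
            t.join();

        long fillTimeMs = (System.nanoTime() - start) / 1_000_000;
        System.out.println("threads=" + threads
                           + " casFailures=" + region.casFailures.get()
                           + " fillTimeMs=" + fillTimeMs);
    }
}
{code}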

This attached screenshot shows the number of CAS failures during the life of a 
Region (over ~215 million allocations), using different thread counts and park 
times. It illustrates the reduction in CAS failures from zero park time, 
through orders of magnitude, up to 10000000ns (10ms). The biggest improvement 
is from no backoff to a 1ns park, where CAS failures drop by roughly two 
orders of magnitude. From 10μs onward there is also a significant drop at low 
contention rates.

This attached screenshot shows the time it takes to fill a Region (~215 
million allocations), using different thread counts and park times. The biggest 
improvement is from no backoff to a 1ns park, where performance is one order 
of magnitude faster. From 100μs onward there is a further significant drop, 
especially at low contention rates.

Repeating the test run shows reliably similar results: screenshot1 and 
screenshot2.

h6. Region Per Thread Experiments

Implementing a Region per thread (see the 
{{NativeAllocatorRegionTest.testRegionThreadLocal}} method), we can expect zero 
CAS failures over the life of a Region. For performance, we see times to fill 
up the Region that are two orders of magnitude lower (~420ms).



