[jira] [Commented] (CASSANDRA-15922) High CAS failures in NativeAllocator.Region.allocate(..)

2020-07-06 Thread Benedict Elliott Smith (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17152010#comment-17152010
 ] 

Benedict Elliott Smith commented on CASSANDRA-15922:


+1

> High CAS failures in NativeAllocator.Region.allocate(..) 
> -
>
> Key: CASSANDRA-15922
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15922
> Project: Cassandra
>  Issue Type: Bug
>  Components: Local/Memtable
>Reporter: Michael Semb Wever
>Assignee: Michael Semb Wever
>Priority: Normal
> Fix For: 4.0, 3.0.x, 3.11.x
>
> Attachments: NativeAllocatorRegion2Test.java, 
> NativeAllocatorRegionTest.java, Screen Shot 2020-07-05 at 13.16.10.png, 
> Screen Shot 2020-07-05 at 13.26.17.png, Screen Shot 2020-07-05 at 
> 13.35.55.png, Screen Shot 2020-07-05 at 13.37.01.png, Screen Shot 2020-07-05 
> at 13.48.16.png, Screen Shot 2020-07-06 at 11.35.35.png, Screen Shot 
> 2020-07-06 at 11.36.44.png, Screen Shot 2020-07-06 at 13.26.10.png, 
> profile_pbdpc23zafsrh_20200702.svg
>
>
> h4. Problem
> The method {{NativeAllocator.Region.allocate(..)}} uses an {{AtomicInteger}} 
> for the current offset in the region. Allocations depend on a 
> {{.compareAndSet(..)}} call.
> In highly contended environments the CAS failures can be high, starving 
> writes in a running Cassandra node.
> h4. Example
> Up to 33% of CPU time has been observed stuck in the 
> {{NativeAllocator.Region.allocate(..)}} loop (due to the CAS failures) during 
> a heavy Spark analytics write load.
> These nodes (40 CPU cores and 256GB RAM) have the relevant settings
>  - {{memtable_allocation_type: offheap_objects}}
>  - {{memtable_offheap_space_in_mb: 5120}}
>  - {{concurrent_writes: 160}}
> Numerous flamegraphs demonstrate the problem. See attached 
> [^profile_pbdpc23zafsrh_20200702.svg].
> h4. Suggestion: ThreadLocal Regions
> One possible solution is to have a separate Region per thread.
> Code-wise this is relatively easy to do, for example replacing 
> NativeAllocator:59 
> {code}private final AtomicReference<Region> currentRegion = new 
> AtomicReference<>();{code}
> with
> {code}private final ThreadLocal<Region> currentRegion = new 
> ThreadLocal<>() {...};{code}
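> As an illustrative sketch only (the {{REGION_SIZE}} constant and the 
> {{Region}} constructor arguments here are assumed), the {{...}} body would 
> override {{initialValue()}}:
> {code}
> private final ThreadLocal<Region> currentRegion = new ThreadLocal<Region>()
> {
>     @Override
>     protected Region initialValue()
>     {
>         // assumed helper and constructor; each thread lazily gets its own Region
>         return new Region(MemoryUtil.allocate(REGION_SIZE), REGION_SIZE);
>     }
> };
> {code}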
> But this approach substantially changes the allocation behaviour, with more 
> than {{concurrent_writes}} Regions in use at any one time. For example, with 
> {{concurrent_writes: 160}} that's 160+ regions, each of 1MB. 
> h4. Suggestion: Simple Contention Management Algorithm (Constant Backoff)
> Another possible solution is to introduce a contention management algorithm 
> that a) reduces CAS failures in high contention environments, b) doesn't 
> impact normal environments, and c) keeps the allocation strategy of using one 
> region at a time.
> The research paper [arXiv:1305.5800|https://arxiv.org/abs/1305.5800] 
> describes this contention CAS problem and demonstrates a number of algorithms 
> to apply. The simplest of these algorithms is the Constant Backoff CAS 
> Algorithm.
> Applying the Constant Backoff CAS Algorithm involves adding one line of code 
> to {{NativeAllocator.Region.allocate(..)}} to sleep for one (or some constant 
> number of) nanoseconds after a CAS failure occurs.
> That is...
> {code}
> // we raced and lost alloc, try again
> LockSupport.parkNanos(1);
> {code}
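> For context, a sketch of where this line sits in the allocation loop 
> (following the shape of {{NativeAllocator.Region.allocate(..)}}; illustrative 
> only, not the exact patch):
> {code}
> long allocate(int size)
> {
>     while (true)
>     {
>         int oldOffset = nextFreeOffset.get();
>         if (oldOffset + size > capacity)
>             return -1; // region is full
>         if (nextFreeOffset.compareAndSet(oldOffset, oldOffset + size))
>             return peer + oldOffset;
>         // we raced and lost alloc, try again -- after a constant backoff
>         LockSupport.parkNanos(1);
>     }
> }
> {code}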
> h4. Constant Backoff CAS Algorithm Experiments
> Using the code attached in NativeAllocatorRegionTest.java, the concurrency and 
> CAS failures of {{NativeAllocator.Region.allocate(..)}} can be demonstrated. 
> In the attached [^NativeAllocatorRegionTest.java] class, which can be run 
> standalone, the {{Region}} class (copied from {{NativeAllocator.Region}}) has 
> a {{casFailures}} field added. The following two screenshots show data 
> collected from this class on a 6 CPU (12 core) MBP, running the 
> {{NativeAllocatorRegionTest.testRegionCAS}} method.
> This attached screenshot shows the number of CAS failures during the life of 
> a Region (over ~215 million allocations), using different threads and park 
> times. It illustrates the reduction of CAS failures from zero park time, 
> through orders of magnitude, up to 10ms. The biggest improvement is from no 
> algorithm to a park time of 1ns, where CAS failures are ~two orders of 
> magnitude lower. From a park time of 10μs and higher there is a significant 
> drop also at low contention rates.
>  !Screen Shot 2020-07-05 at 13.16.10.png|width=500px! 
> This attached screenshot shows the time it takes to fill a Region (~215 
> million allocations), using different threads and park times. The biggest 
> improvement is from no algorithm to a park time of 1ns, where performance is 
> one order of magnitude faster. From a park time of 100μs and higher there is 
> an even further significant drop, especially at low contention rates.

[jira] [Commented] (CASSANDRA-15922) High CAS failures in NativeAllocator.Region.allocate(..)

2020-07-06 Thread Robert Stupp (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17152006#comment-17152006
 ] 

Robert Stupp commented on CASSANDRA-15922:
--

+1 (assuming CI looks good and 3.11+3.0 back-ports are clean)


[jira] [Commented] (CASSANDRA-15922) High CAS failures in NativeAllocator.Region.allocate(..)

2020-07-06 Thread Michael Semb Wever (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17151993#comment-17151993
 ] 

Michael Semb Wever commented on CASSANDRA-15922:


Patch updated to:
 - use {{getAndAdd(..)}} instead of {{addAndGet(..)}} for readability
 - remove the {{allocCount}} AtomicInteger field
 - not print negative waste values in the {{toString(..)}} method (when the 
region is full and nextFreeOffset is past capacity); see the sketch below
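For illustration, the clamp might look like this (a sketch only, not the exact 
patch; field names assumed from {{NativeAllocator.Region}}):

{code}
@Override
public String toString()
{
    // with getAndAdd(..) nextFreeOffset can end up past capacity once the
    // region fills, so clamp to avoid reporting a negative waste value
    int waste = Math.max(0, capacity - nextFreeOffset.get());
    return "Region@" + System.identityHashCode(this) + " waste=" + waste + '/' + capacity;
}
{code}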


[jira] [Commented] (CASSANDRA-15922) High CAS failures in NativeAllocator.Region.allocate(..)

2020-07-06 Thread Michael Semb Wever (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17151970#comment-17151970
 ] 

Michael Semb Wever commented on CASSANDRA-15922:


h4. {{addAndGet}} Experiments

Code patch at 
https://github.com/apache/cassandra/compare/trunk...thelastpickle:mck/trunk_15922_1
This patch depends on the {{addAndGet(..)}} call guaranteeing a serialised 
result, i.e. no two overlapping calls can return the same value. AFAIK that is 
how {{AtomicInteger}} works.
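That property is easy to sanity-check standalone, e.g. (a throwaway sketch; run 
with {{-ea}} so the assert fires):

{code}
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// checks that concurrent getAndAdd(..) calls never hand two threads the same
// offset, i.e. every claimed [offset, offset + size) range is disjoint
public class AtomicOffsetCheck
{
    public static void main(String[] args) throws InterruptedException
    {
        AtomicInteger nextFreeOffset = new AtomicInteger();
        Set<Integer> claimed = ConcurrentHashMap.newKeySet();
        Runnable worker = () -> {
            for (int i = 0; i < 100_000; i++)
                assert claimed.add(nextFreeOffset.getAndAdd(8)) : "duplicate offset";
        };
        Thread[] threads = new Thread[8];
        for (int i = 0; i < threads.length; i++)
            (threads[i] = new Thread(worker)).start();
        for (Thread t : threads)
            t.join();
        System.out.println(claimed.size() + " disjoint allocations"); // 800000
    }
}
{code}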

I'm also curious whether we still need the {{allocCount}} AtomicInteger field; 
it appears to be there only for debugging. May I remove it in this patch?

Benchmark code attached in [^NativeAllocatorRegion2Test.java].

The following attached screenshot shows the time it takes to fill a Region 
(~215 million allocations), using different thread counts, comparing the 
original code (compareAndSet), the addAndGet, and the constant backoff 
(parkNanos) approaches. 

The biggest improvement is still the constant backoff algorithm with a park 
time of 1ns, where performance is one order of magnitude faster. The addAndGet 
approach is 2x to 5x faster than the original. As mentioned above, it also 
comes with the benefit of no loop (hence no starvation) and faster performance 
in all workloads.

 !Screen Shot 2020-07-06 at 13.26.10.png|width=600px! 


[jira] [Commented] (CASSANDRA-15922) High CAS failures in NativeAllocator.Region.allocate(..)

2020-07-06 Thread Robert Stupp (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17151969#comment-17151969
 ] 

Robert Stupp commented on CASSANDRA-15922:
--

+1 on {{addAndGet}} (or {{getAndAdd}}, whichever works best).

And I agree, the allocation model we currently have is not great, but as 
you said, it's a ton of work to get it right (less (ideally no) fragmentation, 
no unnecessary tiny allocations, no unnecessary copying, etc.).


[jira] [Commented] (CASSANDRA-15922) High CAS failures in NativeAllocator.Region.allocate(..)

2020-07-06 Thread Michael Semb Wever (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17151952#comment-17151952
 ] 

Michael Semb Wever commented on CASSANDRA-15922:


bq.  I assume in this case one of the problems is that we are allocating huge 
numbers of small objects, so that a small number of threads are competing 
over-and-over again to allocate the same data. We should not be competing for 
each Cell allocation, and instead try to allocate all the buffers for e.g. at 
least a Row at once. 

This is correct. Rows with ~thousands of double cells.

bq. There is perhaps a better alternative: use addAndGet->if instead of 
read->if->compareAndSet, i.e. unconditionally update the pointer, then 
determine whether or not you successfully allocated in the aftermath. This is 
guaranteed to succeed in one step; contention can slow that step down modestly, 
but there is no wasted competition.

Sounds good. Will put it together and test.


[jira] [Commented] (CASSANDRA-15922) High CAS failures in NativeAllocator.Region.allocate(..)

2020-07-06 Thread Benedict Elliott Smith (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17151940#comment-17151940
 ] 

Benedict Elliott Smith commented on CASSANDRA-15922:


There is perhaps a better alternative: use {{addAndGet->if}} instead of 
{{read->if->compareAndSet}}, i.e. unconditionally update the pointer, then 
determine whether or not you successfully allocated in the aftermath.  This is 
guaranteed to succeed in one step; contention can slow that step down modestly, 
but there is no wasted competition.
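In {{Region.allocate(..)}} terms the shape is roughly (a sketch; field names 
assumed, untested):

{code}
long allocate(int size)
{
    // unconditionally claim [newOffset - size, newOffset); no retry loop
    int newOffset = nextFreeOffset.addAndGet(size);
    if (newOffset > capacity)
        return -1; // region exhausted; the caller swaps in a fresh Region
    return peer + newOffset - size; // start of the claimed range
}
{code}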

There is no downside to this approach with the {{NativeAllocator}}, either, 
since if we fail to allocate we always swap the {{Region}}, so consuming more 
than we need when smaller allocations may have been possible is not a problem.  
So we should have made this change a long time ago, really.  

It _might_ be that this approach still sees some slowdown: I assume in this 
case one of the problems is that we are allocating huge numbers of small 
objects, so that a small number of threads are competing over-and-over again to 
allocate the same data.  We should not be competing for each {{Cell}} 
allocation, and instead try to allocate all the buffers for e.g. at least a 
{{Row}} at once.  But this is more involved.  Ideally we would improve the 
allocator itself, which is very under-engineered, but with our threading model 
that's more challenging than we might like.

The _upside_ to this approach is that ordinary workloads should be _improved_, 
and there is no possibility of thread starvation.

The current proposal by contrast introduces much longer windows for thread 
starvation, and _might_ negatively impact tail latencies.  This is a very 
difficult thing for us to rule out, so the work required to demonstrate it is 
performance neutral could be prohibitive.


[jira] [Commented] (CASSANDRA-15922) High CAS failures in NativeAllocator.Region.allocate(..)

2020-07-06 Thread Michael Semb Wever (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17151869#comment-17151869
 ] 

Michael Semb Wever commented on CASSANDRA-15922:


bq. Although the same change probably needs to be applied to 
org.apache.cassandra.utils.memory.SlabAllocator.Region#allocate as well.

Added to patch.

bq. there's a slight issue in the attached NativeAllocatorRegionTest.java 
Region.allocate() method that adds another CAS (casFailures) to every failed 
CAS against nextFreeOffset. It's probably better to count the number of failed 
CAS's in a local variable and add it to this.casFailures when the test's 
Region.allocate() returns.

Fixed and re-running tests. Thanks [~snazy].


[jira] [Commented] (CASSANDRA-15922) High CAS failures in NativeAllocator.Region.allocate(..)

2020-07-06 Thread Robert Stupp (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17151832#comment-17151832
 ] 

Robert Stupp commented on CASSANDRA-15922:
--

Talked to Mick a bit about this offline. The demonstrated effects (in the 
attached charts and SVG) are IMO "good enough" to justify the change.

However, there's a slight issue in the attached 
{{NativeAllocatorRegionTest.java}} {{Region.allocate()}} method that adds 
another CAS ({{casFailures}}) to every failed CAS against {{nextFreeOffset}}. 
It's probably better to count the number of failed CAS's in a local variable 
and add it to {{this.casFailures}} when the test's {{Region.allocate()}} 
returns.
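I.e., roughly (a sketch against the attached test class):

{code}
long allocate(int size)
{
    int failures = 0; // count CAS failures locally, off the shared hot path
    try
    {
        while (true)
        {
            int oldOffset = nextFreeOffset.get();
            if (oldOffset + size > capacity)
                return -1;
            if (nextFreeOffset.compareAndSet(oldOffset, oldOffset + size))
                return peer + oldOffset;
            ++failures;
        }
    }
    finally
    {
        if (failures > 0)
            casFailures.addAndGet(failures); // single atomic update per call
    }
}
{code}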

I think the proposed solution here, adding a constant 
{{LockSupport.parkNanos(1)}}, is fine and "non-intrusive enough". Although the 
same change probably needs to be applied to 
{{org.apache.cassandra.utils.memory.SlabAllocator.Region#allocate}} as well.


[jira] [Commented] (CASSANDRA-15922) High CAS failures in NativeAllocator.Region.allocate(..)

2020-07-05 Thread Michael Semb Wever (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17151560#comment-17151560
 ] 

Michael Semb Wever commented on CASSANDRA-15922:


Patch for CAS Backoff Contention Management Algorithm at 
https://github.com/apache/cassandra/compare/trunk...thelastpickle:mck/trunk_15922

