[
https://issues.apache.org/jira/browse/CASSANDRA-15922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Michael Semb Wever updated CASSANDRA-15922:
-------------------------------------------
Description:
h4. Problem
The method {{NativeAllocator.Region.allocate(..)}} uses an {{AtomicInteger}} for the current offset in the region. Allocations depend on a {{.compareAndSet(..)}} call.
In highly contended environments the CAS failures can be high, starving writes in a running Cassandra node.
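For context, a simplified sketch of the allocation path in question follows. Field names follow the upstream class, but this is an illustrative reconstruction rather than the exact source:
{code}
import java.util.concurrent.atomic.AtomicInteger;

class Region
{
    private final long peer;       // start address of the off-heap region
    private final int capacity;    // region size in bytes
    private final AtomicInteger nextFreeOffset = new AtomicInteger();

    Region(long peer, int capacity)
    {
        this.peer = peer;
        this.capacity = capacity;
    }

    long allocate(int size)
    {
        while (true)
        {
            int oldOffset = nextFreeOffset.get();
            if (oldOffset + size > capacity)
                return -1;          // region is full, caller must swap in a new one

            // under contention many threads fail this CAS and spin around the loop
            if (nextFreeOffset.compareAndSet(oldOffset, oldOffset + size))
                return peer + oldOffset;

            // we raced and lost alloc, try again
        }
    }
}
{code}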
h4. Example
Up to 33% of CPU time has been observed stuck in the {{NativeAllocator.Region.allocate(..)}} loop (due to the CAS failures) during a heavy Spark analytics write load.
These nodes (40 CPU cores and 256GB RAM) have the relevant settings:
- {{memtable_allocation_type: offheap_objects}}
- {{memtable_offheap_space_in_mb: 5120}}
- {{concurrent_writes: 160}}
Numerous flamegraphs demonstrate the problem. See attached
[^profile_pbdpc23zafsrh_20200702.svg].
h4. Suggestion: ThreadLocal Regions
One possible solution is to have separate Regions per thread.
Code-wise this is relatively easy to do, for example replacing NativeAllocator:59
{code}private final AtomicReference<Region> currentRegion = new AtomicReference<>();{code}
with
{code}private final ThreadLocal<AtomicReference<Region>> currentRegion = new ThreadLocal<>() {...};{code}
But this approach substantially changes the allocation behaviour, with more than {{concurrent_writes}} Regions in use at any one time. For example, with {{concurrent_writes: 160}} that's 160+ regions, each 1MB.
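For illustration, one possible shape of that per-thread initialization, using {{ThreadLocal.withInitial}} rather than the anonymous subclass elided above. This is only an assumed sketch, not tested code:
{code}
// each thread lazily gets its own holder, and therefore its own Region,
// so no two threads ever CAS against the same offset counter
private final ThreadLocal<AtomicReference<Region>> currentRegion =
        ThreadLocal.withInitial(AtomicReference::new);
{code}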
h4. Suggestion: Simple Contention Management Algorithm (Constant Backoff)
Another possible solution is to introduce a contention management algorithm
that a) reduces CAS failures in high contention environments, b) doesn't impact
normal environments, and c) keeps the allocation strategy of using one region
at a time.
The research paper [arXiv:1305.5800|https://arxiv.org/abs/1305.5800] describes this CAS contention problem and demonstrates a number of algorithms to apply. The simplest of these algorithms is the Constant Backoff CAS Algorithm.
Applying the Constant Backoff CAS Algorithm involves adding one line of code to {{NativeAllocator.Region.allocate(..)}} to sleep for one (or some constant number of) nanoseconds after a CAS failure occurs.
That is...
{code}
// we raced and lost alloc, try again
LockSupport.parkNanos(1);
{code}
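In the context of the loop sketched above in the Problem section, the backoff sits on the failed-CAS path, roughly as follows (again an illustrative sketch rather than a patch):
{code}
import java.util.concurrent.locks.LockSupport;

long allocate(int size)
{
    while (true)
    {
        int oldOffset = nextFreeOffset.get();
        if (oldOffset + size > capacity)
            return -1;

        if (nextFreeOffset.compareAndSet(oldOffset, oldOffset + size))
            return peer + oldOffset;

        // we raced and lost alloc, try again
        LockSupport.parkNanos(1);   // constant backoff: yield the CPU briefly before retrying
    }
}
{code}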
h4. Constant Backoff CAS Algorithm Experiments
Using the code attached in NativeAllocatorRegionTest.java, the concurrency and CAS failures of {{NativeAllocator.Region.allocate(..)}} can be demonstrated.
In the attached [^NativeAllocatorRegionTest.java] class, which can be run standalone, the {{Region}} class (copied from {{NativeAllocator.Region}}) has a {{casFailures}} field added. The following two screenshots are from data collected from this class on a 6 CPU (12 core) MacBook Pro, running the {{NativeAllocatorRegionTest.testRegionCAS}} method.
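The attached test is not reproduced here, but a minimal, hypothetical harness along the same lines might look like the following: several threads hammer one instrumented {{Region}} until it fills, and the CAS-failure count is reported at the end. Class and field names here are assumptions for illustration only:
{code}
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicLong;

public class RegionCasDemo
{
    static class Region
    {
        final int capacity;
        final AtomicInteger nextFreeOffset = new AtomicInteger();
        final AtomicLong casFailures = new AtomicLong();        // instrumentation only

        Region(int capacity) { this.capacity = capacity; }

        int allocate(int size)
        {
            while (true)
            {
                int oldOffset = nextFreeOffset.get();
                if (oldOffset + size > capacity)
                    return -1;                                  // region full
                if (nextFreeOffset.compareAndSet(oldOffset, oldOffset + size))
                    return oldOffset;
                casFailures.incrementAndGet();                  // we raced and lost alloc
                // a LockSupport.parkNanos(parkTime) here would test the backoff variants
            }
        }
    }

    public static void main(String[] args) throws InterruptedException
    {
        Region region = new Region(1 << 20);                    // 1MB region, as in NativeAllocator
        int threads = Runtime.getRuntime().availableProcessors();
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++)
        {
            workers[i] = new Thread(() -> { while (region.allocate(4) != -1); });
            workers[i].start();
        }
        for (Thread t : workers)
            t.join();
        System.out.println("CAS failures: " + region.casFailures.get());
    }
}
{code}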
The first attached screenshot shows the number of CAS failures during the life of a Region (over ~215 million allocations), using different thread counts and park times. It illustrates the reduction in CAS failures as the park time grows from zero, through orders of magnitude, up to 10,000,000ns (10ms). The biggest improvement is from no algorithm to a park time of 1ns, where CAS failures are ~two orders of magnitude lower. From a park time of 10μs and higher there is a significant drop also at low contention rates.
!Screen Shot 2020-07-05 at 13.16.10.png|width=500px!
The second attached screenshot shows the time it takes to fill a Region (~215 million allocations), using different thread counts and park times. The biggest improvement is from no algorithm to a park time of 1ns, where performance is one order of magnitude faster. From a park time of 100μs and higher there is an even further significant drop, especially at low contention rates.
!Screen Shot 2020-07-05 at 13.26.17.png|width=500px!
Repeating the test run shows reliably similar results: [^Screen Shot 2020-07-05 at 13.37.01.png] and [^Screen Shot 2020-07-05 at 13.35.55.png].
h4. Region Per Thread Experiments
Implementing Region per Thread (see the {{NativeAllocatorRegionTest.testRegionThreadLocal}} method), we can expect zero CAS failures over the life of a Region. For performance we see times to fill up the Region that are two orders of magnitude lower (~420ms).
!Screen Shot 2020-07-05 at 13.48.16.png|width=200px!
h4. Costs
Region per Thread is an unrealistic solution as it introduces many new issues and problems, from increased memory use to memory leaks and GC issues. It is better tackled as part of a TPC implementation.
The backoff approach is simple and elegant, and seems to improve throughput in all situations. It does introduce context switches which may impact throughput in some busy scenarios, so this should be tested further.
> High CAS failures in NativeAllocator.Region.allocate(..)
> ---------------------------------------------------------
>
> Key: CASSANDRA-15922
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15922
> Project: Cassandra
> Issue Type: Bug
> Components: Local/Memtable
> Reporter: Michael Semb Wever
> Assignee: Michael Semb Wever
> Priority: Normal
> Fix For: 4.0, 3.0.x, 3.11.x
>
> Attachments: NativeAllocatorRegionTest.java, Screen Shot 2020-07-05
> at 13.16.10.png, Screen Shot 2020-07-05 at 13.26.17.png, Screen Shot
> 2020-07-05 at 13.35.55.png, Screen Shot 2020-07-05 at 13.37.01.png, Screen
> Shot 2020-07-05 at 13.48.16.png, profile_pbdpc23zafsrh_20200702.svg
>
>