[jira] [Updated] (CASSANDRA-15922) High CAS failures in NativeAllocator.Region.allocate(..)
[ https://issues.apache.org/jira/browse/CASSANDRA-15922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Semb Wever updated CASSANDRA-15922: --- Description:

h4. Problem
The method {{NativeAllocator.Region.allocate(..)}} uses an {{AtomicInteger}} for the current offset in the region. Allocation depends on a {{.compareAndSet(..)}} call. In highly contended environments the CAS failure rate can be high, starving writes in a running Cassandra node.

h4. Example
Up to 33% of CPU time has been witnessed stuck in the {{NativeAllocator.Region.allocate(..)}} loop (due to the CAS failures) during a heavy Spark analytics write load. These nodes (40 CPU cores and 256GB RAM) have the relevant settings
- {{memtable_allocation_type: offheap_objects}}
- {{memtable_offheap_space_in_mb: 5120}}
- {{concurrent_writes: 160}}

Numerous flamegraphs demonstrate the problem. See attached [^profile_pbdpc23zafsrh_20200702.svg].

h4. Suggestion: ThreadLocal Regions
One possible solution is to have separate Regions per thread. Code-wise this is relatively easy to do, for example replacing NativeAllocator:59
{code}private final AtomicReference<Region> currentRegion = new AtomicReference<>();{code}
with
{code}private final ThreadLocal<AtomicReference<Region>> currentRegion = new ThreadLocal<>() {...};{code}
But this approach substantially changes the allocation behaviour, with more than {{concurrent_writes}} Regions in use at any one time. For example, with {{concurrent_writes: 160}} that's 160+ regions, each of 1MB.

h4. Suggestion: Simple Contention Management Algorithm (Constant Backoff)
Another possible solution is to introduce a contention management algorithm that a) reduces CAS failures in high contention environments, b) doesn't impact normal environments, and c) keeps the allocation strategy of using one region at a time. The research paper [arXiv:1305.5800|https://arxiv.org/abs/1305.5800] describes this CAS contention problem and demonstrates a number of algorithms to apply.
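The contended path can be sketched as follows. This is a simplified standalone reconstruction of the {{NativeAllocator.Region.allocate(..)}} loop, not the exact Cassandra source; field and method names follow the source, but details are illustrative:

```java
import java.util.concurrent.atomic.AtomicInteger;

// Simplified sketch of NativeAllocator.Region: a bump allocator whose
// current offset is a single AtomicInteger shared by all writer threads.
class Region {
    private final int capacity;
    // current offset within the region; contended by every writer thread
    private final AtomicInteger nextFreeOffset = new AtomicInteger(0);

    Region(int capacity) { this.capacity = capacity; }

    /** Returns the start offset of the allocation, or -1 if the region is full. */
    long allocate(int size) {
        while (true) {
            int oldOffset = nextFreeOffset.get();
            if (oldOffset + size > capacity)
                return -1; // region exhausted; caller must swap in a new Region
            if (nextFreeOffset.compareAndSet(oldOffset, oldOffset + size))
                return oldOffset;
            // CAS failed: another thread won the race. The loop retries
            // immediately, so under high contention many threads spin here,
            // which is where the observed CPU time goes.
        }
    }
}
```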
The simplest of these algorithms is the Constant Backoff CAS Algorithm. Applying it involves adding one line of code to {{NativeAllocator.Region.allocate(..)}} to sleep for one (or some constant number of) nanoseconds after a CAS failure occurs. That is:
{code}
// we raced and lost alloc, try again
LockSupport.parkNanos(1);
{code}

h4. Constant Backoff CAS Algorithm Experiments
Using the code attached in [^NativeAllocatorRegionTest.java], the concurrency and CAS failures of {{NativeAllocator.Region.allocate(..)}} can be demonstrated. The attached class can be run standalone; its {{Region}} class is copied from {{NativeAllocator.Region}}, with a {{casFailures}} field added. The following two screenshots are from data collected from this class on a 6 CPU (12 core) MBP, running the {{NativeAllocatorRegionTest.testRegionCAS}} method.

The first screenshot shows the number of CAS failures during the life of a Region (over ~215 million allocations), using different thread counts and park times. It illustrates the reduction of CAS failures from zero park time, through orders of magnitude, up to 10ms. The biggest improvement is from no algorithm to a park time of 1ns, where CAS failures are ~two orders of magnitude lower. From a park time of 10μs and higher there is a significant drop also at low contention rates.
!Screen Shot 2020-07-05 at 13.16.10.png|width=500px!

The second screenshot shows the time it takes to fill a Region (~215 million allocations), using different thread counts and park times. The biggest improvement is from no algorithm to a park time of 1ns, where performance is one order of magnitude faster. From a park time of 100μs and higher there is an even further significant drop, especially at low contention rates.
!Screen Shot 2020-07-05 at 13.26.17.png|width=500px!
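Applied to the allocate loop, the one-line change lands in the CAS-failure branch. A minimal standalone sketch (names reconstructed for illustration, including the {{casFailures}} counter the test class adds; not the exact Cassandra source):

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.locks.LockSupport;

// Sketch of the allocate(..) loop with the Constant Backoff change applied:
// after a failed compareAndSet, park for a constant 1ns before retrying.
class BackoffRegion {
    private final int capacity;
    private final AtomicInteger nextFreeOffset = new AtomicInteger(0);
    // exposed so experiments can count contention, mirroring the
    // casFailures field added in the attached NativeAllocatorRegionTest.java
    final AtomicInteger casFailures = new AtomicInteger(0);

    BackoffRegion(int capacity) { this.capacity = capacity; }

    long allocate(int size) {
        while (true) {
            int oldOffset = nextFreeOffset.get();
            if (oldOffset + size > capacity)
                return -1; // region exhausted
            if (nextFreeOffset.compareAndSet(oldOffset, oldOffset + size))
                return oldOffset;
            casFailures.incrementAndGet();
            // we raced and lost alloc, try again -- but back off first
            LockSupport.parkNanos(1);
        }
    }
}
```

The backoff costs nothing on the uncontended path: {{parkNanos}} is only reached when a CAS has already failed.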
Repeating the test run shows reliably similar results: [^Screen Shot 2020-07-05 at 13.37.01.png] and [^Screen Shot 2020-07-05 at 13.35.55.png].

h4. Region Per Thread Experiments
Implementing Region-per-thread (see the {{NativeAllocatorRegionTest.testRegionThreadLocal}} method) we can expect zero CAS failures over the life of a Region. For performance we see times to fill up the Region that are two orders of magnitude lower (~420ms).
!Screen Shot 2020-07-05 at 13.48.16.png|width=200px!

h4. Costs
Region-per-thread is an unrealistic solution, as it introduces many new issues and problems, from increased memory use to leaking memory and GC issues. It is better tackled as part of a TPC implementation. The backoff approach is simple and elegant, and seems to improve throughput in all situations. It does introduce context switches, which may impact throughput in some busy scenarios, so this should be tested further.
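The contention measurement can be reproduced with a small harness in the spirit of the attached test class. This is a hypothetical reconstruction (the attached [^NativeAllocatorRegionTest.java] is the reference; class and method names here are invented): worker threads hammer one shared offset counter until the "region" fills, counting CAS failures, with an optional constant backoff.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.locks.LockSupport;

// Hypothetical standalone harness: fill one shared 1MB region from several
// threads and count how many compareAndSet attempts fail along the way.
public class RegionContentionHarness {
    static final int REGION_SIZE = 1 << 20; // 1MB, as in NativeAllocator
    static final int ALLOC_SIZE = 8;

    /** Fills one region with the given thread count; returns total CAS failures. */
    static int run(int threads, long parkNanos) throws InterruptedException {
        AtomicInteger offset = new AtomicInteger(0);
        AtomicInteger casFailures = new AtomicInteger(0);
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                while (true) {
                    int old = offset.get();
                    if (old + ALLOC_SIZE > REGION_SIZE)
                        return; // region full, worker done
                    if (!offset.compareAndSet(old, old + ALLOC_SIZE)) {
                        casFailures.incrementAndGet();
                        if (parkNanos > 0)
                            LockSupport.parkNanos(parkNanos); // constant backoff
                    }
                }
            });
            workers[i].start();
        }
        for (Thread t : workers) t.join();
        return casFailures.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("no backoff:  " + run(8, 0) + " CAS failures");
        System.out.println("1ns backoff: " + run(8, 1) + " CAS failures");
    }
}
```

Absolute numbers vary by machine and thread count, which is why the attached screenshots sweep both dimensions.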
[jira] [Commented] (CASSANDRA-14902) Update the default for compaction_throughput_mb_per_sec
[ https://issues.apache.org/jira/browse/CASSANDRA-14902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17151696#comment-17151696 ] Jeremy Hanna commented on CASSANDRA-14902: --

I assumed that updating to 64 would be uncontroversial because that's the value I know many people change it to (including myself) as a first step/starting point. If we want to do more extensive comparison testing of different values, that's fine, but I think it would depend on the goal. IO is going to be different for every system, and every workload/pattern is going to be somewhat unique. I thought 64 would at least make it not *required* to change the default as a first step.

> Update the default for compaction_throughput_mb_per_sec
> ---
> Key: CASSANDRA-14902
> URL: https://issues.apache.org/jira/browse/CASSANDRA-14902
> Project: Cassandra
> Issue Type: Task
> Components: Local/Compaction, Local/Config
> Reporter: Jeremy Hanna
> Assignee: Jeremy Hanna
> Priority: Low
>
> compaction_throughput_mb_per_sec has been at 16 since probably 0.6 or 0.7, back when a lot of people had to deploy on spinning disks. It seems like it would make sense to update the default to something more reasonable, assuming a reasonably decent SSD and competing IO. One idea that could be bikeshedded to death could be to just default it to 64, simply to avoid people having to change it every time they download a new version, as well as avoid problems with new users thinking that the defaults are sane.

-- This message was sent by Atlassian Jira (v8.3.4#803005)
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Updated] (CASSANDRA-14902) Update the default for compaction_throughput_mb_per_sec
[ https://issues.apache.org/jira/browse/CASSANDRA-14902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jeremy Hanna updated CASSANDRA-14902: - Test and Documentation Plan: Just updated the default value and comments, so it doesn't need much. Status: Patch Available (was: In Progress) The pull request: https://github.com/apache/cassandra/pull/662
[jira] [Assigned] (CASSANDRA-14902) Update the default for compaction_throughput_mb_per_sec
[ https://issues.apache.org/jira/browse/CASSANDRA-14902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jeremy Hanna reassigned CASSANDRA-14902: Assignee: Jeremy Hanna
[jira] [Updated] (CASSANDRA-15922) High CAS failures in NativeAllocator.Region.allocate(..)
[ https://issues.apache.org/jira/browse/CASSANDRA-15922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Semb Wever updated CASSANDRA-15922: --- Description: h4. Problem The method {{NativeAllocator.Region.allocate(..)}} uses an {{AtomicInteger}} for the current offset in the region. Allocations depends on a {{.compareAndSet(..)}} call. In highly contended environments the CAS failures can be high, starving writes in a running Cassandra node. h4. Example Up to 33% of CPU time stuck in the {{NativeAllocator.Region.allocate(..)}} loop (due to the CAS failures) has been witnessed in nodes during a heavy spark analytics write load. These nodes: 40 CPU cores and 256GB ram; have relevant settings - {{memtable_allocation_type: offheap_objects}} - {{memtable_offheap_space_in_mb: 5120}} - {{concurrent_writes: 160}} Numerous flamegraphs demonstrate the problem. See attached [^profile_pbdpc23zafsrh_20200702.svg]. h4. Suggestion: ThreadLocal Regions One possible solution is to have separate Regions per thread. Code wise this is relatively easy to do, for example replacing NativeAllocator:59 {code}private final AtomicReference currentRegion = new AtomicReference<>();{code} with {code}private final ThreadLocal> currentRegion = new ThreadLocal<>() {...};{code} But this approach substantially changes the allocation behaviour, with more than concurrent_writes number of Regions in use at any one time. For example with {{concurrent_writes: 160}} that's 160+ regions, each of 1MB. h4. Suggestion: Simple Contention Management Algorithm (Constant Backoff) Another possible solution is to introduce a contention management algorithm that a) reduces CAS failures in high contention environments, b) doesn't impact normal environments, and c) keeps the allocation strategy of using one region at a time. The research paper [arXiv:1305.5800|https://arxiv.org/abs/1305.5800] describes this contention CAS problem and demonstrates a number of algorithms to apply. 
The simplest of these algorithms is the Constant Backoff CAS Algorithm. Applying the Constant Backoff CAS Algorithm involves adding one line of code to {{NativeAllocator.Region.allocate(..)}} to sleep for one (or some constant number) nanoseconds after a CAS failure occurs. That is... {code} // we raced and lost alloc, try again LockSupport.parkNanos(1); {code} h4. Constant Backoff CAS Algorithm Experiments Using the code attached in NativeAllocatorRegionTest.java the concurrency and CAS failures of {{NativeAllocator.Region.allocate(..)}} can be demonstrated. In the [^NativeAllocatorRegionTest.java] class, which can be run standalone, the {{Region}} class: copied from {{NativeAllocator.Region}}; has also the {{casFailures}} field added. The following two screenshots are from data collected from this class on a 6 CPU (12 core) MBP, running the {{NativeAllocatorRegionTest.testRegionCAS}} method. This attached screenshot shows the number of CAS failures during the life of a Region (over ~215 millions allocations), using different threads and park times. This illustrates the improvement (reduction) of CAS failures from zero park time, through orders of magnitude, up to 1000ns (10ms). The biggest improvement is from no algorithm to a sleep of 1ns where CAS failures are ~two orders of magnitude lower. From a park time 10μs and higher there is a significant drop also at low contention rates. !Screen Shot 2020-07-05 at 13.16.10.png|width=500px! This attached screenshot shows the time it takes to fill a Region (~215 millions allocations), using different threads and park times. The biggest improvement is from no algorithm to a sleep of 1ns where performance is one orders of magnitude faster. From a park time of 100μs and higher there is a even further significant drop, especially at low contention rates. !Screen Shot 2020-07-05 at 13.26.17.png|width=500px! 
Repeating the test runs shows reliably similar results: [^Screen Shot 2020-07-05 at 13.37.01.png] and [^Screen Shot 2020-07-05 at 13.35.55.png].

h4. Region Per Thread Experiments
Implementing Region-per-thread (see the {{NativeAllocatorRegionTest.testRegionThreadLocal}} method) we can expect zero CAS failures over the life of a Region. For performance we see two orders of magnitude lower times to fill up the Region (~420ms).
!Screen Shot 2020-07-05 at 13.48.16.png|width=200px!
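A hedged sketch of the Region-per-thread alternative measured above (class and method names are illustrative, not from the Cassandra source): each writer thread bump-allocates from its own region, so the {{compareAndSet}} is uncontended and never fails.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Sketch of per-thread regions. With one region per thread the CAS is
// uncontended -- the zero-CAS-failure result seen in testRegionThreadLocal.
class ThreadLocalAllocator
{
    static final int REGION_SIZE = 1 << 20; // 1MB, matching the region size discussed above

    // Minimal stand-in for NativeAllocator.Region.
    static class Region
    {
        private final AtomicInteger nextFreeOffset = new AtomicInteger(0);

        int allocate(int size)
        {
            while (true)
            {
                int oldOffset = nextFreeOffset.get();
                if (oldOffset + size > REGION_SIZE)
                    return -1;
                // Only ever raced by its owning thread here, so this
                // succeeds on the first attempt.
                if (nextFreeOffset.compareAndSet(oldOffset, oldOffset + size))
                    return oldOffset;
            }
        }
    }

    private final ThreadLocal<Region> currentRegion = ThreadLocal.withInitial(Region::new);

    int allocate(int size)
    {
        int offset = currentRegion.get().allocate(size);
        if (offset < 0)                      // this thread's region filled up:
        {
            currentRegion.set(new Region()); // swap in a fresh 1MB region
            offset = currentRegion.get().allocate(size);
        }
        return offset;
    }
}
```

The trade-off from the description is visible here: every writer thread pins its own 1MB region, so with {{concurrent_writes: 160}} that's 160+ regions resident at once (and retired regions would still need lifecycle tracking in a real implementation).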
[jira] [Updated] (CASSANDRA-15922) High CAS failures in NativeAllocator.Region.allocate(..)
[ https://issues.apache.org/jira/browse/CASSANDRA-15922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Semb Wever updated CASSANDRA-15922: --- Test and Documentation Plan: existing CI. Status: Patch Available (was: Open)

> High CAS failures in NativeAllocator.Region.allocate(..)
> -
>
> Key: CASSANDRA-15922
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15922
> Project: Cassandra
> Issue Type: Bug
> Components: Local/Memtable
> Reporter: Michael Semb Wever
> Assignee: Michael Semb Wever
> Priority: Normal
> Fix For: 4.0, 3.0.x, 3.11.x
> Attachments: NativeAllocatorRegionTest.java, Screen Shot 2020-07-05 at 13.16.10.png, Screen Shot 2020-07-05 at 13.26.17.png, Screen Shot 2020-07-05 at 13.35.55.png, Screen Shot 2020-07-05 at 13.37.01.png, Screen Shot 2020-07-05 at 13.48.16.png, profile_pbdpc23zafsrh_20200702.svg
[jira] [Updated] (CASSANDRA-15922) High CAS failures in NativeAllocator.Region.allocate(..)
[ https://issues.apache.org/jira/browse/CASSANDRA-15922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Semb Wever updated CASSANDRA-15922: --- Bug Category: Parent values: Degradation(12984), Level 1 values: Slow Use Case(12996) Complexity: Normal Discovered By: User Report Severity: Normal Status: Open (was: Triage Needed)
> Code-wise this is relatively easy to do, for example replacing NativeAllocator:59
> {code}private final AtomicReference<Region> currentRegion = new AtomicReference<>();{code}
> with
> {code}private final ThreadLocal<AtomicReference<Region>> currentRegion = new ThreadLocal<>() {...};{code}
> But this approach substantially changes the allocation behaviour, with more than {{concurrent_writes}} Regions in use at any one time. For example, with {{concurrent_writes: 160}} that's 160+ regions, each of 1MB.
> h4. Suggestion: Simple Contention Management Algorithm (Constant Backoff)
> Another possible solution is to introduce a contention management algorithm that a) reduces CAS failures in high-contention environments, b) doesn't impact normal environments, and c) keeps the allocation strategy of using one region at a time.
> The research paper [arXiv:1305.5800|https://arxiv.org/abs/1305.5800] describes this CAS contention problem and demonstrates a number of algorithms to apply. The simplest of these is the Constant Backoff CAS Algorithm.
> Applying the Constant Backoff CAS Algorithm involves adding one line of code to {{NativeAllocator.Region.allocate(..)}} to park for one (or some constant number of) nanoseconds after a CAS failure occurs.
> That is...
> {code}
> // we raced and lost alloc, try again
> LockSupport.parkNanos(1);
> {code}
> h4. Constant Backoff CAS Algorithm Experiments
> Using the code attached in [^NativeAllocatorRegionTest.java], the concurrency and CAS failures of {{NativeAllocator.Region.allocate(..)}} can be demonstrated.
> In the [^NativeAllocatorRegionTest.java] class, which can be run standalone, the {{Region}} class is copied from {{NativeAllocator.Region}} with a {{casFailures}} field added. The following two screenshots show data collected from this class on a 6 CPU (12 core) MBP, running the {{NativeAllocatorRegionTest.testRegionCAS}} method.
> This attached screenshot shows the number of CAS failures during the life of a Region (over ~215 million allocations), using different thread counts and park times. It illustrates the reduction of CAS failures from zero park time, through orders of magnitude, up to 10ms. The biggest improvement is from no algorithm to a park of 1ns, where CAS failures are ~two orders of magnitude lower. From 10μs there is a significant drop also at low contention rates.
> !Screen Shot 2020-07-05 at 13.16.10.png|width=400px!
> This attached screenshot shows the time it takes to fill a Region (~215 million allocations), using different thread counts and park times. The biggest improvement is from no algorithm to a park of 1ns, where performance is one order of magnitude faster. From 100μs there is an even further significant drop, especially at low contention rates.
> !Screen Shot 2020-07-05 at 13.26.17.png|width=400px!
> Repeating the test run shows reliably similar results: [^Screen Shot 2020-07-05 at 13.37.01.png] and [^Screen Shot 2020-07-05 at 13.35.55.png].
> h4. Region Per Thread Experiments
> Implementing Region-per-thread (see the {{NativeAllocatorRegionTest.testRegionThreadLocal}} method) we can expect zero CAS failures over the life of a Region. For performance we see two orders of magnitude lower times to fill up the Region (~420ms).
> !Screen Shot 2020-07-05 at 13.48.16.png|width=200px!
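The constant-backoff change described above can be sketched in a minimal standalone form. This is not the actual Cassandra class (the class name {{BackoffRegion}} and the {{-1}} "region full" return value are illustrative); it only shows the CAS loop with a {{LockSupport.parkNanos(1)}} after a lost race, as the ticket proposes.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.locks.LockSupport;

// Minimal sketch of a Region allocator with the Constant Backoff CAS
// Algorithm applied: park for a constant number of nanoseconds after
// each CAS failure before retrying.
public class BackoffRegion {
    private final int capacity;
    private final AtomicInteger nextOffset = new AtomicInteger(0);

    public BackoffRegion(int capacity) {
        this.capacity = capacity;
    }

    // Returns the start offset of the allocation, or -1 if the region
    // cannot satisfy the request (caller would move to a new region).
    public int allocate(int size) {
        while (true) {
            int current = nextOffset.get();
            if (current + size > capacity)
                return -1; // region exhausted
            if (nextOffset.compareAndSet(current, current + size))
                return current;
            // we raced and lost alloc: constant backoff, then try again
            LockSupport.parkNanos(1);
        }
    }

    public static void main(String[] args) throws InterruptedException {
        BackoffRegion region = new BackoffRegion(1 << 20); // 1MB region
        AtomicInteger allocated = new AtomicInteger();
        Thread[] threads = new Thread[8];
        for (int t = 0; t < threads.length; t++) {
            threads[t] = new Thread(() -> {
                // each thread allocates 64-byte chunks until the region fills
                while (region.allocate(64) >= 0)
                    allocated.addAndGet(64);
            });
            threads[t].start();
        }
        for (Thread th : threads)
            th.join();
        // every byte of the region is handed out exactly once
        System.out.println(allocated.get() == (1 << 20));
    }
}
```

Because each successful {{compareAndSet}} advances the offset atomically, threads never receive overlapping ranges; the backoff only changes how quickly losers retry, not the allocation semantics.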
[jira] [Commented] (CASSANDRA-15922) High CAS failures in NativeAllocator.Region.allocate(..)
[ https://issues.apache.org/jira/browse/CASSANDRA-15922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17151560#comment-17151560 ] Michael Semb Wever commented on CASSANDRA-15922: Patch for the Constant Backoff CAS Contention Management Algorithm at https://github.com/apache/cassandra/compare/trunk...thelastpickle:mck/trunk_15922
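The ThreadLocal-regions alternative discussed in the ticket can likewise be sketched. This is a hypothetical illustration, not the NativeAllocator implementation: the {{Region}} inner class, {{REGION_SIZE}}, and the region-swap-on-full logic are stand-ins chosen to show why per-thread regions eliminate CAS contention (each thread only ever races against itself) at the cost of one live 1MB region per writer thread.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicReference;

// Sketch of "Region per thread": each thread holds its own current
// Region, so the allocate CAS never fails, but with N writer threads
// there are N regions (each 1MB) live at once.
public class ThreadLocalRegions {
    static final int REGION_SIZE = 1 << 20; // 1MB, as in NativeAllocator

    static class Region {
        final AtomicInteger nextOffset = new AtomicInteger(0);

        int allocate(int size) {
            while (true) {
                int cur = nextOffset.get();
                if (cur + size > REGION_SIZE)
                    return -1; // this region is full
                if (nextOffset.compareAndSet(cur, cur + size))
                    return cur; // uncontended: only this thread touches it
            }
        }
    }

    // one current region per thread, lazily created on first use
    private final ThreadLocal<AtomicReference<Region>> currentRegion =
            ThreadLocal.withInitial(() -> new AtomicReference<>(new Region()));

    int allocate(int size) {
        AtomicReference<Region> ref = currentRegion.get();
        int off = ref.get().allocate(size);
        if (off < 0) {
            // this thread's region filled up: swap in a fresh one
            ref.set(new Region());
            off = ref.get().allocate(size);
        }
        return off;
    }
}
```

The trade-off the ticket calls out falls directly out of this sketch: with {{concurrent_writes: 160}}, the {{ThreadLocal}} holds 160+ independent regions, substantially changing memtable off-heap accounting compared with the single shared region.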
[jira] [Updated] (CASSANDRA-15922) High CAS failures in NativeAllocator.Region.allocate(..)
[ https://issues.apache.org/jira/browse/CASSANDRA-15922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Semb Wever updated CASSANDRA-15922: --- Fix Version/s: 3.11.x 3.0.x 4.0
[jira] [Updated] (CASSANDRA-15922) High CAS failures in NativeAllocator.Region.allocate(..)
[ https://issues.apache.org/jira/browse/CASSANDRA-15922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Semb Wever updated CASSANDRA-15922: --- Description: h4. Problem The method {{NativeAllocator.Region.allocate(..)}} uses an {{AtomicInteger}} for the current offset in the region. Allocations depends on a {{.compareAndSet(..)}} call. In highly contended environments the CAS failures can be high, starving writes in a running Cassandra node. h4. Example CPU time of 33% stuck in the {{NativeAllocator.Region.allocate(..)}} loop, from the CAS failures, has been witnessed in nodes during a heavy spark analytics write load. These nodes: 40 CPU cores and 256GB ram; have relevant settings - {{memtable_allocation_type: offheap_objects}} - {{memtable_offheap_space_in_mb: 5120}} - {{concurrent_writes: 160}} Numerous flamegraphs demonstrate the problem. See attached [^profile_pbdpc23zafsrh_20200702.svg]. h4. Suggestion: ThreadLocal Regions One possible solution is to have separate Regions per thread. Code wise this is relatively easy to do, for example replacing NativeAllocator:59 {code}private final AtomicReference currentRegion = new AtomicReference<>();{code} with {code}private final ThreadLocal> currentRegion = new AtomicReference<>() {...};{code} But this approach substantially changes the allocation behaviour, with more than concurrent_writes number of Regions in use at any one time. For example with {{concurrent_writes: 160}} that's 160+ regions, each of 1MB. h4. Suggestion: Simple Contention Management Algorithm (Constant Backoff) Another possible solution is to introduce a contention management algorithm that a) reduces CAS failures in high contention environments, b) doesn't impact normal environments, and c) keeps the allocation strategy of using one region at a time. The research paper [arXiv:1305.5800|https://arxiv.org/abs/1305.5800] describes this contention CAS problem and demonstrates a number of algorithms to apply. 
The simplest of these algorithms is the Constant Backoff CAS Algorithm. Applying the Constant Backoff CAS Algorithm involves adding one line of code to {{NativeAllocator.Region.allocate(..)}} to sleep for one (or some constant number) nanoseconds after a CAS failure occurs. That is... {code} // we raced and lost alloc, try again LockSupport.parkNanos(1); {code} h4. Constant Backoff CAS Algorithm Experiments Using the code attached in NativeAllocatorRegionTest.java the concurrency and CAS failures of {{NativeAllocator.Region.allocate(..)}} can be demonstrated. In the [^NativeAllocatorRegionTest.java] class, which can be run standalone, the {{Region}} class: copied from {{NativeAllocator.Region}}; has also the {{casFailures}} field added. The following two screenshots are from data collected from this class on a 6 CPU (12 core) MBP, running the {{NativeAllocatorRegionTest.testRegionCAS}} method. This attached screenshot shows the number of CAS failures during the life of a Region (over ~215 millions allocations), using different threads and park times. This illustrates the improvement (reduction) of CAS failures from zero park time, through orders of magnitude, up to 1000ns (10ms). The biggest improvement is from no algorithm to a sleep of 1ns where CAS failures are ~two orders of magnitude lower. From 10μs there is a significant drop also at low contention rates. !Screen Shot 2020-07-05 at 13.16.10.png|width=400px! This attached screenshot shows the time it takes to fill a Region (~215 millions allocations), using different threads and park times. The biggest improvement is from no algorithm to a sleep of 1ns where performance is one orders of magnitude faster. From 100μs there is a even further significant drop, especially at low contention rates. !Screen Shot 2020-07-05 at 13.26.17.png|width=400px! Repeating the test run show reliably similar results: [^Screen Shot 2020-07-05 at 13.37.01.png] and [^Screen Shot 2020-07-05 at 13.35.55.png]. h4. 
Region Per Thread Experiments Implementing Region Per Thread: see the {{NativeAllocatorRegionTest.testRegionThreadLocal}} method; we can expect zero CAS failures of the life of a Region. For performance we see two orders of magnitude lower times to fill up the Region (~420ms). !Screen Shot 2020-07-05 at 13.48.16.png|width=200px! was: h4. Problem The method {{NativeAllocator.Region.allocate(..)}} uses an {{AtomicInteger}} for the current offset in the region. Allocations depends on a {{.compareAndSet(..)}} call. In highly contended environments the CAS failures can be high, starving writes in a running Cassandra node. h4. Example CPU time of 33% stuck in the {{NativeAllocator.Region.allocate(..)}} loop, from the CAS failures, has been witnessed in nodes during a heavy spark analytics write load. These nodes: 40 CPU cores and 256GB ram; have relevant settings - {{mem
[jira] [Updated] (CASSANDRA-15922) High CAS failures in NativeAllocator.Region.allocate(..)
[ https://issues.apache.org/jira/browse/CASSANDRA-15922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Semb Wever updated CASSANDRA-15922: --- Description: h4. Problem The method {{NativeAllocator.Region.allocate(..)}} uses an {{AtomicInteger}} for the current offset in the region. Allocations depends on a {{.compareAndSet(..)}} call. In highly contended environments the CAS failures can be high, starving writes in a running Cassandra node. h4. Example CPU time of 33% stuck in the {{NativeAllocator.Region.allocate(..)}} loop, from the CAS failures, has been witnessed in nodes during a heavy spark analytics write load. These nodes: 40 CPU cores and 256GB ram; have relevant settings - {{memtable_allocation_type: offheap_objects}} - {{memtable_offheap_space_in_mb: 5120}} - {{concurrent_writes: 160}} Numerous flamegraphs demonstrate the problem. See attached [^profile_pbdpc23zafsrh_20200702.svg]. h4. Suggestion: ThreadLocal Regions One possible solution is to have separate Regions per thread. Code wise this is relatively easy to do, for example replacing NativeAllocator:59 {code}private final AtomicReference currentRegion = new AtomicReference<>();{code} with {code}private final ThreadLocal> currentRegion = new AtomicReference<>() {...};{code} But this approach substantially changes the allocation behaviour, with more than concurrent_writes number of Regions in use at any one time. For example with {{concurrent_writes: 160}} that's 160+ regions, each of 1MB. h4. Suggestion: Simple Contention Management Algorithm (Constant Backoff) Another possible solution is to introduce a contention management algorithm that a) reduces CAS failures in high contention environments, b) doesn't impact normal environments, and c) keeps the allocation strategy of using one region at a time. The research paper [arXiv:1305.5800|https://arxiv.org/abs/1305.5800] describes this contention CAS problem and demonstrates a number of algorithms to apply. 
The simplest of these algorithms is the Constant Backoff CAS Algorithm. Applying the Constant Backoff CAS Algorithm involves adding one line of code to {{NativeAllocator.Region.allocate(..)}} to sleep for one (or some constant number) nanoseconds after a CAS failure occurs. That is... {code} // we raced and lost alloc, try again LockSupport.parkNanos(1); {code} h4. Constant Backoff CAS Algorithm Experiments Using the code attached in NativeAllocatorRegionTest.java the concurrency and CAS failures of {{NativeAllocator.Region.allocate(..)}} can be demonstrated. In the [^NativeAllocatorRegionTest.java] class, which can be run standalone, the {{Region}} class: copied from {{NativeAllocator.Region}}; has also the {{casFailures}} field added. The following two screenshots are from data collected from this class on a 6 CPU (12 core) MBP, running the {{NativeAllocatorRegionTest.testRegionCAS}} method. This attached screenshot shows the number of CAS failures during the life of a Region (over ~215 millions allocations), using different threads and park times. This illustrates the improvement (reduction) of CAS failures from zero park time, through orders of magnitude, up to 1000ns (10ms). The biggest improvement is from no algorithm to a sleep of 1ns where CAS failures are ~two orders of magnitude lower. From 10μs there is a significant drop also at low contention rates. !Screen Shot 2020-07-05 at 13.16.10.png|width=200px! This attached screenshot shows the time it takes to fill a Region (~215 millions allocations), using different threads and park times. The biggest improvement is from no algorithm to a sleep of 1ns where performance is one orders of magnitude faster. From 100μs there is a even further significant drop, especially at low contention rates. !Screen Shot 2020-07-05 at 13.26.17.png|width=200px! Repeating the test run show reliably similar results: [^Screen Shot 2020-07-05 at 13.37.01.png] and [^Screen Shot 2020-07-05 at 13.35.55.png]. h4. 
Region Per Thread Experiments Implementing Region Per Thread: see the {{NativeAllocatorRegionTest.testRegionThreadLocal}} method; we can expect zero CAS failures of the life of a Region. For performance we see two orders of magnitude lower times to fill up the Region (~420ms). !Screen Shot 2020-07-05 at 13.48.16.png|width=100px! was: h4. Problem The method {{NativeAllocator.Region.allocate(..)}} uses an {{AtomicInteger}} for the current offset in the region. Allocations depends on a {{.compareAndSet(..)}} call. In highly contended environments the CAS failures can be high, starving writes in a running Cassandra node. h4. Example CPU time of 33% stuck in the {{NativeAllocator.Region.allocate(..)}} loop, from the CAS failures, has been witnessed in nodes during a heavy spark analytics write load. These nodes: 40 CPU cores and 256GB ram; have relevant settings - {{mem
[jira] [Updated] (CASSANDRA-15922) High CAS failures in NativeAllocator.Region.allocate(..)
[ https://issues.apache.org/jira/browse/CASSANDRA-15922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Semb Wever updated CASSANDRA-15922: --- Description: h4. Problem The method {{NativeAllocator.Region.allocate(..)}} uses an {{AtomicInteger}} for the current offset in the region. Allocations depends on a {{.compareAndSet(..)}} call. In highly contended environments the CAS failures can be high, starving writes in a running Cassandra node. h4. Example CPU time of 33% stuck in the {{NativeAllocator.Region.allocate(..)}} loop, from the CAS failures, has been witnessed in nodes during a heavy spark analytics write load. These nodes: 40 CPU cores and 256GB ram; have relevant settings - {{memtable_allocation_type: offheap_objects}} - {{memtable_offheap_space_in_mb: 5120}} - {{concurrent_writes: 160}} Numerous flamegraphs demonstrate the problem. See attached [^profile_pbdpc23zafsrh_20200702.svg]. h4. Suggestion: ThreadLocal Regions One possible solution is to have separate Regions per thread. Code wise this is relatively easy to do, for example replacing NativeAllocator:59 {code}private final AtomicReference currentRegion = new AtomicReference<>();{code} with {code}private final ThreadLocal> currentRegion = new AtomicReference<>() {...};{code} But this approach substantially changes the allocation behaviour, with more than concurrent_writes number of Regions in use at any one time. h4. Suggestion: Simple Contention Management Algorithm (Constant Backoff) Another possible solution is to introduce a contention management algorithm that a) reduces CAS failures in high contention environments, b) doesn't impact normal environments, and c) keeps the allocation strategy of using one region at a time. The research paper [arXiv:1305.5800|https://arxiv.org/abs/1305.5800] describes this contention CAS problem and demonstrates a number of algorithms to apply. The simplest of these algorithms is the Constant Backoff CAS Algorithm. 
Applying the Constant Backoff CAS Algorithm involves adding one line of code to {{NativeAllocator.Region.allocate(..)}} to sleep for one (or some constant number) nanoseconds after a CAS failure occurs. That is... {code} // we raced and lost alloc, try again LockSupport.parkNanos(1); {code} h4. Constant Backoff CAS Algorithm Experiments Using the code attached in NativeAllocatorRegionTest.java the concurrency and CAS failures of {{NativeAllocator.Region.allocate(..)}} can be demonstrated. In the [^NativeAllocatorRegionTest.java] class, which can be run standalone, the {{Region}} class: copied from {{NativeAllocator.Region}}; has also the {{casFailures}} field added. The following two screenshots are from data collected from this class on a 6 CPU (12 core) MBP, running the {{NativeAllocatorRegionTest.testRegionCAS}} method. This attached screenshot shows the number of CAS failures during the life of a Region (over ~215 millions allocations), using different threads and park times. This illustrates the improvement (reduction) of CAS failures from zero park time, through orders of magnitude, up to 1000ns (10ms). The biggest improvement is from no algorithm to a sleep of 1ns where CAS failures are ~two orders of magnitude lower. From 10μs there is a significant drop also at low contention rates. !Screen Shot 2020-07-05 at 13.16.10.png|width=200px! This attached screenshot shows the time it takes to fill a Region (~215 millions allocations), using different threads and park times. The biggest improvement is from no algorithm to a sleep of 1ns where performance is one orders of magnitude faster. From 100μs there is a even further significant drop, especially at low contention rates. !Screen Shot 2020-07-05 at 13.26.17.png|width=200px! Repeating the test run show reliably similar results: [^Screen Shot 2020-07-05 at 13.37.01.png] and [^Screen Shot 2020-07-05 at 13.35.55.png]. h4. 
Region Per Thread Experiments Implementing Region Per Thread: see the {{NativeAllocatorRegionTest.testRegionThreadLocal}} method; we can expect zero CAS failures of the life of a Region. For performance we see two orders of magnitude lower times to fill up the Region (~420ms). !Screen Shot 2020-07-05 at 13.48.16.png|width=100px! was: h4. Problem The method {{NativeAllocator.Region.allocate(..)}} uses an {{AtomicInteger}} for the current offset in the region. Allocations depends on a {{.compareAndSet(..)}} call. In highly contended environments the CAS failures can be high, starving writes in a running Cassandra node. h4. Example CPU time of 33% stuck in the {{NativeAllocator.Region.allocate(..)}} loop, from the CAS failures, has been witnessed in nodes during a heavy spark analytics write load. These nodes: 40 CPU cores and 256GB ram; have relevant settings - {{memtable_allocation_type: offheap_objects}} - {{memtable_offheap_space_in_mb: 51
[jira] [Updated] (CASSANDRA-15922) High CAS failures in NativeAllocator.Region.allocate(..)
[ https://issues.apache.org/jira/browse/CASSANDRA-15922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Semb Wever updated CASSANDRA-15922: --- Description: h4. Problem The method {{NativeAllocator.Region.allocate(..)}} uses an {{AtomicInteger}} for the current offset in the region. Allocations depends on a {{.compareAndSet(..)}} call. In highly contended environments the CAS failures can be high, starving writes in a running Cassandra node. h4. Example CPU time of 33% stuck in the {{NativeAllocator.Region.allocate(..)}} loop, from the CAS failures, has been witnessed in nodes during a heavy spark analytics write load. These nodes: 40 CPU cores and 256GB ram; have relevant settings - {{memtable_allocation_type: offheap_objects}} - {{memtable_offheap_space_in_mb: 5120}} - {{concurrent_writes: 160}} Numerous [^profile_pbdpc23zafsrh_20200702.svg] demonstrate the problem. See attached. h4. Suggestion: ThreadLocal Regions One possible solution is to have separate Regions per thread. Code wise this is relatively easy to do, for example replacing NativeAllocator:59 {code}private final AtomicReference currentRegion = new AtomicReference<>();{code} with {code}private final ThreadLocal> currentRegion = new AtomicReference<>() {...};{code} But this approach substantially changes the allocation behaviour, with more than concurrent_writes number of Regions in use at any one time. h4. Suggestion: Simple Contention Management Algorithm (Constant Backoff) Another possible solution is to introduce a contention management algorithm that a) reduces CAS failures in high contention environments, b) doesn't impact normal environments, and c) keeps the allocation strategy of using one region at a time. The research paper [arXiv:1305.5800|https://arxiv.org/abs/1305.5800] describes this contention CAS problem and demonstrates a number of algorithms to apply. The simplest of these algorithms is the Constant Backoff CAS Algorithm. 
Applying the Constant Backoff CAS Algorithm involves adding one line of code to {{NativeAllocator.Region.allocate(..)}} to sleep for one (or some constant number) nanoseconds after a CAS failure occurs. That is... {code} // we raced and lost alloc, try again LockSupport.parkNanos(1); {code} h4. Constant Backoff CAS Algorithm Experiments Using the code attached in NativeAllocatorRegionTest.java the concurrency and CAS failures of {{NativeAllocator.Region.allocate(..)}} can be demonstrated. In the [^NativeAllocatorRegionTest.java] class, which can be run standalone, the {{Region}} class: copied from {{NativeAllocator.Region}}; has also the {{casFailures}} field added. The following two screenshots are from data collected from this class on a 6 CPU (12 core) MBP, running the {{NativeAllocatorRegionTest.testRegionCAS}} method. This attached screenshot shows the number of CAS failures during the life of a Region (over ~215 millions allocations), using different threads and park times. This illustrates the improvement (reduction) of CAS failures from zero park time, through orders of magnitude, up to 1000ns (10ms). The biggest improvement is from no algorithm to a sleep of 1ns where CAS failures are ~two orders of magnitude lower. From 10μs there is a significant drop also at low contention rates. !Screen Shot 2020-07-05 at 13.16.10.png! This attached screenshot shows the time it takes to fill a Region (~215 millions allocations), using different threads and park times. The biggest improvement is from no algorithm to a sleep of 1ns where performance is one orders of magnitude faster. From 100μs there is a even further significant drop, especially at low contention rates. !Screen Shot 2020-07-05 at 13.26.17.png! Repeating the test run show reliably similar results: !Screen Shot 2020-07-05 at 13.37.01.png! and !Screen Shot 2020-07-05 at 13.35.55.png! . h4. 
Region Per Thread Experiments Implementing Region Per Thread: see the {{NativeAllocatorRegionTest.testRegionThreadLocal}} method; we can expect zero CAS failures of the life of a Region. For performance we see two orders of magnitude lower times to fill up the Region (~420ms). !Screen Shot 2020-07-05 at 13.48.16.png! was: h6. Problem The method {{NativeAllocator.Region.allocate(..)}} uses an {{AtomicInteger}} for the current offset in the region. Allocations depends on a {{.compareAndSet(..)}} call. In highly contended environments the CAS failures can be high, starving writes in a running Cassandra node. h6. Example CPU time of 33% stuck in the {{NativeAllocator.Region.allocate(..)}} loop, from the CAS failures, has been witnessed in nodes during a heavy spark analytics write load. These nodes: 40 CPU cores and 256GB ram; have relevant settings - {{memtable_allocation_type: offheap_objects}} - {{memtable_offheap_space_in_mb: 5120}} - {{concurrent_writes: 160}} Numerous [^p
[jira] [Created] (CASSANDRA-15922) High CAS failures in NativeAllocator.Region.allocate(..)
Michael Semb Wever created CASSANDRA-15922:
Summary: High CAS failures in NativeAllocator.Region.allocate(..)
Key: CASSANDRA-15922
URL: https://issues.apache.org/jira/browse/CASSANDRA-15922
Project: Cassandra
Issue Type: Bug
Components: Local/Memtable
Reporter: Michael Semb Wever
Assignee: Michael Semb Wever
Attachments: NativeAllocatorRegionTest.java, Screen Shot 2020-07-05 at 13.16.10.png, Screen Shot 2020-07-05 at 13.26.17.png, Screen Shot 2020-07-05 at 13.35.55.png, Screen Shot 2020-07-05 at 13.37.01.png, Screen Shot 2020-07-05 at 13.48.16.png, profile_pbdpc23zafsrh_20200702.svg
--
This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org