[ https://issues.apache.org/jira/browse/CASSANDRA-7882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14146132#comment-14146132 ]

Jay Patel edited comment on CASSANDRA-7882 at 9/24/14 4:31 PM:
---------------------------------------------------------------

[~benedict], I've attached the first cut. Please help review.

Below are some code changes and design choices/trade-offs. 

* Wait-free region scaling and allocations:

** Instead of one global queue of 1 MB race-allocated regions, there is now a 
set of global queues, one for each region size (8K, 16K, ... 1MB). All queues 
are global (not per memtable), so memtables across all tables can reuse 
race-allocated regions. Race-allocated regions are never discarded during 
memtable flushes, same as before (see the sketch after this list).

** The thread that wins the race of setting (CAS) a newly allocated region as 
the current region also scales the region size (if it is not already at the 
max). This avoids the need for extra synchronization to scale the region size 
atomically.

* Region size per memtable:
The region size is now per memtable instead of global. From what I understand, 
each memtable creates its own NativeAllocator object, so keeping the region 
size as a member variable of NativeAllocator makes it per memtable. Please let 
me know if that is not the case, so I can fix it accordingly.
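
For reference, here is a minimal sketch of the data structures described above. 
The names MIN_REGION_SIZE, RACE_ALLOCATED, regionSize and currentRegion are 
illustrative and may not match the attached patch exactly (SCALE_FACTOR and 
MAX_REGION_SIZE are the constants referenced later in this comment):

import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicReference;

// Simplified stand-in for the allocator's off-heap Region (only capacity shown).
class Region
{
    final int capacity;
    Region(int capacity) { this.capacity = capacity; }
}

public class NativeAllocator
{
    static final int MIN_REGION_SIZE = 8 * 1024;
    static final int MAX_REGION_SIZE = 1024 * 1024;
    static final int SCALE_FACTOR = 2;

    // One global queue per region size (8K, 16K, ..., 1MB), shared by all
    // memtables, so race-allocated regions of any size can be reused rather
    // than discarded.
    static final Map<Integer, ConcurrentLinkedQueue<Region>> RACE_ALLOCATED = new HashMap<>();
    static
    {
        for (int size = MIN_REGION_SIZE; size <= MAX_REGION_SIZE; size *= SCALE_FACTOR)
            RACE_ALLOCATED.put(size, new ConcurrentLinkedQueue<Region>());
    }

    // Instance (not static) fields, so each memtable's allocator scales its
    // region size independently of other memtables.
    volatile int regionSize = MIN_REGION_SIZE;
    final AtomicReference<Region> currentRegion = new AtomicReference<>();
}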

I don’t think the following can be an issue, but I want to share it in case you 
see any problems:

* In the race of allocating & setting the current region, there is a very 
slight chance of setting a current region of the same size during the scale 
phase. Consider this case:
Thread 1: allocates a 16K region but has not yet reached the CAS for the 
current region.
Thread 2: allocates 16K and does the CAS for the current region. The current 
region gets filled up and is set back to null by the allocate() method.
Thread 1: reaches the CAS. This sets the current region to the 16K region, 
instead of 32K.
This is a corner case, and even if it happens there is no harm; the next 
allocation will jump directly to 64K to catch up, since we never miss scaling 
the region size.
To further guard against this, I added the check below in the code just before 
the CAS (see also the sketch that follows):

if (region.capacity == regionSize * SCALE_FACTOR || region.capacity == MAX_REGION_SIZE)
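
Roughly, the winner path could look like the sketch below, building on the 
fields sketched earlier. trySwapRegion is a hypothetical method name, not 
necessarily what the patch uses:

// Try to install a freshly allocated region as the current region.
private Region trySwapRegion(Region region)
{
    // Guard described above: only install the region if its capacity is the
    // expected next size (or the max), so a same-size region left over from a
    // lost race cannot be installed in place of the properly scaled one.
    if (region.capacity == regionSize * SCALE_FACTOR || region.capacity == MAX_REGION_SIZE)
    {
        if (currentRegion.compareAndSet(null, region))
        {
            // Only the CAS winner scales the per-memtable region size, so no
            // extra synchronization is needed for the scaling itself.
            if (regionSize < MAX_REGION_SIZE)
                regionSize *= SCALE_FACTOR;
            return region;
        }
    }
    // Lost the race (or capacity mismatch): park the region in the global
    // queue for its size so any memtable can reuse it later.
    RACE_ALLOCATED.get(region.capacity).add(region);
    return currentRegion.get();
}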

* Unslab allocation (if any) will happen only after a region hits its max of 
1 MB, same as before. Until then, a big payload can quickly grow the region 
size and allocate new regions. I think this behavior is good, but a slight side 
effect is that we may end up with a few partially filled (or unfilled) regions, 
depending on the traffic and payload type. One option is to have an 
unslabbed-allocation threshold and count for each region size, in addition to 
MAX_CLONED_SIZE for the 1MB region size. For instance, with an 8K region, 
anything beyond 4K (the threshold) would be unslabbed, and if we see 1000 (the 
count) such unslab allocations, we increase the region size; a rough sketch 
follows this item. But I'm not too excited about this, since a flush may happen 
before that anyway and reset the region size. I don't see much issue leaving 
this as is for now, but let me know if you think we need to address it.
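
Purely for illustration, that threshold/count idea could look something like 
the sketch below. None of this is in the attached patch, and UNSLAB_COUNT_TRIGGER 
and shouldUnslab are made-up names:

// Hypothetical per-region-size unslab threshold/count (NOT in the patch).
// Allocations larger than half the current region size go off-slab, and after
// enough of them the region size is scaled up early. The counter is approximate
// here; a real version would need an atomic counter or would have to tolerate races.
static final int UNSLAB_COUNT_TRIGGER = 1000;
int unslabbedCount = 0;

boolean shouldUnslab(int allocationSize)
{
    if (allocationSize > regionSize / 2)   // e.g. > 4K while the region size is 8K
    {
        if (++unslabbedCount >= UNSLAB_COUNT_TRIGGER && regionSize < MAX_REGION_SIZE)
        {
            regionSize *= SCALE_FACTOR;    // scale up early instead of waiting for fills
            unslabbedCount = 0;
        }
        return true;                       // allocate this one off-slab
    }
    return false;
}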

FYI, the line below will print how region allocation works if you want to test. 
I did a quick test with 1 to 100 static tables and payload sizes from 100 bytes 
to 2 KB. In a week or two, I'm planning to try it out with tens of thousands of 
tables, including longevity tests.

logger.info("{} size region allocated in {}", regionSize, this);

This change takes care of only off-heap objects. For the other slab allocator 
(on-heap?), I'm not sure region scaling makes sense.

Todo: Convert multiplication to shifting. Change logger.info to logger.trace. 
Any refactoring or other suggestions you have are welcome.



> Memtable slab allocation should scale logarithmically to improve occupancy 
> rate
> -------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-7882
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-7882
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>            Reporter: Jay Patel
>            Assignee: Jay Patel
>              Labels: performance
>             Fix For: 2.1.1
>
>         Attachments: trunk-7882.txt
>
>
> CASSANDRA-5935 allows option to disable region-based allocation for on-heap 
> memtables but there is no option to disable it for off-heap memtables 
> (memtable_allocation_type: offheap_objects). 
> Disabling region-based allocation will allow us to pack more tables in the 
> schema since minimum of 1MB region won't be allocated per table. Downside can 
> be more fragmentation which should be controllable by using better allocator 
> like JEMalloc.
> How about below option in yaml?:
> memtable_allocation_type: unslabbed_offheap_objects
> Thanks.



