[ 
https://issues.apache.org/jira/browse/CASSANDRA-8894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14595509#comment-14595509
 ] 

Benedict commented on CASSANDRA-8894:
-------------------------------------

{{estimatedRowSize.mean()}} is probably the best number to use, but it's a bit 
expensive to call for every operation (so let's memoize it). 
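
A minimal sketch of the memoization idea, not the actual patch: the class name {{MemoizedMean}} and the {{invalidate()}} hook are hypothetical, and the supplier stands in for the expensive {{estimatedRowSize.mean()}} call.

```java
import java.util.function.DoubleSupplier;

// Hypothetical helper: caches the result of an expensive mean computation.
final class MemoizedMean {
    private final DoubleSupplier expensiveMean; // stand-in for estimatedRowSize.mean()
    private boolean computed;
    private double cached;

    MemoizedMean(DoubleSupplier expensiveMean) {
        this.expensiveMean = expensiveMean;
    }

    // Computes the mean on first use, then returns the cached value.
    synchronized double get() {
        if (!computed) {
            cached = expensiveMean.getAsDouble();
            computed = true;
        }
        return cached;
    }

    // Assumed hook: drop the cached value if the underlying histogram changes.
    synchronized void invalidate() {
        computed = false;
    }
}
```

If the histogram is fixed for the lifetime of an sstable, the {{invalidate()}} hook would not be needed.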

For the index file, we scan a number of index records and ideally want them all 
to be read in one go. So we need to ask the IndexSummary to tell us how many 
records are in the scan range we've found (by calling 
getEffectiveIndexIntervalAfterIndex), and to divide the file length by this (and 
round up). This will probably leave us with quite big buffers for the index 
files, but with CASSANDRA-8931 they should shrink dramatically (which is an 
excellent follow-up to this).

Then we have a decision to make regarding alignment of our reads. I'm of the 
opinion we should align them, so that we don't issue more read operations than 
necessary. If so, we should put a floor of 4K on the size of the buffer, since 
we cannot read less than this anyway (if we don't read aligned, we will cross 
alignment boundaries, so our buffer size won't dictate how many reads we 
perform). This would also mean we probably want to size our buffer to >= ~4x 
estimatedRowSize.mean(), though, so we have a high likelihood of reading the 
whole row in our read operation (2x to make sure the average is not too small, 
and 2x to make sure we don't miss it through alignment).
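
To make the sizing rule concrete, here is a rough sketch under my own assumptions: the class and method names are hypothetical, rounding up to a whole multiple of 4K is one reading of "aligned", and the 64K cap is only there to bound the result (it is the current default read size, not something proposed here).

```java
// Hypothetical sketch of the aligned data-file buffer sizing rule.
final class DataBufferSizing {
    static final int PAGE_SIZE = 4096;      // 4K floor and alignment unit
    static final int MAX_BUFFER = 64 * 1024; // assumed cap, not part of the proposal

    static int bufferSize(double meanRowSize) {
        // 2x in case the mean is too small, 2x again in case alignment splits the row.
        long wanted = (long) Math.ceil(4.0 * meanRowSize);
        // We cannot usefully read less than one 4K page.
        long floored = Math.max(wanted, PAGE_SIZE);
        // Round up to a whole number of 4K pages (assumed interpretation of alignment).
        long aligned = ((floored + PAGE_SIZE - 1) / PAGE_SIZE) * PAGE_SIZE;
        return (int) Math.min(aligned, MAX_BUFFER);
    }
}
```

With a mean row size of 140 bytes this gives 4096; with 1500 bytes it gives 8192.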

WDYT?

> Our default buffer size for (uncompressed) buffered reads should be smaller, 
> and based on the expected record size
> ------------------------------------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-8894
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-8894
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>            Reporter: Benedict
>            Assignee: Stefania
>             Fix For: 3.x
>
>
> A large contributor to slower buffered reads than mmapped is likely that we 
> read a full 64KB at once, when average record sizes may be as low as 140 
> bytes on our stress tests. The TLB has only 128 entries on a modern core, and 
> each read will touch 32 of these, meaning we are unlikely ever to be 
> hitting the TLB, and will be incurring at least 30 unnecessary misses each 
> time (as well as the other costs of larger than necessary accesses). When 
> working with an SSD there is little to no benefit reading more than 4KB at 
> once, and in either case reading more data than we need is wasteful. So, I 
> propose selecting a buffer size that is the next larger power of 2 than our 
> average record size (with a minimum of 4KB), so that we expect to read a 
> record in one operation. I also propose that we create a pool of these 
> buffers up-front, and that we ensure they are all exactly aligned to a 
> virtual page, so that the source and target operations each touch exactly 
> one virtual page per 4KB of expected record size.
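
The arithmetic in the description can be sketched as follows. This is illustrative only: the names are hypothetical, "next larger power of 2" is taken to mean the smallest power of two not below the average record size, and the page count assumes page-aligned source and target buffers with 4K virtual pages.

```java
// Hypothetical sketch of the power-of-two sizing and the page-touch count.
final class PowerOfTwoBufferSize {
    static final int PAGE_SIZE = 4096; // assumed 4K virtual page, also the minimum buffer

    // Smallest power of two >= avgRecordSize, never below 4K.
    static int forAverageRecordSize(int avgRecordSize) {
        if (avgRecordSize > (1 << 30))
            throw new IllegalArgumentException("record size too large");
        if (avgRecordSize <= PAGE_SIZE)
            return PAGE_SIZE;
        // highestOneBit(n - 1) << 1 rounds n up to a power of two for n > 1.
        return Integer.highestOneBit(avgRecordSize - 1) << 1;
    }

    // Virtual pages touched by one aligned read: one set for source, one for target.
    static int pagesTouched(int readSize) {
        int pagesPerSide = (readSize + PAGE_SIZE - 1) / PAGE_SIZE;
        return 2 * pagesPerSide;
    }
}
```

Under these assumptions a 64K read touches 32 pages, matching the figure above, while a 4K read touches 2.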



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
