[
https://issues.apache.org/jira/browse/CASSANDRA-8894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14616451#comment-14616451
]
Stefania commented on CASSANDRA-8894:
-------------------------------------
Thank you for your input, [~benedict]. I plan on resuming this in a few days,
once I have wrapped up a couple of other tickets.
The rebase after the merge of 8099 wasn't particularly hard for this ticket, so
rebasing back to 2.2 for performance comparisons should hopefully be doable
without too much pain.
Regarding exposing the knobs to users: it would be nice to detect
automatically whether we are reading from SSDs or rotational disks, but I don't
think that can be done on all platforms; it is likely feasible only on recent
Linux kernels. I don't think it's unreasonable to ask users which disk type
they are targeting; the question is whether we should support more than one
type (one per data directory) or keep it simple with a single global setting.
bq. For SSDs, we probably want to read an extra page only if there is >, say, a
10% chance of crossing the page boundary for our read; otherwise we may as well
do the extra read as necessary.
Should the chance of crossing the page boundary be calculated as the ratio
between, say, 1.28 sigma (the 90th percentile) and the page size? If that
ratio is greater than one, do we read to the next boundary, and otherwise
round down?
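To make the question concrete, here is one possible reading of that rule as an
illustrative sketch. This is purely my interpretation for discussion; the
function name, the rounding choices, and the fixed 4KB page size are all
assumptions, not anything in the patch:

```python
PAGE_SIZE = 4096  # assumed virtual page size


def choose_read_length(mean_record_size, sigma, z=1.28):
    """Hypothetical sketch of the heuristic being discussed: compare the
    90th-percentile spread (z * sigma) to the page size. If the spread
    exceeds one page, read up to the next page boundary past the expected
    record size; otherwise round down to a whole number of pages, with a
    floor of one page."""
    spread = z * sigma
    if spread > PAGE_SIZE:
        # round the expected size up to the next page boundary
        pages = -(-mean_record_size // PAGE_SIZE)  # ceiling division
        return max(PAGE_SIZE, pages * PAGE_SIZE)
    # round down, but never read less than one page
    return max(PAGE_SIZE, (mean_record_size // PAGE_SIZE) * PAGE_SIZE)
```

For example, a 140-byte mean with a small sigma would stay at a single 4KB
page, while a 10,000-byte mean with a wide spread would round up to three
pages.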
> Our default buffer size for (uncompressed) buffered reads should be smaller,
> and based on the expected record size
> ------------------------------------------------------------------------------------------------------------------
>
> Key: CASSANDRA-8894
> URL: https://issues.apache.org/jira/browse/CASSANDRA-8894
> Project: Cassandra
> Issue Type: Improvement
> Components: Core
> Reporter: Benedict
> Assignee: Stefania
> Labels: benedict-to-commit
> Fix For: 3.x
>
>
> A large contributor to buffered reads being slower than mmapped is likely
> that we read a full 64KB at once, when average record sizes may be as low
> as 140 bytes in our stress tests. The TLB has only 128 entries on a modern
> core, and each read will touch 32 of these, meaning we will almost never
> hit the TLB and will incur at least 30 unnecessary misses each time (as
> well as the other costs of larger-than-necessary accesses). When working
> with an SSD there is little to no benefit in reading more than 4KB at
> once, and in either case reading more data than we need is wasteful. So, I
> propose selecting a buffer size that is the next power of 2 larger than
> our average record size (with a minimum of 4KB), so that we expect to
> complete each read in one operation. I also propose that we create a pool
> of these buffers up-front, and ensure they are all exactly aligned to a
> virtual page, so that the source and target operations each touch exactly
> one virtual page per 4KB of expected record size.
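The sizing rule in the quoted description can be sketched as follows. This is
an illustrative Python fragment, not the actual Java implementation;
`buffer_size_for` and `MIN_BUFFER` are hypothetical names I am using only to
show the arithmetic:

```python
MIN_BUFFER = 4096  # 4KB floor, matching the minimum in the proposal


def buffer_size_for(avg_record_size):
    """Return the next power of two >= the average record size,
    floored at 4KB, per the buffer-sizing rule described above."""
    size = MIN_BUFFER
    while size < avg_record_size:
        size *= 2
    return size
```

So a 140-byte average record gets a 4KB buffer rather than today's 64KB one,
and a 5,000-byte average gets 8KB; the page-alignment of the pooled buffers
would be handled separately at allocation time.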
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)