[
https://issues.apache.org/jira/browse/CASSANDRA-8894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14617223#comment-14617223
]
Benedict commented on CASSANDRA-8894:
-------------------------------------
bq. the question is should we support more than one type (one per data
directory) or just keep it simple and have a global setting only?
That, and I was also referring to the chance calculations, i.e. the size
percentile on which to calculate the chance of crossing a page boundary, and
the threshold at which we use the resulting chance to add an extra page to the
read. These would be very helpful for some users to have access to, and also
for later performance tuning on our end; however, I don't think they warrant
explicit mention in the yaml.
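For illustration only, such tunables could be exposed as internal system
properties instead; the property names and defaults in this sketch are
hypothetical and do not exist in Cassandra:
{code:java}
public final class BufferedReadTunables
{
    // Hypothetical internal tunables, read once at startup; the property
    // names and defaults here are illustrative only, not an existing API.
    public static final double SIZE_PERCENTILE =
        Double.parseDouble(System.getProperty("cassandra.read.size_percentile", "0.95"));
    public static final double EXTRA_PAGE_THRESHOLD =
        Double.parseDouble(System.getProperty("cassandra.read.extra_page_chance_threshold", "0.5"));
}
{code}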
bq. The chance of crossing the page boundary should be calculated as the ratio
between
Well, that depends (if I'm interpreting your use of sigma correctly). Is a
normal distribution the correct assumption? Perhaps we could do better here,
and also reduce the number of knobs, by calculating the chance of every size
percentile (or decile, or some other quantile) crossing a page boundary, so
that we make no assumptions about the distribution. What I was suggesting
above was taking a high size percentile (95% was my suggestion) and just
assuming everything is that size or smaller. I think we have to assume a
uniform distribution of _start position_ within a page, which for any size
gives its chance of crossing a boundary straightforwardly as
{{(size % 4096) / 4096}}.
I don't think it matters _too_ much which strategy we use here, though, so
long as it's something exploiting this basic approach.
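To make the arithmetic concrete, here is a minimal sketch of both variants
(the class and method names are mine, for illustration only):
{code:java}
public final class PageCrossing
{
    private static final int PAGE_SIZE = 4096;

    // Chance that a read of 'size' bytes, starting at a uniformly random
    // offset within a page, crosses one more page boundary than
    // size / PAGE_SIZE already guarantees.
    public static double extraPageChance(int size)
    {
        return (size % PAGE_SIZE) / (double) PAGE_SIZE;
    }

    // Distribution-free variant: average the chance over observed size
    // quantiles (e.g. deciles) rather than assuming any distribution.
    public static double extraPageChance(int[] sizeQuantiles)
    {
        double sum = 0;
        for (int size : sizeQuantiles)
            sum += extraPageChance(size);
        return sum / sizeQuantiles.length;
    }
}
{code}
For example, a 95th-percentile size of 5000 bytes gives a crossing chance of
{{(5000 % 4096) / 4096}} ~= 0.22; only if that exceeded the configured
threshold would we add the extra page to the read.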
> Our default buffer size for (uncompressed) buffered reads should be smaller,
> and based on the expected record size
> ------------------------------------------------------------------------------------------------------------------
>
> Key: CASSANDRA-8894
> URL: https://issues.apache.org/jira/browse/CASSANDRA-8894
> Project: Cassandra
> Issue Type: Improvement
> Components: Core
> Reporter: Benedict
> Assignee: Stefania
> Labels: benedict-to-commit
> Fix For: 3.x
>
>
> A large contributor to buffered reads being slower than mmapped reads is
> likely that we read a full 64Kb at once, when average record sizes may be as
> low as 140 bytes in our stress tests. The TLB has only 128 entries on a
> modern core, and each read touches 32 of them, meaning we will almost never
> hit in the TLB, and will incur at least 30 unnecessary misses each time (as
> well as the other costs of larger-than-necessary accesses). When working
> with an SSD there is little to no benefit to reading more than 4Kb at once,
> and in either case reading more data than we need is wasteful. So, I propose
> selecting a buffer size that is the next power of 2 larger than our average
> record size (with a minimum of 4Kb), so that we expect to read each record
> in a single operation. I also propose that we create a pool of these buffers
> up-front, and that we ensure they are all exactly aligned to a virtual page,
> so that the source and target operations each touch exactly one virtual page
> per 4Kb of expected record size.
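As a rough sketch of the sizing and alignment described above (not the
eventual patch; {{alignedSlice}} is a Java 9+ convenience, and Cassandra would
use its own memory utilities instead):
{code:java}
import java.nio.ByteBuffer;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

public final class AlignedReadBuffers
{
    private static final int PAGE_SIZE = 4096;

    private final int bufferSize;
    private final Queue<ByteBuffer> pool = new ConcurrentLinkedQueue<>();

    AlignedReadBuffers(int avgRecordSize)
    {
        // Next power of two >= the average record size, floored at one page.
        this.bufferSize = nextPowerOfTwo(Math.max(avgRecordSize, PAGE_SIZE));
    }

    static int nextPowerOfTwo(int n)
    {
        return Integer.highestOneBit(n - 1) << 1; // n >= 2 assumed
    }

    // Over-allocate by one page, then slice at a 4Kb-aligned offset so the
    // buffer touches exactly one virtual page per 4Kb read.
    ByteBuffer take()
    {
        ByteBuffer buffer = pool.poll();
        if (buffer == null)
            buffer = ByteBuffer.allocateDirect(bufferSize + PAGE_SIZE)
                               .alignedSlice(PAGE_SIZE);
        return buffer;
    }

    void recycle(ByteBuffer buffer)
    {
        buffer.clear();
        pool.add(buffer);
    }
}
{code}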