[ https://issues.apache.org/jira/browse/CASSANDRA-8894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14630928#comment-14630928 ]

Stefania commented on CASSANDRA-8894:
-------------------------------------

[~benedict] I went ahead and implemented the latest suggested optimization in 
[this commit|https://github.com/stef1927/cassandra/commit/ad6712cdc12380ef0529a13ed6e9bd1c5cecebad]. 
I've also attached tentative stress yaml profiles, which I intend to run like 
this:

{code}
user profile=https://dl.dropboxusercontent.com/u/15683245/8894_tiny.yaml ops\(insert=1,\) n=100000 -rate threads=50
user profile=https://dl.dropboxusercontent.com/u/15683245/8894_tiny.yaml ops\(singleblob=1,\) n=100000 -rate threads=50
{code}

Can you confirm the profiles are what you intended: basically a partition id 
and a blob column, with the size distributed as you previously indicated? I'm 
not sure if there is anything else I should do to ensure reads mostly hit disk, 
other than spreading the partition id across a big interval.

I created these additional branches:
- trunk-pre-8099
- 8894-pre-8099
- 8894-pre-8099-first-optim
- 8894-first-optim

The names are self-describing except for "first-optim", which means before 
implementing the latest optimization. A tag would have been enough, but cstar 
perf does not support tags.

Unfortunately, cstar perf has been giving me more problems than just the 
missing tag support, cc [~enigmacurry]:

* The old trunk branches pre-8099 fail due to the schema table changes 
(http://cstar.datastax.com/tests/id/e134ee7e-2c46-11e5-a180-42010af0688f): 
"InvalidQueryException: Keyspace system_schema does not exist". However, I 
think we should be OK if we fake version 2.2 in build.xml.
* The new branches either fail because of a nodetool failure 
(http://cstar.datastax.com/tests/id/86abc144-2c55-11e5-87b9-42010af0688f) or 
the graphs are wrong 
(http://cstar.datastax.com/tests/id/11fe9c5a-2c45-11e5-9760-42010af0688f).

Here is the nodetool failure:

{code}
[10.200.241.104] Executing task 'ensure_running'
[10.200.241.104] run: JAVA_HOME=~/fab/jvms/jdk1.8.0_45 ~/fab/cassandra/bin/nodetool ring
[10.200.241.104] out: error: null
[10.200.241.104] out: -- StackTrace --
[10.200.241.104] out: java.util.NoSuchElementException
[10.200.241.104] out:   at com.google.common.collect.LinkedHashMultimap$1.next(LinkedHashMultimap.java:506)
[10.200.241.104] out:   at com.google.common.collect.LinkedHashMultimap$1.next(LinkedHashMultimap.java:494)
[10.200.241.104] out:   at com.google.common.collect.TransformedIterator.next(TransformedIterator.java:48)
[10.200.241.104] out:   at java.util.Collections.max(Collections.java:708)
[10.200.241.104] out:   at org.apache.cassandra.tools.nodetool.Ring.execute(Ring.java:63)
[10.200.241.104] out:   at org.apache.cassandra.tools.NodeTool$NodeToolCmd.run(NodeTool.java:240)
[10.200.241.104] out:   at org.apache.cassandra.tools.NodeTool.main(NodeTool.java:154)
[10.200.241.104] out: 
[10.200.241.104] out: 
{code}
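
For what it's worth, the NoSuchElementException looks like Collections.max being 
handed an empty collection in Ring.execute: Collections.max starts by calling 
iterator().next(), which throws exactly this when there is nothing to iterate. 
Below is a minimal Java sketch of that failure mode plus a guarded alternative; 
the variable names are mine and the empty-collection cause is an assumption on 
my part, not a confirmed diagnosis:

{code}
import java.util.Collection;
import java.util.Collections;
import java.util.NoSuchElementException;

public class EmptyMaxSketch
{
    public static void main(String[] args)
    {
        // Hypothetical stand-in for whatever collection Ring.execute passes to
        // Collections.max; assumed empty, as in the failing cstar perf run.
        Collection<Integer> tokenOwnership = Collections.emptyList();

        try
        {
            // Collections.max() calls iterator().next() first, so an empty
            // collection fails with NoSuchElementException rather than
            // returning a default value.
            Collections.max(tokenOwnership);
        }
        catch (NoSuchElementException e)
        {
            System.out.println("Collections.max on an empty collection: " + e);
        }

        // A guarded variant that tolerates the empty case:
        int max = tokenOwnership.isEmpty() ? 0 : Collections.max(tokenOwnership);
        System.out.println("max = " + max);
    }
}
{code}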

I'll resume the performance tests once cstar perf is stable again.


> Our default buffer size for (uncompressed) buffered reads should be smaller, 
> and based on the expected record size
> ------------------------------------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-8894
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-8894
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>            Reporter: Benedict
>            Assignee: Stefania
>              Labels: benedict-to-commit
>             Fix For: 3.x
>
>         Attachments: 8894_25pct.yaml, 8894_5pct.yaml, 8894_tiny.yaml
>
>
> A large contributor to slower buffered reads than mmapped is likely that we 
> read a full 64Kb at once, when average record sizes may be as low as 140 
> bytes on our stress tests. The TLB has only 128 entries on a modern core, and 
> each read will touch 32 of these, meaning we are unlikely to ever hit in the 
> TLB and will incur at least 30 unnecessary misses each time (as well as the 
> other costs of larger-than-necessary accesses). When working with an SSD there 
> is little to no benefit to reading more than 4Kb at once, and in either case 
> reading more data than we need is wasteful. So, I propose selecting a buffer 
> size that is the next larger power of 2 than our average record size (with a 
> minimum of 4Kb), so that we expect to complete each read in one operation. I 
> also propose that we create a pool of these buffers up-front, and that we 
> ensure they are all exactly aligned to a virtual page, so that the source and 
> target operations each touch exactly one virtual page per 4Kb of expected 
> record size.
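
For illustration only, here is a minimal Java sketch of the buffer-size 
selection described above (the next power of two at or above the average record 
size, clamped to a 4Kb minimum). The 64Kb upper cap, the method and constant 
names, and the use of a direct buffer are my own assumptions for the sketch, not 
the actual patch:

{code}
import java.nio.ByteBuffer;

public class BufferSizeSketch
{
    private static final int MIN_BUFFER_SIZE = 4096;   // one 4Kb virtual page
    private static final int MAX_BUFFER_SIZE = 65536;  // current 64Kb default; cap is an assumption

    // Smallest power of two >= avgRecordSize, clamped to [4Kb, 64Kb].
    public static int bufferSizeFor(int avgRecordSize)
    {
        int size = Integer.highestOneBit(Math.max(avgRecordSize, 1));
        if (size < avgRecordSize)
            size <<= 1;                                 // round up to the next power of two
        return Math.min(Math.max(size, MIN_BUFFER_SIZE), MAX_BUFFER_SIZE);
    }

    public static void main(String[] args)
    {
        System.out.println(bufferSizeFor(140));         // 4096: tiny records still read one page
        System.out.println(bufferSizeFor(5 * 1024));    // 8192
        System.out.println(bufferSizeFor(100 * 1024));  // 65536: capped at the old default

        // Page alignment (the "exactly aligned to a virtual page" part) would
        // additionally require over-allocating a direct buffer and slicing it
        // at a 4Kb boundary; omitted here for brevity.
        ByteBuffer buffer = ByteBuffer.allocateDirect(bufferSizeFor(140));
        System.out.println(buffer.capacity());
    }
}
{code}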



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
