[ https://issues.apache.org/jira/browse/CASSANDRA-6976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14230181#comment-14230181 ]

Benedict commented on CASSANDRA-6976:
-------------------------------------

bq. I recall someone on the Mechanical Sympathy group pointing out that you can 
warm an entire last level cache in some small amount of time, I think it was 
30ish milliseconds. I can't find the post and I could be very wrong, but it was 
definitely milliseconds. My guess is that in the big picture cache effects 
aren't changing the narrative that this takes 10s to 100s of milliseconds.

Sure it does - if an action that is likely memory bound (like this one - after 
all, it does very little computation and touches no disk) takes time X with a 
warmed cache, and only touches data that fits in cache, it will take X*K with a 
cold cache for some factor K significantly greater than 1. And in real 
operation, especially with many tokens, a cold cache is quite likely given the 
lack of locality and the amount of data involved as the cluster grows. This is 
actually one possibility for improving this behaviour, if we cared at all - 
keeping the number of cache lines touched low by working with primitives for 
the token ranges and inet addresses, to reduce the constant factors. This would 
also improve the normal code paths, not just range slices.
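
As a rough illustration only - the class and field names below are hypothetical, 
not Cassandra's actual types - a primitive-backed layout along these lines would 
keep a ring scan within a few contiguous cache lines instead of chasing pointers 
through boxed Token/Range/InetAddress objects:

{code:java}
import java.util.Arrays;

// Sketch only: parallel primitive arrays in place of boxed Token/Range/InetAddress
// objects, so computing the ranges for a query walks a handful of contiguous
// cache lines rather than pointer-chasing. Names are illustrative, not Cassandra's.
final class CompactTokenRing
{
    // range i covers (leftToken[i], rightToken[i]]; Murmur3 tokens fit in a long
    private final long[] leftToken;
    private final long[] rightToken;
    // replica addresses packed as IPv4 ints, replicationFactor slots per range
    private final int[] packedReplicas;
    private final int replicationFactor;

    CompactTokenRing(long[] leftToken, long[] rightToken, int[] packedReplicas, int rf)
    {
        this.leftToken = leftToken;
        this.rightToken = rightToken;
        this.packedReplicas = packedReplicas;
        this.replicationFactor = rf;
    }

    // index of the range owning this token: binary search over a primitive array
    int rangeFor(long token)
    {
        int i = Arrays.binarySearch(rightToken, token);
        if (i < 0)
            i = -i - 1;
        return i == rightToken.length ? 0 : i; // wrap around the ring
    }

    // copy the packed replica addresses for range i without allocating wrappers
    void replicasFor(int i, int[] out)
    {
        System.arraycopy(packedReplicas, i * replicationFactor, out, 0, replicationFactor);
    }
}
{code}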

bq. If it is slow, what is the solution? Even if we lazily materialize the 
ranges, the run time of fetching batches of results dominates the in-memory 
compute of getRestrictedRanges. When we talked about use cases it seemed like 
people would be using paging programmatically, so only console users would see 
this poor performance outside of the lookup table use case you mentioned.

For a lookup (i.e. small) table query, or a range query that can be serviced 
entirely by the local node, it is quite unlikely that the fetching would 
dominate when talking about timescales >= 1ms.

bq. I didn't quite follow this. Are you talking about getLiveSortedEndpoints 
called from getRangeSlice? I haven't dug deep enough into getRangeSlice to tell 
you where the time in that goes exactly. I would have to do it again and insert 
some probes. I assumed it was dominated by sending remote requests.

Yes - for your benchmark it would not have spent much time here, since the sort 
would be a no-op and the list a single entry, but as the number of data centres 
and the replication factor grow, and with use of NetworkTopologyStrategy, this 
could be a significant time expenditure. In aggregate it also accounts for a 
certain percentage of CPU time spent on all queries. However, since the sort 
order is actually pretty consistent, sorting only when the sort order changes 
would be a way to eliminate this cost.
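
A crude sketch of what I mean by sorting only when the sort order changes - 
memoise the proximity-sorted list and invalidate when the topology changes. The 
ring-version and sort hooks here are stand-ins rather than the real API, and 
liveness filtering is elided:

{code:java}
import java.net.InetAddress;
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;
import java.util.function.UnaryOperator;

// Sketch: cache the proximity-sorted replica list per range and only re-sort
// when the ring/topology version ticks. ringVersion and sortByProximity are
// stand-ins for whatever the snitch/token metadata actually expose.
final class SortedEndpointsCache
{
    private static final class Entry
    {
        final long ringVersion;
        final List<InetAddress> sorted;
        Entry(long ringVersion, List<InetAddress> sorted)
        {
            this.ringVersion = ringVersion;
            this.sorted = sorted;
        }
    }

    private final ConcurrentHashMap<String, Entry> cache = new ConcurrentHashMap<>();

    List<InetAddress> liveSortedEndpoints(String rangeKey,
                                          long currentRingVersion,
                                          Supplier<List<InetAddress>> endpoints,
                                          UnaryOperator<List<InetAddress>> sortByProximity)
    {
        Entry e = cache.get(rangeKey);
        if (e == null || e.ringVersion != currentRingVersion)
        {
            // topology (and hence sort order) may have changed: sort once and remember
            e = new Entry(currentRingVersion, sortByProximity.apply(endpoints.get()));
            cache.put(rangeKey, e);
        }
        return e.sorted;
    }
}
{code}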

bq. Benchmarking in what scope? This microbenchmark, defaults for workloads in 
cstar, tribal knowledge when doing performance work?

Like I said, please do feel free to drop this particular line of enquiry for the 
moment, since even with all of the above I doubt this is a pressing matter. But 
I don't think this is the end of the topic entirely - at some point this cost 
will be a more measurable percentage of work done. These kinds of costs are 
simply not a part of any of our current benchmarking methodology, since our 
default configs avoid the code paths entirely (a single DC, low RF, a low node 
count, no vnodes, and SimpleStrategy), and that is something we should address.

In the meantime it might be worth having a simple short-circuit path for 
queries that may be answered by the local node only, though.
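
Something like the following check - with placeholder types, not the actual 
StorageProxy API - would let the coordinator notice that every restricted range 
has a local replica and skip the remote path for single-replica consistency 
levels:

{code:java}
import java.net.InetAddress;
import java.util.List;

// Minimal sketch of the suggested short-circuit, not Cassandra's actual code:
// if the consistency level can be satisfied by one replica and every restricted
// range is replicated on this node, execute the range scan locally and skip
// remote messaging entirely.
final class LocalShortCircuit
{
    enum ConsistencyLevel { ONE, LOCAL_ONE, QUORUM }

    // Placeholder for a per-range replica lookup; in Cassandra this would come
    // from the replication strategy / token metadata.
    interface ReplicaLookup
    {
        List<InetAddress> replicasFor(Object range);
    }

    private final InetAddress localAddress;
    private final ReplicaLookup lookup;

    LocalShortCircuit(InetAddress localAddress, ReplicaLookup lookup)
    {
        this.localAddress = localAddress;
        this.lookup = lookup;
    }

    boolean canAnswerLocally(Iterable<?> restrictedRanges, ConsistencyLevel cl)
    {
        // only single-replica consistency levels can be served by one node
        if (cl != ConsistencyLevel.ONE && cl != ConsistencyLevel.LOCAL_ONE)
            return false;
        for (Object range : restrictedRanges)
        {
            if (!lookup.replicasFor(range).contains(localAddress))
                return false; // this range has no local replica; must go remote
        }
        return true;
    }
}
{code}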

> Determining replicas to query is very slow with large numbers of nodes or 
> vnodes
> --------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-6976
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-6976
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>            Reporter: Benedict
>            Assignee: Ariel Weisberg
>              Labels: performance
>         Attachments: GetRestrictedRanges.java, jmh_output.txt, 
> jmh_output_murmur3.txt, make_jmh_work.patch
>
>
> As described in CASSANDRA-6906, this can be ~100ms for a relatively small 
> cluster with vnodes, which is longer than it will spend in transit on the 
> network. This should be much faster.


