[ https://issues.apache.org/jira/browse/CASSANDRA-6976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14230181#comment-14230181 ]
Benedict commented on CASSANDRA-6976:
-------------------------------------

bq. I recall someone on the Mechanical Sympathy group pointing out that you can warm an entire last level cache in some small amount of time, I think it was 30ish milliseconds. I can't find the post and I could be very wrong, but it was definitely milliseconds. My guess is that in the big picture cache effects aren't changing the narrative that this takes 10s to 100s of milliseconds.

Sure it does - if an action that is likely memory bound (like this one - after all, it does very little computation and touches no disk) takes time X with a warmed cache, and only touches data that can fit in cache, it will take X*K with a cold cache for some K (significantly) > 1. And in real operation, especially with many tokens, a cold cache is quite likely, given the lack of locality and the amount of data as the cluster grows. This is actually one possibility for improving this behaviour, if we cared to: keep the number of cache lines touched low by working with primitives for the token ranges and inet addresses, reducing the constant factors (a sketch follows at the end of this comment). This would also improve the normal code paths, not just range slices.

bq. If it is slow, what is the solution? Even if we lazily materialize the ranges the run time of fetching batches of results dominates the in-memory compute of getRestrictedRanges. When we talked use cases it seems like people would be using paging programmatically, so only console users would see this poor performance outside of the lookup table use case you mentioned.

For a lookup-table (i.e. small table) query, or a range query that can be serviced entirely by the local node, it is quite unlikely that the fetching would dominate when we are talking about timescales >= 1ms.

bq. I didn't quite follow this. Are you talking about getLiveSortedEndpoints called from getRangeSlice? I haven't dug deep enough into getRangeSlice to tell you where the time in that goes exactly. I would have to do it again and insert some probes. I assumed it was dominated by sending remote requests.

Yes. For your benchmark it would not have spent much time here, since the sort would be a no-op and the list a single entry, but as the number of data centres and the replication factor grow, and with use of NetworkTopologyStrategy, this could be a significant time expenditure. In aggregate it will also account for a measurable percentage of the CPU time spent on all queries. However, since the sort order is actually pretty consistent, sorting only when the sort order changes would be a way to eliminate this cost (a second sketch follows below).

bq. Benchmarking in what scope? This microbenchmark, defaults for workloads in cstar, tribal knowledge when doing performance work?

Like I said, please do feel free to drop this particular line of enquiry for the moment, since even with all of the above I doubt this is a pressing matter. But I don't think this is the end of the topic entirely - at some point this cost will be a more measurable percentage of work done. These kinds of costs are simply not a part of any of our current benchmarking methodology, since our default configs avoid the code paths entirely (a single DC, low RF, low node count, no vnodes, and SimpleStrategy), and that is something we should address. In the meantime it might be worth having a simple short-circuit path for queries that can be answered by the local node only (a third sketch follows below).
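To make the cache-line point concrete, here is a minimal sketch (my own illustrative code, not an existing Cassandra class) of holding the token ring as a sorted primitive long[] rather than a collection of boxed Token objects: a binary search over a flat array touches only O(log n) cache lines and does no pointer chasing, which keeps the cold-cache multiplier K small.

{code:java}
import java.util.Arrays;

// Hypothetical sketch: a Murmur3-style ring kept as primitive longs.
// Class and method names are illustrative only.
final class PrimitiveTokenRing
{
    private final long[] sortedTokens; // ring positions, ascending

    PrimitiveTokenRing(long[] tokens)
    {
        this.sortedTokens = tokens.clone();
        Arrays.sort(this.sortedTokens);
    }

    // Index of the first token >= key, wrapping past the last token to 0.
    int ownerIndex(long key)
    {
        int i = Arrays.binarySearch(sortedTokens, key);
        if (i < 0)
            i = -i - 1; // insertion point for a key not present
        return i == sortedTokens.length ? 0 : i;
    }
}
{code}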
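The "sort only when the sort order changes" idea could look roughly like the following - again illustrative: the version counter and the proximity-sort callback are assumptions, not StorageProxy's real API. The snitch-sorted replica list is memoised per range, and the whole cache is dropped whenever ring membership or proximity ordering changes.

{code:java}
import java.net.InetAddress;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Function;

// Hypothetical sketch: memoise sorted endpoints per range, invalidated
// in bulk when topology or proximity ordering changes.
final class SortedEndpointsCache<R>
{
    private final AtomicLong ringVersion = new AtomicLong();
    private volatile long cachedVersion = -1;
    private final Map<R, List<InetAddress>> cache = new ConcurrentHashMap<>();

    // Call from topology / snitch change listeners.
    void invalidate()
    {
        ringVersion.incrementAndGet();
    }

    List<InetAddress> get(R range, Function<R, List<InetAddress>> sortByProximity)
    {
        long current = ringVersion.get();
        if (current != cachedVersion)
        {
            cache.clear(); // benign race: a stale entry is at worst recomputed
            cachedVersion = current;
        }
        return cache.computeIfAbsent(range, sortByProximity);
    }
}
{code}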
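And the local-node short-circuit amounts to one cheap check before any range splitting or endpoint sorting happens; if it passes, the read executes locally and skips the expensive path entirely. Helper and parameter names are again hypothetical.

{code:java}
import java.net.InetAddress;
import java.util.List;

// Hypothetical sketch of the suggested local-only fast path.
final class LocalRangeFastPath
{
    // True when the local node is the sole live replica for the whole
    // queried range, so no splitting, sorting or remote messaging is needed.
    static boolean isLocalOnly(List<InetAddress> endpointsForRange, InetAddress local)
    {
        return endpointsForRange.size() == 1 && endpointsForRange.get(0).equals(local);
    }
}
{code}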
> Determining replicas to query is very slow with large numbers of nodes or vnodes
> --------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-6976
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-6976
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>            Reporter: Benedict
>            Assignee: Ariel Weisberg
>              Labels: performance
>         Attachments: GetRestrictedRanges.java, jmh_output.txt, jmh_output_murmur3.txt, make_jmh_work.patch
>
> As described in CASSANDRA-6906, this can be ~100ms for a relatively small cluster with vnodes, which is longer than it will spend in transit on the network. This should be much faster.