Okay, let's attack this directly. We have a cluster of 6 machines (6 
nodes). We have an index of just under 3.5 million documents, each of 
which represents an Internet domain name. We are querying this index to 
find names that exist in it. Most queries come back in the sub-50ms 
range, but a bunch are taking 600ms to 900ms and, thus, showing up in 
our slow query log. If they ALL performed at this speed, I wouldn't be 
nearly as confused, but it looks like only about 10% to 20% of the 
queries are "slow." That's clearly too much.
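
(For anyone curious how I'm tallying that 10% to 20% figure: a minimal 
sketch of pulling took_millis out of slowlog lines like the one further 
down. The sample entries here are fabricated for illustration, and I'm 
assuming the slowlog threshold is turned down far enough to capture the 
fast queries too.)

```python
import re

# Fabricated sample slowlog lines; the format mirrors the real entries
# shown later in this post, but the timings are made up.
log_lines = [
    '[2014-07-31 07:35:31,530][WARN ][index.search.slowlog.query] [Mesa-03] [aftermarket-2014-07-31_02-38-19][2] took[707.6ms], took_millis[707], ...',
    '[2014-07-31 07:35:32,100][WARN ][index.search.slowlog.query] [Mesa-03] [aftermarket-2014-07-31_02-38-19][0] took[42.0ms], took_millis[42], ...',
    '[2014-07-31 07:35:33,250][WARN ][index.search.slowlog.query] [Mesa-03] [aftermarket-2014-07-31_02-38-19][4] took[851.3ms], took_millis[851], ...',
]

# Pull out took_millis and count how many exceed our 100ms target.
times = [int(m.group(1)) for line in log_lines
         for m in [re.search(r'took_millis\[(\d+)\]', line)] if m]
slow = [t for t in times if t > 100]
print(f"{len(slow)}/{len(times)} queries over 100ms")
```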

The head plugin reports that this index looks like this:

aftermarket-2014-07-31_02-38-19
size: 424Mi (2.47Gi)
docs: 3,428,471 (3,428,471)

Here is the configuration for a typical node (they're all pretty much 
the same). We have 2 machines in a dev data center, 2 machines in a 
mesa data center, and 2 machines in a phx data center. Each of the two 
machines in a data center has a "node.zone" tag set, and, as you can 
see, I have the cluster routing awareness set to use "zone" as its 
marching orders. The data pipes between the data centers are beefy, and 
while I acknowledge that cross-DC clustering isn't something that's 
generally smiled upon, it appears to work fine.

Each machine has 96G of RAM. We start ES with a 30G heap. File 
descriptors are set at 64,000. Note that I've selected the 
memory-mapped file system (mmapfs).

#
# Server-specific settings for cluster domainiq-es
#
cluster.name: domainiq-es
node.name: "Mesa-03"
node.zone: es-mesa-prod
discovery.zen.ping.unicast.hosts: ["dev2.glbt1.gdg", "m1p1.mesa1.gdg", 
"m1p4.mesa1.gdg", "p3p3.phx3.gdg", "p3p4.phx3.gdg"]
#
# The following configuration items should be the same for all ES servers
#
node.master: true
node.data: true
index.number_of_shards: 5
index.number_of_replicas: 5
index.store.type: mmapfs
index.memory.index_buffer_size: 30%
index.translog.flush_threshold_ops: 25000
index.refresh_interval: 30s
bootstrap.mlockall: true
cluster.routing.allocation.awareness.attributes: zone
gateway.recover_after_nodes: 4
gateway.recover_after_time: 2m
gateway.expected_nodes: 6
discovery.zen.minimum_master_nodes: 3
discovery.zen.ping.timeout: 10s
discovery.zen.ping.retries: 3
discovery.zen.ping.interval: 15s
discovery.zen.ping.multicast.enabled: false
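
The arithmetic behind those shard/replica numbers, for anyone checking 
my work (just a sanity check, not ES code):

```python
# With 5 primary shards and 5 replicas, every shard exists in 6 copies
# (1 primary + 5 replicas). Spread over our 6 nodes, each node ends up
# holding one copy of every shard.
number_of_shards = 5
number_of_replicas = 5
nodes = 6

copies_per_shard = 1 + number_of_replicas                 # 6 copies of each shard
total_shard_copies = number_of_shards * copies_per_shard  # 30 shard copies total
per_node = total_shard_copies / nodes                     # 5 shards on each node

print(f"{total_shard_copies} shard copies across {nodes} nodes "
      f"= {per_node:g} shards per node")
```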

And here is a typical slow query:

[2014-07-31 07:35:31,530][WARN ][index.search.slowlog.query] [Mesa-03] 
[aftermarket-2014-07-31_02-38-19][2] took[707.6ms], took_millis[707], 
types[premium], stats[], search_type[QUERY_THEN_FETCH], total_shards[5], 
source[], 
extra_source[{"size":35,"query":{"query_string":{"query":"sld:petusies^20.0 
OR tokens:(((pet^1.2 pets^1.0 *^1.0)AND(us^1.2 *^0.8)AND(ie^1.2 
*^0.6)AND(s^1.2 *^0.4)) OR((pet^1.2 pets^1.0)AND(us^1.2)AND(ie^1.2))^3.0) 
AND tld:(com^1.001 OR in^0.99 OR co.in^0.941174367459617 OR 
net.in^0.8848832474555992 OR us^0.85 OR org.in^0.8397882862729736 OR 
gen.in^0.785829669672289 OR firm.in^0.7414549824163524 OR ind.in^0.7 OR 
org^0.6) OR 
_id:petusi.es^5.0-domaintype:partner","lowercase_expanded_terms":true,"analyze_wildcard":false}}}],
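
To make that query's structure readable, here is the same body 
re-created by hand and pretty-printed. This is trimmed: the long 
boosted token and tld lists are abbreviated with "..." and the query 
string is not meant to be valid as written — only the shape matters 
here.

```python
import json

# A trimmed, hand-abbreviated reconstruction of the extra_source
# payload above (the "..." placeholders stand in for the full boosted
# term lists).
query_body = {
    "size": 35,
    "query": {
        "query_string": {
            "query": (
                "sld:petusies^20.0 "
                "OR tokens:(...) "
                "AND tld:(com^1.001 OR in^0.99 OR ... OR org^0.6) "
                "OR _id:petusi.es^5.0-domaintype:partner"
            ),
            "lowercase_expanded_terms": True,
            "analyze_wildcard": False,
        }
    },
}
print(json.dumps(query_body, indent=2))
```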
 


So note that I create 5 shards and 5 replicas, so that each node has all 5 
shards at all times. I THOUGHT THIS MEANT BETTER PERFORMANCE. That is, I 
thought having all 5 shards on every node meant that a query to a node 
didn't have to ask another node for data. IS THIS NOT TRUE?
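
For reference, here's the model I'm now questioning. As I understand 
QUERY_THEN_FETCH (and the slowlog's total_shards[5] seems to say this), 
every search still fans out one query-phase request per shard; replicas 
only add choices of *where* each request can go. A toy sketch of that 
understanding, not ES internals:

```python
import random

# Toy model of QUERY_THEN_FETCH fan-out as I understand it (not actual
# ES code): the coordinating node sends ONE query-phase request per
# shard, choosing among the available copies of that shard. Replicas
# change where each request can go, not how many requests there are.
def query_phase_requests(number_of_shards, number_of_replicas):
    requests = []
    for shard in range(number_of_shards):
        copies = 1 + number_of_replicas          # primary + replicas
        chosen_copy = random.randrange(copies)   # pick any one copy
        requests.append((shard, chosen_copy))
    return requests

# 5 shards -> 5 per-shard requests, with 0 replicas or with 5.
assert len(query_phase_requests(5, 0)) == len(query_phase_requests(5, 5)) == 5
```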

Here's where it also gets interesting: I tried setting the number of shards 
to 2 (with 5 replicas) and my slow queries went to almost 2 seconds 
(2000ms). This is also terribly counter-intuitive! I thought fewer shards 
meant less lookup time.
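
For what it's worth, here's the back-of-the-envelope model that makes 
the 2-shard result feel consistent in hindsight: fewer shards means 
more documents per shard, and if per-shard query time grows with shard 
size, the slowest shard sets the overall latency. A toy model, not a 
benchmark:

```python
docs = 3_428_471  # document count from the head plugin output above

def docs_per_shard(total_docs, shards):
    # Roughly even routing means each shard holds total/shards docs.
    return total_docs / shards

five = docs_per_shard(docs, 5)  # ~686k docs per shard
two = docs_per_shard(docs, 2)   # ~1.7M docs per shard

# If query time is roughly proportional to per-shard doc count, going
# from 5 shards to 2 makes each shard 2.5x bigger -- in the same
# ballpark as slow queries going from ~800ms toward ~2000ms.
print(f"2-shard index holds {two / five:.1f}x more docs per shard")
```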

Clearly, I want to optimize for read here. I don't care if indexing is 
three times as slow; we need our queries to be sub-100ms.

Any help is SERIOUSLY appreciated (and if you're in the Bay Area, I'm not 
above bribes of beer :-))
