Ok, I have deleted everything, restarted the ES service on both nodes, and rebuilt the indexes. I have changed my shard count to 1 -- thank you for explaining that; I didn't realize the overhead was as large as it is. I do have some indexes that are much larger than others -- the largest one has around 400k issues -- but search performance seems fine with one shard, so I'll leave it at that. I can introduce a per-project shard count setting later if needed.
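For reference, this is roughly what the per-project index creation looks like now -- host and index name are placeholders and the mappings are omitted, so treat it as a sketch rather than the exact commands:

    # create a project index with a single primary shard and no replicas for the initial import
    curl -XPUT 'http://<node1>:9200/<indexname>' -d '{
      "settings": { "number_of_shards": 1, "number_of_replicas": 0 }
    }'

    # once the bulk import for that project finishes, turn the replica back on
    curl -XPUT 'http://<node1>:9200/<indexname>/_settings' -d '{
      "index": { "number_of_replicas": 1 }
    }'

If a project ever outgrows a single shard (like the 400k-issue one), I can set number_of_shards higher at creation time for just that index.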
Results: my search count issue appears to be resolved; it was probably related to my large shard count. However, the slow queries every 10 minutes remain, and they're still on the same schedule -- always at a wall-clock minute ending in 9, e.g. 12:29pm.

Here is the output of a hot_threads call I ran while the slow searches were happening (I replaced machine names and IPs with tokens for privacy):

::: [<node1>][TbKES0BpQ---dAP1J9kKKQ][<node1>][inet[/<node1IP>:9300]]
   Hot threads at 2015-04-21T19:29:27.841Z, interval=500ms, busiestThreads=3, ignoreIdleThreads=true:

::: [<node2>][aPof4TmzQlm7Ck7frfGeuA][<node2>][inet[/<node2IP>:9300]]
   Hot threads at 2015-04-21T19:29:26.250Z, interval=500ms, busiestThreads=3, ignoreIdleThreads=true:

    0.0% (0s out of 500ms) cpu usage by thread 'Attach Listener'
      unique snapshot
      unique snapshot
      unique snapshot
      unique snapshot
      unique snapshot
      unique snapshot
      unique snapshot
      unique snapshot
      unique snapshot
      unique snapshot

Here is the slow search log for the same timeframe (the first number is the time to get the response, the second is the .took field from the ES response JSON):

4/21/2015, 12:29:27 PM  <node1>  8586  8361
4/21/2015, 12:29:27 PM  <node1>  9681  9454
4/21/2015, 12:29:27 PM  <node2>  7685  7457

There's nothing in the ES logs after the indexes were initially created -- just INFO-level entries like this:

[2015-04-21 10:48:05,029][INFO ][cluster.metadata ] [<node1>] updating number_of_replicas to [1] for indices [<indexname>]

It's so interesting that it happens every 10 minutes and always at :x9, even though I've restarted ES. There could still be an external reason for this, but like I said, there's no CPU usage or disk I/O going on during these periods, so it's very strange. I have no leads other than these slow search times. Thanks so much for your help -- any other troubleshooting ideas would be greatly appreciated. (See the P.S. at the very bottom of this message for the calls I plan to run the next time this hits.)

-Dave

On Tuesday, April 21, 2015 at 12:12:38 AM UTC-7, David Pilato wrote:
>
> Some notes. You are using defaults, so you have 5 shards per index -- 1000 primary shards.
> With replicas, it means 1000 shards per node, which means 1000 Lucene instances.
>
> First thing to do is IMO to use only one shard per index unless you need more for "big" indices.
>
> Then, have 3 nodes and set minimum_master_nodes to 2. You probably ran into a split-brain issue, which could explain the difference you are seeing.
> I would probably set replicas to 0 and then to 1 again. But if you follow my first suggestion, you have to reindex, so reindex with 1 shard, 0 replicas, then set replicas to 1.
>
> My 2 cents
>
> David
>
> On 21 Apr 2015, at 08:31, Dave Reed <infin...@gmail.com> wrote:
>
> The logs are basically empty. There's activity from when I am creating the indexes, but that's about it. Is there a logging level that could be increased? I will run hot_threads as soon as I can in the morning and post the results.
>
> I have each index set to 1 replica. If it matters, I first import them with 0 replicas and then set it to 1 when they are done. The shard count I will have to check and get back to you, but it's whatever the default is -- we haven't tweaked that.
>
> I have 200+ indexes because I have 200+ different "projects" represented, and each one has its own set of mappings -- mappings which could collide on name.
> I originally tried having a single index with each project represented by a type instead, but a conversation I had on this forum led me away from that, because two different projects may sometimes have the same field names but with different types. Most searches (99.9%) are done on a per-project basis, and changes to a project can create a need to reindex it, so having the segregation is nice, lest I have to reimport the entire thing.
>
> In case this is related, I also found that a document was changed and reindexed, but searches would sometimes include it and sometimes not. I could literally just refresh the search over and over again, and it would appear in the results roughly 50% of the time: 0 results, then 0, 0, then 1 result, 1, 1, then 0 again, etc. I was running the search against one of the nodes via _head. The cluster health was green during all of this. That was surprising and something I was going to investigate more on my own, but perhaps these two problems are related.
>
> I'm really hitting the limit of what I know how to troubleshoot with ES, hence I am really hoping for help here :)
>
> On Monday, April 20, 2015 at 10:23:46 PM UTC-7, David Pilato wrote:
>>
>> Could you run a hot_threads API call when this happens?
>> Anything in the logs about GC?
>>
>> BTW, 200 indices is a lot for 2 nodes. And how many shards/replicas do you have?
>> Why do you need so many indices for 2m docs?
>>
>> David
>>
>> On 21 Apr 2015, at 01:16, Dave Reed <infin...@gmail.com> wrote:
>>
>> I have a 2-node cluster running on some beefy machines, with 12g and 16g of heap space. About 2.1 million documents, each relatively small in size, spread across 200 or so indexes. The refresh interval is 0.5s (while I don't need realtime, I do need relatively quick refreshes). Documents are continuously modified by the app, so reindex requests trickle in constantly -- by trickle I mean maybe a dozen a minute. All index requests are made with _bulk, although a lot of the time there's only one item in the list.
>>
>> Searches are very fast -- normally taking 50ms or less.
>>
>> But oddly, exactly every 10 minutes, searches slow down for a moment. The exact same query that normally takes <50ms takes 9000ms, for example. Any other queries, regardless of what they are, also take multiple seconds to complete. Once this moment passes, search queries return to normal.
>>
>> I have a tester I wrote that continuously posts the same query and logs the results, which is how I discovered this pattern.
>>
>> Here's an excerpt. Notice that query time is great at 3:49:10, then at :11 things stop for 10 seconds. At :21 the queued-up searches finally come through. The numbers reported are the "took" field from the ES search response. Then things resume as normal. This is true no matter which node I run the search against.
>>
>> This pattern has repeated like this every 10 minutes, to the second, for days now.
>>
>> 3:49:09, 47
>> 3:49:09, 63
>> 3:49:10, 31
>> 3:49:10, 47
>> 3:49:11, 47
>> 3:49:11, 62
>> 3:49:21, 8456
>> 3:49:21, 5460
>> 3:49:21, 7457
>> 3:49:21, 4509
>> 3:49:21, 3510
>> 3:49:21, 515
>> 3:49:21, 1498
>> 3:49:21, 2496
>> 3:49:22, 2465
>> 3:49:22, 2636
>> 3:49:22, 6506
>> 3:49:22, 7504
>> 3:49:22, 9501
>> 3:49:22, 4509
>> 3:49:22, 1638
>> 3:49:22, 6646
>> 3:49:22, 9641
>> 3:49:22, 655
>> 3:49:22, 3667
>> 3:49:22, 31
>> 3:49:22, 78
>> 3:49:22, 47
>> 3:49:23, 47
>> 3:49:23, 47
>> 3:49:24, 93
>>
>> I've ruled out any obvious periodic process running on either node in the cluster. It's pretty clean. Disk I/O, CPU, RAM, etc. stay pretty consistent and flat even during one of these blips. These are beefy machines like I said, so there's plenty of CPU and RAM available.
>>
>> Any advice on how I can figure out what ES is waiting for would be helpful. Is there any process ES runs every 10 minutes by default?
>>
>> Thanks!
>> -Dave
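P.S. For completeness, here's roughly what I plan to capture the next time one of these 10-minute stalls hits. Hostnames are placeholders and the slowlog thresholds are just values I picked, so treat this as a sketch:

    # sample waiting (not just busy) threads during the stall, over a longer window
    curl 'http://<node1>:9200/_nodes/hot_threads?type=wait&interval=1s&threads=5'

    # see whether cluster-state tasks are backing up at the same moment
    curl 'http://<node1>:9200/_cluster/pending_tasks?pretty'

    # turn on the per-index search slowlog so ES itself records the slow queries
    curl -XPUT 'http://<node1>:9200/_settings' -d '{
      "index.search.slowlog.threshold.query.warn": "1s",
      "index.search.slowlog.threshold.fetch.warn": "1s"
    }'

If the wait-mode hot_threads shows everything blocked on the same thing at :x9, that should at least narrow down what ES is waiting for.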