Well, it appears that this issue was unrelated to ElasticSearch after all. 
The problem was actually between the API Load Balancer and the API Server 
nodes. We are using Elastic Beanstalk, a managed application container, to 
host these API nodes. It turns out the Apache configuration in the Amazon 
gold image was wrong: the keep-alive and timeout settings were not set 
properly, resulting in timeouts on the load balancer every minute, which 
caused the massive spike in response time.
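
For anyone else hitting this, the usual shape of the fix (illustrative 
values only, not necessarily the exact ones we deployed) is to enable 
keep-alive in Apache and make its timeout longer than the ELB's idle 
timeout, which defaults to 60 seconds:

    # Keep connections from the ELB to Apache open between requests
    KeepAlive On
    # Must exceed the ELB idle timeout (60 s by default)
    KeepAliveTimeout 120
    Timeout 120

If the backend's keep-alive timeout is shorter than the load balancer's 
idle timeout, the LB can end up reusing connections the backend has already 
closed, which shows up as intermittent errors or latency spikes.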

In the performance environment, there is a long enough gap between tests for 
all the connections from the LB to time out. So, when the load starts up, 
connections to all the nodes are established at once, which is why they were 
all on the same schedule, and it also explains why that schedule would shift 
over time (it depended on what time the test started).

Since correcting the Apache configuration, the rate of requests taking 
longer than 500 ms has dropped from 3% to 0.004%, and the remaining slow 
requests generally take about 1 second instead of 2-4 seconds.

So, I would say that it is now resolved, aside from a little more 
optimization.

Thank you, everyone, for your responses.

On Wednesday, April 15, 2015 at 4:05:27 PM UTC-4, Daryl Robbins wrote:
>
> Thank you, Glen. I appreciate your insight!
>
> Here is our environment:
>
>
> <https://lh3.googleusercontent.com/-PLejC0Yt98I/VS7BDRa23pI/AAAAAAAAAhk/MVoWqrRI8ls/s1600/ES%2BSetup.png>
> All nodes are running in a VPC within the same region of AWS, so 
> inter-node latency should be very minimal.
>
> I was thinking the same thing about the ES LB. I was wondering if we were 
> hitting a keepalive timeout or if the level of indirection was otherwise 
> creating a problem. So, I tried removing the ES LB between the API Server 
> nodes (ES clients) and the Eligible Masters earlier today. Each API node is 
> now configured with the private IPs of the three eligible masters. The 
> observed behaviour did not change after this.
>
>
> The Load Balancer in front of the API Servers is pre-warmed to 10,000 
> requests per second. And we're only throwing a couple hundred at it for the 
> moment.
>
> Thanks for the suggestion about polling various stats on the server. I'll 
> see what I can rig up.
>
> On Wednesday, April 15, 2015 at 3:38:04 PM UTC-4, Glen Smith wrote:
>>
>> Cool.
>>
>> If I read right, your response time statistics graph includes:
>> 1 - network latency between the client nodes and the load balancer,
>> 2 - network latency between the load balancer and the cluster's eligible 
>> masters, and
>> 3 - performance of the load balancer.
>> My interest in checking out 1 & 2 would depend on the network topology.
>> I would for sure want to do something to rule out 3. Any possibility of 
>> letting at least one of the client nodes bypass the LB for a minute or two?
>>
>> Then, I might be tempted to set up a script to hit _cat/thread_pool for 
>> 60 seconds at a time, with various thread pools/fields, looking for spikes.
>> Maybe the same thing with _nodes/stats.
>>
>>
>>
>> On Wednesday, April 15, 2015 at 1:48:17 PM UTC-4, Daryl Robbins wrote:
>>>
>>> Thanks, Glen. Yes, I have run top: the Java Tomcat process is the only 
>>> thing running at the time. I also checked the thread activity in JProfiler 
>>> and nothing out of the ordinary popped up.
>>>
>>> On Wednesday, April 15, 2015 at 1:36:55 PM UTC-4, Glen Smith wrote:
>>>>
>>>> Have you run 'top' on the nodes?
>>>>
>>>> On Wednesday, April 15, 2015 at 8:56:20 AM UTC-4, Daryl Robbins wrote:
>>>>>
>>>>> Thanks for your response. GC was my first thought too. I have looked 
>>>>> through the logs and run the app through a profiler, and I am not seeing 
>>>>> any spike in GC activity or any other background thread when performance 
>>>>> degrades. Also, the fact that the slowdown occurs exactly every minute at 
>>>>> the same second would point me towards a more deliberate timeout or 
>>>>> heartbeat.
>>>>>
>>>>> I am running these tests in a controlled performance environment with 
>>>>> constant light to moderate load. There is no change in the behaviour when 
>>>>> under very light load. I have turned on slow logging for queries/fetches 
>>>>> but am not seeing any slow queries corresponding with the problem. The 
>>>>> only time I see a slow query is post-cold start of the search node, so it 
>>>>> is at least working.
>>>>>
>>>>> On Wednesday, April 15, 2015 at 1:00:00 AM UTC-4, Mark Walkom wrote:
>>>>>>
>>>>>> Have you checked the logs for GC events or similar? What about the 
>>>>>> web logs for events coming in?
>>>>>>
>>>>>> On 15 April 2015 at 09:03, Daryl Robbins <[email protected]> wrote:
>>>>>>
>>>>>>> I am seeing a consistent bottleneck in requests (taking about 2+ 
>>>>>>> seconds) at the same second every minute across all four of my client 
>>>>>>> nodes, which connect using the Java transport client. These nodes are 
>>>>>>> completely independent aside from their reliance on the ElasticSearch 
>>>>>>> cluster, and consequently they all happen to pause at the exact same 
>>>>>>> second every minute. The exact second when this happens varies over 
>>>>>>> time, but the four nodes always pause at the same time.
>>>>>>>
>>>>>>> I have 4 web nodes that connect to my ES cluster via the transport 
>>>>>>> client. They connect to a load balancer fronting our 3 dedicated master 
>>>>>>> nodes. The cluster contains 2 or more data nodes, depending on the 
>>>>>>> configuration. Regardless of the number, I am seeing the same symptoms.
>>>>>>>
>>>>>>> Any hints on how to proceed to troubleshoot this issue on the 
>>>>>>> ElasticSearch side would be greatly appreciated. Thanks very much!
>>>>>>>
>>>>>>>
>>>>>>> <https://lh3.googleusercontent.com/-GKiOcsPXBjI/VS2ak04mzBI/AAAAAAAAAhQ/aLDlD82AddY/s1600/Screenshot%2B2015-04-14%2B18.53.24.png>
>>>>>>>
>>>>>>>
