Thank you, Glen. I appreciate your insight!

Here is our environment:

<https://lh3.googleusercontent.com/-PLejC0Yt98I/VS7BDRa23pI/AAAAAAAAAhk/MVoWqrRI8ls/s1600/ES%2BSetup.png>
All nodes are running in a VPC within the same AWS region, so inter-node 
latency should be minimal.

I was thinking the same thing about the ES load balancer. I was wondering 
whether we were hitting a keepalive timeout or whether the extra level of 
indirection was otherwise causing a problem. So, earlier today I tried 
removing the ES LB between the API Server nodes (ES clients) and the 
eligible masters. Each API node is now configured with the private IPs of 
the three eligible masters, roughly as in the sketch below. There was no 
change in the observed behaviour following this change.
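
For reference, the client wiring now looks roughly like this (ES 1.x 
transport client; the cluster name and master IPs below are placeholders, 
not our actual values):

import org.elasticsearch.client.transport.TransportClient;
import org.elasticsearch.common.settings.ImmutableSettings;
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.common.transport.InetSocketTransportAddress;

public class EsClientFactory {

    // Transport client pointed straight at the three eligible masters
    // instead of going through the ES load balancer.
    // Cluster name and IPs are placeholders.
    public static TransportClient build() {
        Settings settings = ImmutableSettings.settingsBuilder()
                .put("cluster.name", "our-cluster")
                .build();

        return new TransportClient(settings)
                .addTransportAddress(new InetSocketTransportAddress("10.0.1.11", 9300))
                .addTransportAddress(new InetSocketTransportAddress("10.0.1.12", 9300))
                .addTransportAddress(new InetSocketTransportAddress("10.0.1.13", 9300));
    }
}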

The Load Balancer in front of the API Servers is pre-warmed to 10,000 
requests per second, and we're only throwing a couple hundred requests per 
second at it at the moment.

Thanks for the suggestion about polling various stats on the server. I'll 
see what I can rig up.
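
Probably something along these lines to start: a rough Java sketch that 
polls _cat/thread_pool once a second for a minute and timestamps each line, 
so spikes can be lined up against the response-time graph. The node address 
and the column list are just placeholders, and the same loop would work for 
_nodes/stats by swapping the URL.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.text.SimpleDateFormat;
import java.util.Date;

public class ThreadPoolPoller {

    public static void main(String[] args) throws Exception {
        // Placeholder node address; point this at whichever data or master
        // node you want to watch. Swap the URL for _nodes/stats as needed.
        String endpoint = "http://10.0.1.21:9200/_cat/thread_pool"
                + "?h=host,search.active,search.queue,search.rejected,"
                + "index.active,index.queue,index.rejected";
        SimpleDateFormat fmt = new SimpleDateFormat("HH:mm:ss.SSS");

        // Poll once a second for 60 seconds, printing a timestamp with
        // every line of output so spikes can be matched against the graph.
        for (int i = 0; i < 60; i++) {
            HttpURLConnection conn =
                    (HttpURLConnection) new URL(endpoint).openConnection();
            conn.setConnectTimeout(2000);
            conn.setReadTimeout(2000);
            BufferedReader in = new BufferedReader(
                    new InputStreamReader(conn.getInputStream()));
            try {
                String line;
                while ((line = in.readLine()) != null) {
                    System.out.println(fmt.format(new Date()) + "  " + line);
                }
            } finally {
                in.close();
                conn.disconnect();
            }
            Thread.sleep(1000L);
        }
    }
}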

On Wednesday, April 15, 2015 at 3:38:04 PM UTC-4, Glen Smith wrote:
>
> Cool.
>
> If I read right, your response time statistics graph includes:
> 1 - network latency between the client nodes and the load balancer
> 2 - network latency between the load balancer and the cluster's eligible 
> masters
> 3 - performance of the load balancer itself
> My interest in checking out 1 & 2 would depend on the network topology.
> I would for sure want to do something to rule out 3. Any possibility of 
> letting at least one of the client nodes bypass the LB for a minute or two?
>
> Then, I might be tempted to set up a script to hit _cat/thread_pool for 60 
> seconds at a time, with various thread pools/fields, looking for spikes.
> Maybe the same thing with _nodes/stats.
>
>
>
> On Wednesday, April 15, 2015 at 1:48:17 PM UTC-4, Daryl Robbins wrote:
>>
>> Thanks, Glen. Yes, I have run top: the Java Tomcat process is the only 
>> thing running at the time. I also checked the thread activity in JProfiler 
>> and nothing out of the ordinary popped up.
>>
>> On Wednesday, April 15, 2015 at 1:36:55 PM UTC-4, Glen Smith wrote:
>>>
>>> Have you run 'top' on the nodes?
>>>
>>> On Wednesday, April 15, 2015 at 8:56:20 AM UTC-4, Daryl Robbins wrote:
>>>>
>>>> Thanks for your response. GC was my first thought too. I have looked 
>>>> through the logs and run the app through a profiler, and I am not seeing 
>>>> any spike in GC activity or any other background thread when performance 
>>>> degrades. Also, the fact that the slowdown occurs exactly every minute at 
>>>> the same second would point me towards a more deliberate timeout or 
>>>> heartbeat.
>>>>
>>>> I am running these tests in a controlled performance environment with 
>>>> constant light to moderate load. There is no change in the behaviour 
>>>> when under very light load. I have turned on slow logging for 
>>>> queries/fetches but am not seeing any slow queries corresponding to the 
>>>> problem. The only time I see a slow query is right after a cold start of 
>>>> the search node, so the slow log is at least working.
>>>>
>>>> On Wednesday, April 15, 2015 at 1:00:00 AM UTC-4, Mark Walkom wrote:
>>>>>
>>>>> Have you checked the logs for GC events or similar? What about the web 
>>>>> logs for events coming in?
>>>>>
>>>>> On 15 April 2015 at 09:03, Daryl Robbins <[email protected]> wrote:
>>>>>
>>>>>> I am seeing a consistent bottleneck in requests (taking about 2+ 
>>>>>> seconds) at the same second every minute across all four of my client 
>>>>>> nodes, which connect using the Java transport client. These nodes are 
>>>>>> completely independent aside from their reliance on the Elasticsearch 
>>>>>> cluster, and yet they all happen to pause at the exact same second 
>>>>>> every minute. The exact second when this happens varies over time, but 
>>>>>> the four nodes always pause at the same time.
>>>>>>
>>>>>> I have 4 web nodes that connect to my ES cluster via the transport 
>>>>>> client. They connect to a load balancer fronting our 3 dedicated 
>>>>>> master nodes. The cluster contains 2 or more data nodes depending on 
>>>>>> the configuration. Regardless of the number, I am seeing the same 
>>>>>> symptoms.
>>>>>>
>>>>>> Any hints on how to troubleshoot this issue on the Elasticsearch side 
>>>>>> would be greatly appreciated. Thanks very much!
>>>>>>
>>>>>>
>>>>>> <https://lh3.googleusercontent.com/-GKiOcsPXBjI/VS2ak04mzBI/AAAAAAAAAhQ/aLDlD82AddY/s1600/Screenshot%2B2015-04-14%2B18.53.24.png>
>>>>>>
>>>>>>
>>>>>
>>>>>

