Hello Dave,

On Fri, Sep 11, 2015 at 04:36:44PM -0400, Dave Stern wrote:
> We run haproxy 1.5.14 on Ubuntu 14.04 in AWS. It load balances connections
> from heroku to our backend, a neo4j cluster of multiple instances. It also
> terminates SSL and handles auth. Connections to the backend are over
> private network space within our AWS network and via unencrypted http.
> 
> During a recent event with heavy load, we saw log entries with the same
> repeating pattern of multiple lines with the exact same Tq, Tc, Tr and Tt
> values, but with different requests from different clients routed to
> different backend pool members. We're not sure what's causing this or if it
> relates to performance issues we saw during our events. Here's some log
> lines with excerpted fields. We send a custom header with the query name
> for metrics, which I've replaced with generic entries. I also replaced the
> real client IP with 1.1.1.1, 2.2.2.2, etc.
> 
> 
> Sep 10 20:00:00 Tq=1060 Tw=0 Tc=1204 Tr=922 Tt=3186 cip=1.1.1.1:37480
> bs=db_production_read:production-04 hdrs="{query_1}"
> Sep 10 20:00:00 Tq=1060 Tw=0 Tc=1204 Tr=922 Tt=3186 cip=1.1.1.1:37481
> bs=db_production_read:production-02 hdrs="{query_2}"
> Sep 10 20:00:00 Tq=1060 Tw=0 Tc=1204 Tr=922 Tt=3186 cip=1.1.1.1:37482
> bs=db_production_read:production-03 hdrs="{query_3}"
> Sep 10 20:00:00 Tq=1060 Tw=0 Tc=1204 Tr=922 Tt=3186 cip=1.1.1.1:37484
> bs=db_production_read:production-04 hdrs="{query_4}"
> Sep 10 20:00:00 Tq=1060 Tw=0 Tc=1204 Tr=922 Tt=3186 cip=1.1.1.1:37483
> bs=db_production_read:production-02 hdrs="{query_5}"
> Sep 10 20:00:00 Tq=1060 Tw=0 Tc=1204 Tr=922 Tt=3186 cip=1.1.1.1:37485
> bs=db_production_read:production-03 hdrs="{query_6}"
> Sep 10 20:00:00 Tq=1060 Tw=0 Tc=1204 Tr=922 Tt=3186 cip=1.1.1.1:37486
> bs=db_production_read:production-05 hdrs="{query_7}"
> Sep 10 20:00:00 Tq=1060 Tw=0 Tc=1204 Tr=922 Tt=3186 cip=2.2.2.2:37488
> bs=db_production_read:production-03 hdrs="{query_8}"
> Sep 10 20:00:00 Tq=1060 Tw=0 Tc=1204 Tr=922 Tt=3186 cip=1.1.1.1:37487
> bs=db_production_write:production-01 hdrs="{query_9}"
> Sep 10 20:00:00 Tq=1060 Tw=0 Tc=1204 Tr=922 Tt=3186 cip=1.1.1.1:37488
> bs=db_production_read:production-04 hdrs="{query_10}"
> Sep 10 20:00:00 Tq=1060 Tw=0 Tc=1204 Tr=922 Tt=3186 cip=1.1.1.1:37489
> bs=db_production_read:production-02 hdrs="{query_11}"
> Sep 10 20:00:00 Tq=1060 Tw=0 Tc=1204 Tr=922 Tt=3186 cip=1.1.1.1:37490
> bs=db_production_read:production-03 hdrs="{query_12}"
> Sep 10 20:00:00 Tq=1060 Tw=0 Tc=1204 Tr=922 Tt=3186 cip=1.1.1.1:37491
> bs=db_production_write:production-01 hdrs="{query_9}"
> Sep 10 20:00:00 Tq=1060 Tw=0 Tc=1204 Tr=922 Tt=3186 cip=1.1.1.1:37492
> bs=db_production_read:production-04 hdrs="{query_13}"
> Sep 10 20:00:00 Tq=1060 Tw=0 Tc=1204 Tr=922 Tt=3186 cip=3.3.3.3:53958
> bs=db_production_read:production-02 hdrs="{query_10}"
> Sep 10 20:00:00 Tq=1060 Tw=0 Tc=1204 Tr=922 Tt=3186 cip=3.3.3.3:53959
> bs=db_production_read:production-03 hdrs="{query_11}"
> Sep 10 20:00:00 Tq=1060 Tw=0 Tc=1204 Tr=922 Tt=3186 cip=3.3.3.3:53960
> bs=db_production_read:production-04 hdrs="{query_12}"
> Sep 10 20:00:00 Tq=1060 Tw=0 Tc=1204 Tr=922 Tt=3186 cip=2.2.2.2:37489
> bs=db_production_write:production-01 hdrs="{query_9}"
> Sep 10 20:00:00 Tq=1060 Tw=0 Tc=1204 Tr=922 Tt=3186 cip=1.1.1.1:37493
> bs=db_production_write:production-01 hdrs="{query_9}"
> Sep 10 20:00:00 Tq=1060 Tw=0 Tc=1204 Tr=922 Tt=3186 cip=3.3.3.3:53961
> bs=db_production_write:production-01 hdrs="{query_9}"
> Sep 10 20:00:00 Tq=1060 Tw=0 Tc=1204 Tr=922 Tt=3186 cip=2.2.2.2:37490
> bs=db_production_write:production-01 hdrs="{query_9}"
> Sep 10 20:00:00 Tq=1060 Tw=0 Tc=1204 Tr=922 Tt=3186 cip=3.3.3.3:53962
> bs=db_production_write:production-01 hdrs="{query_14}"
> Sep 10 20:00:00 Tq=1060 Tw=0 Tc=1204 Tr=922 Tt=3186 cip=2.2.2.2:37491
> bs=db_production_write:production-01 hdrs="{query_9}"
> Sep 10 20:00:00 Tq=1060 Tw=0 Tc=1204 Tr=922 Tt=3186 cip=3.3.3.3:53963
> bs=db_production_write:production-01 hdrs="{query_9}"
> Sep 10 20:00:00 Tq=1060 Tw=0 Tc=1204 Tr=922 Tt=3186 cip=1.1.1.1:37495
> bs=db_production_write:production-01 hdrs="{query_9}"
> Sep 10 20:00:00 Tq=1060 Tw=0 Tc=1204 Tr=922 Tt=3186 cip=3.3.3.3:53964
> bs=db_production_write:production-01 hdrs="{query_9}"
> Sep 10 20:00:00 Tq=1060 Tw=0 Tc=1204 Tr=922 Tt=3186 cip=3.3.3.3:53965
> bs=db_production_read:production-02 hdrs="{query_13}"
> Sep 10 20:00:00 Tq=1060 Tw=0 Tc=1204 Tr=922 Tt=3186 cip=1.1.1.1:37496
> bs=db_production_write:production-01 hdrs="{query_9}"
> Sep 10 20:00:00 Tq=1060 Tw=0 Tc=1204 Tr=922 Tt=3186 cip=1.1.1.1:37498
> bs=db_production_read:production-04 hdrs="{query_15}"
> Sep 10 20:00:00 Tq=1060 Tw=0 Tc=1204 Tr=922 Tt=3186 cip=2.2.2.2:37492
> bs=db_production_read:production-02 hdrs="{query_14}"
> Sep 10 20:00:00 Tq=1060 Tw=0 Tc=1204 Tr=922 Tt=3186 cip=1.1.1.1:37499
> bs=db_production_write:production-01 hdrs="{query_13}"
> Sep 10 20:00:00 Tq=1060 Tw=0 Tc=1204 Tr=922 Tt=3186 cip=2.2.2.2:37493
> bs=db_production_read:production-03 hdrs="{query_10}"
> Sep 10 20:00:00 Tq=1060 Tw=0 Tc=1204 Tr=922 Tt=3186 cip=2.2.2.2:37495
> bs=db_production_read:production-02 hdrs="{query_12}"
> 
> The logs repeat like this with different sets of entries with the same Tq,
> Tc, Tr and Tt for all lines. Is this a known phenomenon?

I don't see a problem here, but unfortunately your logs are missing
some important information, such as the position in the queue. What
most likely happened is that you received many requests at once, the
server did not respond for some time (maybe it was stuck doing some
heavy work), then suddenly released some room and all the pending
requests were processed at once.
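
If you're using a custom log-format (which your lines suggest), adding
%sq/%bq there will expose the queue positions. Something along these
lines (untested, and only guessing at your current format, so adapt the
field names to match yours):

  log-format Tq=%Tq\ Tw=%Tw\ Tc=%Tc\ Tr=%Tr\ Tt=%Tt\ cip=%ci:%cp\ bs=%b:%s\ sq=%sq\ bq=%bq\ hdrs=%hr

Otherwise the default "option httplog" format already reports these
counters in its %sq/%bq fields.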

To be honest I find it surprising that the connect time is the same
for all requests, and even more so that the response time is the same.
I could have understood it for very small values, but here we're at
almost one second, and such long response times generally tend to be
distributed much more widely.

> A bug in the logging or code?

I don't believe so, because other fields differ (eg: source address
and port) and each log line is built from scratch for its own request,
using information taken from the session itself.

> Some kind of connection reuse we're not aware of?

No, as you can see the sources (address, port) differ.

There could be another explanation: maybe you're running in a VM with
a broken clock jumping back and forth (as in many VMs), so the times
you're seeing are the result of time correction, leaving you with only
a few discrete, valid time events:
  - the time all connections were accepted
  - the time all connections got dequeued
  - the time all of them managed to connect
  - the time all responses were received

For example, if all the events spanned one second but only 5 distinct
clock values were observed within that second, you would get identical
values everywhere as long as the real events differ by less than the
gap between two successive clock values (eg: 200ms).
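
To illustrate with a toy example (this is not haproxy code, and the
200ms tick and the event times are just made-up values), durations
measured through such a coarse clock come out identical even though
the real timings differ:

  # toy model: a clock that only advances every 200ms (made-up value)
  TICK_MS = 200

  def coarse(t_ms):
      # the time as seen through the coarse clock
      return (t_ms // TICK_MS) * TICK_MS

  # three requests with slightly different real event times
  # (accepted, connection established, response received)
  events = [(10, 230, 450), (60, 270, 470), (90, 310, 490)]

  for accept, connect, response in events:
      tc = coarse(connect) - coarse(accept)    # "Tc" as it would be logged
      tr = coarse(response) - coarse(connect)  # "Tr" as it would be logged
      print(tc, tr)                            # prints "200 200" three times

The real connect and response times differ by tens of milliseconds,
yet all three requests report exactly the same values.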

Hoping this helps,
Willy

