RE: Lock contention in pat_match_str in threaded mode

2019-10-24 Thread Brian Diekelman
Thank you for turning that around so quickly, Willy.

We'll pull down the new release when it's available.


-Original Message-
From: Willy Tarreau  
Sent: Tuesday, October 22, 2019 10:53 PM
To: Brian Diekelman 
Cc: haproxy@formilux.org
Subject: Re: Lock contention in pat_match_str in threaded mode

Hi Brian,

On Tue, Oct 22, 2019 at 04:19:58PM +0200, Willy Tarreau wrote:
> At the moment I don't know what it would take to break it down per
> thread, so I'll add a GitHub issue referencing your report so that we
> don't forget. Depending on the complexity, it may make sense to
> backport it once done, because it's still a regression.

So this morning I spent a little time on it and it was really trivial to fix.
I've now split the LRU cache per thread (just as it is when using nbproc, in
fact) so that we can get rid of the lock. The result: I jump from 67kH/s to
369kH/s on a test with 7 threads on 4 cores :-)
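
In case it helps to visualize the change, here's a minimal sketch of the idea
(not the actual patch; all names below are placeholders invented for this
example): one shared, lock-protected LRU cache becomes one private cache per
thread, so lookups no longer need any lock.

  /* Sketch only: a direct-mapped stand-in for the real LRU tree, with one
   * instance per thread so no locking is required on lookups. */
  #include <stdlib.h>

  #define MAX_THREADS   64
  #define CACHE_ENTRIES 1024

  struct tiny_cache {
      unsigned long long key[CACHE_ENTRIES];
      void              *val[CACHE_ENTRIES];
  };

  /* one private cache per thread: no sharing, hence no lock */
  static struct tiny_cache *pat_cache[MAX_THREADS];
  static __thread unsigned int tid;          /* set when each thread starts */

  void pat_cache_init(void)
  {
      for (int i = 0; i < MAX_THREADS; i++)
          pat_cache[i] = calloc(1, sizeof(*pat_cache[i]));
  }

  void *pat_cache_get(unsigned long long key)
  {
      struct tiny_cache *c = pat_cache[tid]; /* only this thread touches it */
      unsigned int idx = key % CACHE_ENTRIES;
      return (c->key[idx] == key) ? c->val[idx] : NULL;
  }

  void pat_cache_put(unsigned long long key, void *val)
  {
      struct tiny_cache *c = pat_cache[tid];
      unsigned int idx = key % CACHE_ENTRIES;
      c->key[idx] = key;                     /* direct-mapped: collisions evict */
      c->val[idx] = val;
  }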

I was able to backport it as far as 1.8, so I did, as I really consider this a
regression and it can be a showstopper for some users migrating to threads.
The gains are smaller in 1.8 since the scalability is much less linear there.
I'm going to work on issuing some releases today or over this week; I'd
encourage you to migrate to 2.0.8 once it's out.

Thanks again for your detailed report!
Willy



Lock contention in pat_match_str in threaded mode

2019-10-14 Thread Brian Diekelman
Just wanted to provide some information on what appears to be lock contention 
around ACL lookups.

We recently upgraded from haproxy-1.6 to haproxy-1.8.20 and switched from
'nbproc 8' to 'nbproc 1, nbthread 12'.

We have quite a few ACL files to sift through for domain matching -- about
19MB total.

After switching to 1.8.20 we saw elevated CPU usage under our normal peak
load, to the point where in some cases all 12 threads were consuming 100% CPU.

Looking at the 'perf' numbers:

  90.71%  haproxy       [.] pat_match_str
   0.49%  libc-2.12.so  [.] __strncasecmp_l_sse42
   0.41%  haproxy       [.] pat_match_beg
   0.34%  haproxy       [.] lru64_get
   0.26%  [kernel]      [k] _spin_lock_bh
   0.17%  haproxy       [.] pat_match_sub
   0.16%  libc-2.12.so  [.] _int_malloc
   0.13%  [kernel]      [k] _spin_lock
   0.10%  haproxy       [.] process_runnable_tasks
   0.09%  libc-2.12.so  [.] malloc


We looked up that function, and it appeared that all requests were going
through a single spin lock [0].
The LRU tree appears to be initialized only when 'global.tune.pattern_cache'
is set, which is tied to the 'tune.pattern.cache-size' config value [1] [2].
So, we specified 'tune.pattern.cache-size 0' to disable the cache code path and 
avoid the spinlock, and the CPU numbers dropped back down to where they had 
been previously (and lru64_get was no longer in the execution traces).
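
For reference, the relevant parts of our configuration with the workaround
applied look like this (unrelated settings omitted):

  global
      nbproc 1
      nbthread 12
      # a size of 0 disables the LRU pattern cache (and its spinlock) entirely
      tune.pattern.cache-size 0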

I realize our use case is probably an edge case (that's a lot of ACL data -- we
have plans to convert these to map files now that we're on haproxy-1.8).
I just wanted to make the developers aware and to help anyone else who runs
into this.

Thank you,
-Brian

[0] 
https://github.com/haproxy/haproxy/blob/33ccf1cce04069f27ebb868f5617672ed4e21cf4/src/pattern.c#L487
[1] 
https://cbonte.github.io/haproxy-dconv/1.8/configuration.html#3.2-tune.pattern.cache-size
[2] 
https://github.com/haproxy/haproxy/blob/33ccf1cce04069f27ebb868f5617672ed4e21cf4/src/pattern.c#L2688