Re: Buffer limits when adding a large number of CA certs into one ca-file via socket

2022-08-17 Thread Lais, Alexander
Dear William,

Thank you. We will adjust our planning accordingly.

Kind regards,
Alex

> On 16. Aug 2022, at 15:24, William Lallemand  wrote:
> 
> On Tue, Aug 16, 2022 at 11:16:43AM +0000, Lais, Alexander wrote:
>> Hi William,
>> 
>> Thank you! I figured you were on holiday. A lot of our team is as well.
>> 
>> Do you see this being backported to 2.5 / 2.6 (LTS) as well?
> 
> Unfortunately we usually don't backport this kind of feature, as it is
> an API change and could break things.
> The stable branches are meant to be maintenance only.
> 
> Also, it will probably need some adjustments and new keywords to remove a
> specific index in the file, and that kind of thing.
> 
> -- 
> William Lallemand




Re: Buffer limits when adding a large number of CA certs into one ca-file via socket

2022-08-16 Thread Lais, Alexander
Hi William,

Thank you! I figured you were on holiday. A lot of our team is as well.

Do you see this being backported to 2.5 / 2.6 (LTS) as well?

Thanks and kind regards,
Alex

> On 16. Aug 2022, at 11:07, William Lallemand  wrote:
> 
> On Thu, Aug 04, 2022 at 11:57:16AM +0000, Lais, Alexander wrote:
>> Hi William,
>> 
>> Thanks again for the PoC you referenced in the GitHub issue.
>> This would solve the use case for us and would fix the ca-cert editing / 
>> updating feature introduced in HAProxy 2.5.
>> 
>> Can we further support the development, be it with code or testing, to 
>> get from this PoC to a full fix in one of the next release streams?
>> 
>> Thanks and kind regards,
>> Alex
>> 
> Hello Alex,
> 
> Sorry for the late reply, I was on vacation for a few days. 
> 
> I'm going to finish the development and tests for the feature so it
> can be integrated in the next 2.7 major version.
> 
> Regards,
> 
> -- 
> William Lallemand




Buffer limits when adding a large number of CA certs into one ca-file via socket

2022-07-26 Thread Lais, Alexander
Dear all,

We are now using the new feature of adding CA files dynamically via the stats / 
admin socket.

Assuming that the CA file does not exist yet, our understanding is that we:

1. Create a CA file (new ssl ca-file customer-cas.pem)

2. Set the content of the CA file with payload notation:
"set ssl ca-file customer-cas.pem <<\n[a bunch of PEM blocks]\n"

3. Commit the CA file (commit ssl ca-file customer-cas.pem)
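
For illustration, the full sequence over the admin socket with socat looks 
roughly like this (a minimal sketch; the socket path is an assumption, adjust 
to your setup):

    # 1. create the new, empty CA file entry
    echo "new ssl ca-file customer-cas.pem" |
        socat stdio unix-connect:/var/run/haproxy.sock

    # 2. upload the PEM blocks as a payload ('<<' marks the start of the
    #    payload, which is terminated by an empty line)
    printf 'set ssl ca-file customer-cas.pem <<\n%s\n\n' \
        "$(cat customer-cas.pem)" |
        socat stdio unix-connect:/var/run/haproxy.sock

    # 3. activate the new content
    echo "commit ssl ca-file customer-cas.pem" |
        socat stdio unix-connect:/var/run/haproxy.sock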

In step 2 we are reaching the limit of the global buffer size (defined via 
tune.bufsize; ours is tuned to ca. 71k, allowing for a comfortable 64k of 
headers).
Some of the CA files that we want to add are larger than this buffer and are 
not processed properly by the CLI.
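
For context, the relevant global tuning looks roughly like this (the exact 
value below is illustrative, not our literal config):

    global
        # default is 16384 bytes; raised so that ~64k of headers still fit
        tune.bufsize 71680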

It is understandable that the CLI socket needs some buffer and that this buffer 
is limited.
That said, reading the CA file data from disk does not appear to impose any 
size limit. We recently implemented a dynamic update to avoid having to reload 
the HAProxy process whenever there was a change, and ran into this issue.

We’ve added a feature request on GitHub: 
https://github.com/haproxy/haproxy/issues/1805

This e-mail is to ask whether we have overlooked something in terms of 
configuration options, either for the socket or in how to use the CLI for 
creating ca-files.

Thanks in advance,
Alex


Re: Granular rate-limits, metrics and stick-tables

2022-05-13 Thread Lais, Alexander
Hi Tristan,

I can’t add anything (yet) besides saying thank you for the write-up.
I’m mostly writing this because I don’t see the message in the mailing list 
archive and found the actual mail in my junk mail folder for some reason.

Cheers,
Alex

> On 11. May 2022, at 21:34, Tristan  wrote:
> 
> Hi,
> 
> I'm trying to find a better approach than the one I've used so far for 
> relatively complex rate-limit management with HAProxy.
> 
> The vast majority of documentation out there focuses on a single global 
> rate-limit or a single proxy, with minimal documentation on more granular 
> approaches.
> 
> Unfortunately I'm afraid that without context it will not make much sense, so 
> apologies in advance for the long message...
> 
> ---
> 
> Here's what I'm aiming for:
> 
> 1. Rate-limit "zones" (in nginx parlance)
> 
> Essentially, arbitrary groups of paths/domains/backends/etc that share common 
> rate-limit thresholds and counters.
> 
> That bit isn't terribly complex and a few ACLs do the trick just fine.
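> 
> For illustration, the zone part is along these lines (a simplified sketch, 
> not my actual rules; names are just examples):
> 
>    # a couple of "zones" as plain ACLs
>    acl zone_api  path_beg /api/
>    acl zone_auth path_beg /login /oauth
>    # each zone tracks the source in a shared rate-limit table
>    http-request track-sc0 src table st_ratelimits if zone_api or zone_auth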
> 
> 2. A tiered rate-limit system
> 
> That is, various levels of "infringement" with their own triggers and actions.
> 
> We have a few classes of traffic at all times:
> - Normal: the traffic we love, but irrelevant for this discussion
> - Silly: Public API + many consumers + varying levels of expertise = 
> occasional spam from the likes of looping requests ad nauseam despite 429s
> - Infringing: attempting to abuse our services in a way or another 
> (more-or-less polite scrapers, ToS abusers, ...)
> - Malicious: The annoying part of the internet (skiddies and their crappy 
> booters, IPv4 space scanners leaking non-public IPs, vulnerability exploit 
> attempts, ...)
> 
> So while we are typically happy to be heavy-handed in general (users != 
> revenue for us, and we have limited access to compute, so the choice isn't 
> very hard) and issue blanket bans, we'd also like legitimate-but-misguided 
> users to have some leeway.
> 
> Finally, some of the "infringing" behaviors are more complex to detect (ie 
> they typically rely on faked but believable headers and require 
> multi-request-pattern rules for reliable detection) and we want to react to 
> those in more elaborate ways than just banning them after 1 request (maybe 
> split responses between fake data, invalid responses and conn resets; just 
> being creative in hinting at them to go annoy someone else), as an outright 
> ban would reveal that we identified them and allow them to test evasion 
> methods a little too easily.
> 
> 3. Tracking a couple of other interesting dimensions besides this
> 
> Could be anything ACL-able. For example TLSv1.2 vs. TLSv1.3 adoption.
> This is essentially about making up extra metrics we are interested in for 
> anything ACL-able we might care about.
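> 
> A tiny sketch of that idea (assuming sc1 is free and gpc index 1 is spare 
> in st_ratelimits):
> 
>    # count TLSv1.3 clients into a spare gpc, visible per-table to Prometheus
>    http-request track-sc1 src table st_ratelimits
>    http-request sc-inc-gpc(1,1) if { ssl_fc_protocol TLSv1.3 }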
> 
> 4. Being able to track sources/requests flagged with Prometheus
> 
> We can have 99 GPCs per table, so in theory none of this is an issue, however 
> that data isn't super easily accessible as-is:
> - A log-based approach isn't viable due to resource constraints
> - I'd prefer not having to introduce some admin API parser to extract and 
> process data this way if I can avoid it
> - On the other hand, we have a very streamlined/cheap/scalable/etc long-term 
> Prometheus setup, so this is our preferred approach, and HAProxy does nice 
> things like per-stick-table entry counts out of the box, so using that is 
> ideal. Same idea for frontend-level `http-request return` versus dedicated 
> backends
> 
> ---
> 
> Now that the requirements are hopefully a bit clear, here's the general 
> approach I came up with
> 
> #--- First, a few stick-tables since the number of entries is exported as 
> prom metrics
> 
> # Global source concurrent connections
> backend st_conns from defaults-base
>    stick-table type ip size 100k expire 300s store conn_cur
> 
> # Generic rate-limits, 1 gpc+gpc_rate per zone (I'm fine with not having a 
> dedicated metric per zone as-is)
> backend st_ratelimits from defaults-base
>    stick-table type ip size 100k expire 300s store gpc(2),gpc_rate(2,60s)
> 
> # For multi-request pattern analysis we count infringing requests and flag 
> infringers only after enough "suspicious" requests, to avoid false positives
> backend st_infringing_grace from defaults-base
>    stick-table type ip size 30k expire 600s store gpc(1),gpc_rate(1,60s)
> 
> # Generic counter for silly requests (for example any request that we reject 
> due to rate-limits...)
> backend st_badreqs from defaults-base
>    stick-table type ip size 30k expire 600s store gpc(1),gpc_rate(1,60s)
> 
> # Soft bans, ie you get a response page telling you you're banned
> backend st_ban_soft from defaults-base
>    stick-table type ip size 30k expire 600s store gpc(1),gpc_rate(1,60s)
> 
> # Hard bans, ie we silent-drop the requests
> backend st_ban_hard from defaults-base
>    stick-table type ip size 30k expire 600s store gpc(1),gpc_rate(1,60s)
> 
> #--- Then a common default for our frontends
> # 

Check interval rise and fall behaviour

2022-03-29 Thread Lais, Alexander
Dear all,

We are using the backend health checks to disable flapping backends.

The default values for rise and fall are 2 consecutive successful and 3 
consecutive failed checks.

Our check interval is at 1000ms (a little frequent, potentially part of the 
problem).
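
For reference, the server lines look roughly like this (address, port and 
health-check path are placeholders; rise/fall spelled out even though they 
are the defaults):

    backend http-routers-http1
        # HTTP health check (path here is invented)
        option httpchk GET /health
        # probe every 1000ms; 2 consecutive successes to rise,
        # 3 consecutive failures to fall
        server node4 10.0.0.4:8080 check inter 1000 rise 2 fall 3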

Here is what we observed, using HAProxy 2.4.4:

1. Falling

It started with the backend being up and then going down (fall).

> 2022-03-23T21:31:54.942Z  Health check for server 
> http-routers-http1/node4 failed, reason: Layer4 timeout, check duration: 
> 1000ms, status: 2/3 UP.
> 2022-03-23T21:31:56.920Z  Health check for server 
> http-routers-http1/node4 failed, reason: Layer4 timeout, check duration: 
> 1001ms, status: 1/3 UP.
> 2022-03-23T21:31:57.931Z  Health check for server 
> http-routers-http1/node4 succeeded, reason: Layer7 check passed, code: 200, 
> check duration: 1ms, status: 3/3 UP.
> 2022-03-24T10:03:27.223Z  Health check for server 
> http-routers-http1/node4 failed, reason: Layer7 wrong status, code: 503, 
> info: "Service Unavailable", check duration: 1ms, status: 2/3 UP.
> 2022-03-24T10:03:28.234Z  Health check for server 
> http-routers-http1/node4 failed, reason: Layer7 wrong status, code: 503, 
> info: "Service Unavailable", check duration: 1ms, status: 1/3 UP.
> 2022-03-24T10:03:29.237Z  Health check for server 
> http-routers-http1/node4 failed, reason: Layer7 wrong status, code: 503, 
> info: "Service Unavailable", check duration: 1ms, status: 0/2 DOWN.

We go down from 3/3 to 2/3 to 1/3, and then jump back up again to 3/3 after a 
single success. My assumption is that the success counted as 2 out of the 2 
needed for rising, i.e. 2/2, which is displayed as 3/3 as the backend is now 
considered up.

The backend stays up for a while and then goes down with the health-check 
sequence I expect, i.e. 3/3, 2/3, 1/3, 0/3 -> 0/2 (as we need 2 for rise).

2. Rising

> 2022-03-24T10:12:26.846Z  Health check for server 
> http-routers-http1/node4 failed, reason: Layer4 timeout, check duration: 
> 1000ms, status: 0/2 DOWN.
> 2022-03-24T10:12:29.843Z  Health check for server 
> http-routers-http1/node4 failed, reason: Layer4 connection problem, info: 
> "Connection refused", check duration: 1ms, status: 0/2 DOWN.
> 2022-03-24T10:13:43.902Z  Health check for server 
> http-routers-http1/node4 failed, reason: Layer7 wrong status, code: 503, 
> info: "Service Unavailable", check duration: 2ms, status: 0/2 DOWN.
> 2022-03-24T10:14:03.039Z  Health check for server 
> http-routers-http1/node4 succeeded, reason: Layer7 check passed, code: 200, 
> check duration: 1ms, status: 1/2 DOWN.
> 2022-03-24T10:14:04.079Z  Health check for server 
> http-routers-http1/node4 succeeded, reason: Layer7 check passed, code: 200, 
> check duration: 1ms, status: 3/3 UP.

So coming up (rise), it goes from 0/2 probes to 1/2 and then to 3/3. My 
assumption is that it reaches 2/2, is considered up, and is bumped to 3/3 
because for fall we now need 3 failed probes.


The documentation describes rise / fall as the "number of subsequent probes 
that succeeded / failed".
From my observations it looks more like a sliding window over the last n 
probes, i.e. when fall is larger than rise, it is easier to come back up with 
a single successful probe.

Maybe I’m misreading the log outputs or drawing the wrong conclusions.

If someone knows by heart how it’s supposed to work based on the code, that 
would be great. Otherwise we can dig some more ourselves.

Thanks and kind regards,
Alex


ACL execution order, short circuit behaviour?

2022-02-28 Thread Lais, Alexander
Dear all,

I’m trying to understand how ACL chains, e.g. for `http-request deny`, are 
executed, and whether they support short-circuit evaluation.


Example:

acl1: IP in a particular range
acl2: complex regex match against a long list of patterns

http-request deny if acl1 !acl2


That would mean: block the request if it falls within the IP range of acl1 
and does not match any of the patterns in acl2's regex list.
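
To make this concrete, a hypothetical version (range and file path invented):

    # acl1: source address in a particular range
    acl from_suspect_net src 203.0.113.0/24
    # acl2: long list of regex patterns loaded from a file
    acl known_good_path path_reg -f /etc/haproxy/good-paths.regex
    # block if the source matches and none of the patterns match
    http-request deny if from_suspect_net !known_good_path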

I want to understand whether evaluation stops once acl1 does not match, or 
whether the long list of regexes is still evaluated.

My programmer’s intuition expects evaluation to stop when acl1 does 
not match.
The documentation mentions that unused ACLs don’t have performance impact, 
which would indicate the same.
When acl1 matches, of course the long list of regexes must be processed.

Thanks in advance,
Alex