Alan,

My original reply was confusingly brief. I've clarified below, and I've also 
put the module we wrote on GitHub in case it helps:

https://github.com/claudebrown/freeradius-server/compare/master...rlm_tagfiles

(about 60 lines of C beyond usual module plumbing; 250 lines in total)


Alan DeKok wrote:
> 
> > - Allow high rate of user-by-user updates; i.e. avoid config re-write as
> > per "rlm_fastfile"
> 
>   ?  The "fastusers" module is deprecated, because the "files" module is
> just as fast.  The "files" module also can be HUP'd, so it can be
> reloaded on the fly.

We avoided both "fastusers" and reloading "files" on the fly because of the 
number of updates we have to our user setup.  The rate of change to our 
customer records would require a reload every few seconds during most of the day.

We had concerns in two areas:
- The time to re-write the config and then re-load it so frequently. This may 
become a performance problem as our user base grows to 250K.
- The risk of using the reload mechanism in a way that didn't seem consistent 
with its design intent, or with the likely usage pattern of reloads once a day 
or every few hours.

> > - Simple for stability: no shared in-memory state (avoid locking and
> > races)
> 
>   The server core takes care of that when the "files" module is reloaded.
> 

These "Simple for stability" points were goals for our own code. They weren't 
concerns we had about the existing code-base.

FreeRADIUS core is very stable. But MySQL adds instability we have been unable 
to identify or reproduce in our environment.

A crucial success factor for us was to ensure our module code was so simple it 
was very easy to be confident that stability was maintained. The strategy was 
to minimise the amount of software outside FreeRADIUS core.

> 
>   Daily config reloads are easy.
> 

Agreed. If we only needed daily, the "files" module would be perfect.

>   Say you have a format similar to the "users" file, with one user per
> file.  Loading 100K users will mean 100K file reads, and that can take a
> long time.

The module doesn't re-implement the "users" format or have a "users" file for 
every user.  It does not read 100K (or even 10) files at start-up.

The "files" module is used directly with a single normal "users" file just as 
per any normal FreeRADIUS deployment.


> > We achieved all these goals and can now bring all our customers
> > back onto our service in about five minutes.
> 
>   5 minutes for what, exactly?
> 

When large parts of our WiMAX network are restarted due to maintenance or 
failure, the customer devices re-join the network. Whilst this doesn't happen 
often, when it does as many as 50K devices will simultaneously ask to rejoin 
the network.  We need to service this sudden and dramatic backlog as quickly 
as possible.

With the "files" module this is a breeze with a single server.  It just eats it 
up and everything comes back in a few minutes. Importantly, our testing shows 
the design goal of 250K users would also be met with one server.

But with "rlm_sql" and MySQL we could not do it. The radiusd would start slowly 
grinding to a halt roughly as we reached 200 auths per sec (with EAP, this is 
about 30 devices per sec).  The radiusd log reported "Unresponsive child" in a 
MySQL module and gradually all the database concurrency would disappear as 
those threads were lost for further work.

After a lot of effort testing and experimenting with all sorts of things to 
isolate or avoid this problem, we did get a lot of improvement. But mostly what 
we achieved was a drop in the probability of losing threads. Inevitably the 
next larger network-outage event would re-trigger the issue.

With our new far simpler approach, all of this has gone away because we are now 
using the "files" module and "users" file directly. The speed of authentication 
is essentially as per that module.

Our new module adds an extra attribute to the Access-Request before it is 
processed by the "files" module.  The extra attribute can be any text attribute 
(we use "Reply-Message" to be perverse) and can have any value.  Normal "files" 
matching (typically using DEFAULT entries) then determines the attributes 
in the Access-Accept.

The value of the extra attribute is in essence obtained like this:
1. Format a filename such as "/blah/%{User-Name}"
2. Read a line from this file
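
For illustration only, here is a minimal sketch of those two steps in plain C. 
It is not the code from the GitHub branch: the function name "read_user_tag", 
the buffer sizes and the refusal of names containing "/" are all assumptions 
made for this example, and the FreeRADIUS plumbing that expands the filename 
and adds the attribute to the request is left out.

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper (an assumption, not the real module code): read the
 * first line of "<dir>/<user>" into tag, with the line ending removed.
 * Returns 0 on success, -1 if the name is unsafe, the path is too long,
 * or the file is missing or empty.
 */
static int read_user_tag(const char *dir, const char *user,
                         char *tag, size_t taglen)
{
    char path[1024];
    FILE *fp;
    int n;

    /* Refuse names that could escape the tag directory. */
    if (user[0] == '\0' || strchr(user, '/') != NULL ||
        strcmp(user, ".") == 0 || strcmp(user, "..") == 0)
        return -1;

    /* Step 1: format the filename. */
    n = snprintf(path, sizeof(path), "%s/%s", dir, user);
    if (n < 0 || (size_t)n >= sizeof(path))
        return -1;

    /* Step 2: read one line from the file. */
    fp = fopen(path, "r");
    if (fp == NULL)
        return -1;
    if (fgets(tag, (int)taglen, fp) == NULL) {
        fclose(fp);
        return -1;
    }
    fclose(fp);

    /* Strip the trailing newline (and any carriage return). */
    tag[strcspn(tag, "\r\n")] = '\0';
    return 0;
}
```

A caller would treat a -1 return as "no tag for this user" and let the 
request fall through to whatever the "users" file does without the extra 
attribute.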

We only have about 10 different values in these files: things like 
"voip-customer", "payment-overdue", "gold-customer", "exceeded-download-limit", 
etc.  The value is used to select a DEFAULT entry in the "users" file that 
builds the reply attributes needed to configure the customer's service.

This adds only marginal overhead, so performance is barely different from a 
vanilla "files" module.  The cost is one i-node per customer and a few hundred 
lines of C code. We are more than happy with that cost.

Outside calls to FreeRADIUS code, the module pretty much just calls "fopen", 
"fgets" and "fclose". So it's dreadfully simple and doesn't have any concerns 
with thread safety, locking, race conditions, etc.
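
The update side is not described in this mail, so the following is only an 
assumption about how an external provisioning process could change one 
customer's value without the reader ever seeing a half-written line: write the 
new value to a temporary file in the same directory, then rename() it over the 
old one (rename() replaces the target atomically on POSIX filesystems). The 
name "write_user_tag" and the ".<user>.tmp" naming are hypothetical, and the 
sketch assumes a single writer per user.

```c
#include <stdio.h>

/*
 * Hypothetical provisioning-side helper (an assumption, not part of the
 * module): replace "<dir>/<user>" with a file holding the one-line tag.
 * Returns 0 on success, -1 on any failure.
 */
static int write_user_tag(const char *dir, const char *user, const char *tag)
{
    char path[1024], tmp[1040];
    FILE *fp;
    int n;

    n = snprintf(path, sizeof(path), "%s/%s", dir, user);
    if (n < 0 || (size_t)n >= sizeof(path))
        return -1;

    /* Temporary file in the same directory, so rename() stays on one filesystem. */
    n = snprintf(tmp, sizeof(tmp), "%s/.%s.tmp", dir, user);
    if (n < 0 || (size_t)n >= sizeof(tmp))
        return -1;

    fp = fopen(tmp, "w");
    if (fp == NULL)
        return -1;
    if (fprintf(fp, "%s\n", tag) < 0) {
        fclose(fp);
        remove(tmp);
        return -1;
    }
    if (fclose(fp) != 0) {
        remove(tmp);
        return -1;
    }

    /* Swap the new file into place; a reader sees the old or the new file, never a mix. */
    if (rename(tmp, path) != 0) {
        remove(tmp);
        return -1;
    }
    return 0;
}
```

With this approach a reader that already has the old file open simply finishes 
reading the old value, which is why no locking is needed on the read path.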

> 
> > With "rlm_sql" it would take an hour or two, and only then with careful
> > (and human driven) rate management.
> 
>   I'm not sure what that means.  An hour or two to load SQL?  What is it
> doing?
> 

This happens when we have a major network event that causes lots of devices to 
simultaneously request authentication. Due to the unpredictable loss of 
threads, we have to manually manage the rate of the incoming authentications by 
slowly starting small sections of the network at a time.

This process takes us hours of careful (manual) rate management.


> > The main issues driving this delay were:
> > - "rlm_sql" calls during EAP negotiation instead of just at the end of
> > EAP
> 
>   That can be fixed without a new module.
> 

Possibly, but we couldn't find a way. We would be keen to understand the fix 
for this.


> > - Performance issues on our MySQL backend that we didn't have budget to
> > resolve
> > - Thread lock-ups inside MySQL library yet no MySQL server queries were
> > active
> 
>   I've seen lots of people running MySQL with 300K+ users, and no
> problems.  The system needs to be designed carefully, but it *does* work.
> 

We had no problem during normal operation.  Problems only appeared when large 
numbers of devices (typically 10K or more) needed to re-join the network 
simultaneously for some reason.

Do you know if these other sites have those kinds of events?

> 
> It really sounds like your *architecture* is wrong.  Find that and fix it.

I don't agree. We are not simply hitting a performance limit. That did happen, 
but it was resolved by using:
- proxy FreeRADIUS instances to do hash-based load balancing
- separate auth and acct servers
- MySQL index, query & deployment tuning

The performance achieved was acceptable (but nowhere near "files").

However, the stability issue would never go away. To me it smells of a race 
condition somewhere in the MySQL library. As we could only ever reproduce it by 
cycling 10K or more users, it was proving very difficult to debug.

> Writing a new module should *not* be necessary.
> 

Possibly agree.  Finding and fixing the bug that caused threads to disappear 
would probably have been better.

But we spent far less time coding & testing a few hundred lines of C code than 
on all the effort over the previous 18 months trying to reproduce, isolate or 
work around the MySQL problem.  We gave up.

A nice bonus is that we can now head towards a single server configuration with 
a file-system database. This will allow us to retire a raft of servers doing 
proxying, multiple radiusd, and multiple MySQL instances.

Cheers,

Claude.

