Claude Brown wrote:
> My original reply was confusingly brief. I've clarified below, and I've also
> put the module we wrote into github in case it helps:
>
> https://github.com/claudebrown/freeradius-server/compare/master...rlm_tagfiles
OK. It's... odd.
> We avoided both "fastfile" and reloading "files" on the fly because of the
> number of updates we have to our user setup. The rate of change to our
> customers would require a reload every few seconds during most of the day.
I'd normally just put users into SQL.
> We had concerns in two areas:
> - The time to re-write the config and then re-load so frequently. This may
> become a performance problem as our user base grows out to 250K
> - The risk of using the reload mechanism in a way that didn't seem consistent
> with its design intent, or the likely usage pattern of reloads every day or
> every few hours.
OK. Reloads don't work for you.
> FreeRADIUS core is very stable. But MySQL adds instability we have been
> unable to identify or reproduce in our environment.
That's odd. While MySQL isn't perfect, I have successfully used it in
systems with hundreds of transactions/s. There was a VoIP provider ~8
years ago using it with ~1K authentications/s.
> When large parts of our WiMAX network are restarted due to maintenance or
> failure the customer devices re-join the network. Whilst this doesn't happen
> often, when it does happen as many as 50K devices will
> simultaneously ask to rejoin the network. We need to service this sudden and
> dramatic backlog as quickly as possible.
Yup.
> With the "files" module this is a breeze with a single server. It just eats
> it up and everything comes back in a few minutes. Importantly, our testing
> shows the design goal of 250K users would also be met with one server.
>
> But with "rlm_sql" and MySQL we could not do it. The radiusd would start
> slowly grinding to a halt roughly as we reached 200 auths per sec (with EAP,
> this is about 30 devices per sec). The radiusd log reported "Unresponsive
> child" in a MySQL module and gradually all the database concurrency would
> disappear as those threads were lost for further work.
MySQL does have concurrency issues. But if you split it into
auth/acct, most of those go away. i.e. use one SQL module for
authentication queries. Use a *different* one for accounting inserts.
If you also use the decoupled-accounting method (see
raddb/sites-available), MySQL gets even faster. Having only one process
doing inserts can speed up MySQL by 3-4x.
> With our new far simpler approach, all of this has gone away because we are
> now using the "files" module and "users" file directly. The speed of
> authentication is essentially as per that module.
OK.
> The value of the extra attribute is in essence obtained like this:
> 1. Format a filename such as "/blah/%{Username}"
> 2. Read a line from this file
Using a database WILL be faster than reading the file system.
> We only have about 10 different values in these files: things like
> "voip-customer", "payment-overdue", "gold-customer",
> "exceeded-download-limit", etc. The value is used to select a DEFAULT entry
> in the "users" file that builds the reply attributes needed to configure the
> customer's service.
You can do the same kind of thing with SQL. Simply create a table,
and do:
	update request {
		My-Magic-Attr = "%{sql: SELECT .. from ..}"
	}
Have the table contain the mapping of User-Name --> "voip-customer".
You should be able to get very high performance. Then, use that
attribute to do the mappings in the "users" file, just like you do today.
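
As a rough sketch of that mapping table (this is not FreeRADIUS code), the
Python below uses an in-memory SQLite database to stand in for the real
server. The table name user_tag, its columns, and the helpers lookup_tag and
set_tag are all made up for illustration. The point is only that the lookup
is a single primary-key SELECT, and a change to a customer's status is a
single-row write instead of a reload.

```python
import sqlite3

# Hypothetical schema: one row per user, mapping User-Name to a service tag.
# SQLite in memory is an assumption made so the sketch runs anywhere.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE user_tag ("
    " username TEXT PRIMARY KEY,"
    " tag TEXT NOT NULL)"
)
conn.executemany(
    "INSERT INTO user_tag (username, tag) VALUES (?, ?)",
    [("alice", "voip-customer"), ("bob", "payment-overdue")],
)
conn.commit()


def lookup_tag(conn, username, default="unknown"):
    """Return the service tag for username, or default if no row exists."""
    row = conn.execute(
        "SELECT tag FROM user_tag WHERE username = ?", (username,)
    ).fetchone()
    return row[0] if row else default


def set_tag(conn, username, tag):
    """Create or overwrite one user's tag (INSERT OR REPLACE is SQLite syntax)."""
    conn.execute(
        "INSERT OR REPLACE INTO user_tag (username, tag) VALUES (?, ?)",
        (username, tag),
    )
    conn.commit()


if __name__ == "__main__":
    print(lookup_tag(conn, "alice"))
    set_tag(conn, "bob", "gold-customer")
    print(lookup_tag(conn, "bob"))
```

The value returned by the lookup plays the role of My-Magic-Attr above: it
would select the matching DEFAULT entry in the "users" file.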
> This happens when we have a major network event that causes lots of devices
> to simultaneously request authentication. Due to the unpredictable loss of
> threads, we have to manually manage the rate of the incoming authentications
> by slowly starting small sections of the network at a time.
>
> This process takes us hours of careful (manual) rate management.
That's just weird. SQL should be fine, *if* you design the system
carefully. That's the key.
> Possibly, but we couldn't find a way. We would be keen to understand the fix
> for this.
See above.
> We had no problem during normal operation. It was only when large numbers of
> devices (typically 10K or more) simultaneously needed to re-join the network
> for some reason.
>
> Do you know if these other sites have those kinds of events?
*Everyone* has this happen. There's really no need for a new module.
> However, the stability issue would never go away. To me it smells of a race
> condition somewhere in the MySQL library. As we could only ever reproduce it
> by cycling 10K or more users, it was proving very difficult to debug.
It's not a race condition, it's lock contention.
> But we spent far less time coding & testing a few hundred lines of "C" code
> than all the effort over the previous 18 months trying to reproduce, isolate
> or work around the MySQL problem. We gave up.
>
> A nice bonus is that we can now head towards a single server configuration
> with a file-system database. This will allow us to retire a raft of servers
> doing proxying, multiple radiusd, and multiple MySQL instances.
If it works for you...
But it's really just a re-implementation of a simple SQL table. It's
a solution which is specific to your environment.
The more generic solution is:
- custom tables
- split auth/acct
- decouple acct from the "live" server
You should be able to get very high performance with that. The
benefit is you'll be using real databases, which is usually a good idea.
Alan DeKok.