On Jul 11, 2009, at 2:14 AM, Alan DeKok wrote:

Philip Molter wrote:
Yes, this is the configuration I'm currently running, and it's not
working for me. I have a radclient sending a request, retrying 10 times
on a 5-second timer, and after 10 retries, it still hasn't gotten a
response. After the second retry, the proxy has marked the server as at
least a zombie and started status-checks, but every retransmit after
that is getting a cached result of no response.
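
For reference, the retry behaviour I describe can be reproduced with a radclient invocation along these lines (host, secret, and credentials are placeholders):

```
# 10 retries on a 5-second timer against the proxy
echo "User-Name = test, User-Password = test" | \
    radclient -x -r 10 -t 5 proxy.example.com:1812 auth testing123
```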

 Could you possibly try READING my messages?

 The default configuration does NOT include the "do_not_respond"
policy. *YOU* are the one who configured that, as I have said multiple
times.

 If you don't want it to get the cached "do not respond" policy, then

        DON'T CONFIGURE IT

 It's that easy.
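
(For anyone following along: the "do_not_respond" policy being discussed is the one shipped in policy.conf, wired into the virtual server roughly like this in the 2.x layout -- a sketch, not my exact config:)

```
# policy.conf (shipped with the server)
do_not_respond {
        update control {
                Response-Packet-Type := Do-Not-Respond
        }
        handled
}

# sites-enabled/default -- run when a proxied request gets no reply
Post-Proxy-Type Fail {
        do_not_respond
}
```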

I do not want to get ANY cached response. I do not want to get any Access-Reject. If I do not configure a 'do not respond' response, I get an Access-Reject, which is even worse because my end-client gets an error when he should not. What I want is for a no-response from a home server to be treated as a no-response to the NAS, and the subsequent retransmit from the NAS to be processed as a retransmit to a different home server.

This is what I want to happen

client req ->  proxy
              proxy req ->  home server #1
client ret ->  proxy
              proxy ret ->  home server #1
             [proxy fails home server #1 for lack of response]
client ret ->  proxy
              proxy req ->  home server #2
              proxy <- resp home server #2
client <- resp proxy

 It does that (mostly).  But only if you don't break the server.

No, it does not do that at all. I have yet to see a retransmit from a client actually get tried on a different server than the one used for the original request. Once the proxy fails to receive a response from the originally chosen home server, it handles the packet as a failure. If it sends back an Access-Reject, the request is rejected by the NAS to the client and the NAS stops retrying THAT REQUEST (i.e. the end-client gets an error). If I add a configuration to not send back anything, then the NAS will retransmit, but as you have made abundantly clear, the proxy remembers that it sent back no response to the original request and skips all further processing of the retransmit.

I have set response_windows and zombie_periods to minimums. I have set response_windows and zombie_periods to maximums. For a given single request, only one home server is tried, and if that home server is down, the request and any other retransmits of that request will not succeed. Yes, if the NAS sends another separate request with a different ID, it will be proxied to a different home server, but that does not help the poor guy who had the hard luck of his request hitting the bad home server. He will get an error message. He will have to retry or call support or whatever.

This is what happens without a post-proxy config:

client req ->  proxy
              proxy req -> home server #1
client ret ->  proxy
              proxy ret -> home server #1
             [proxy fails home server #1 for lack of response]
client  <- rej proxy

 That happens for the most part because you played with the
configuration to make the proxy timeouts super-short. As I said, don't
do that.

It does not matter whether the timeouts are short or long. This always happens. See my note above.

In fact, no matter what I set the timeouts to, it always seems to fail the server and reject the request after the first retransmit to the proxy (2 packets, about 10 seconds, regardless of the response_window or zombie_period settings). Yes, a subsequent, different request will go to a different home server, but, again, I want to use the proxy to provide smarter resiliency across a pool of servers. If you know of settings for response_window and zombie_period that I can use that will provide the behavior in my "this is what I want to happen" example, could you provide them please? Because all of the settings I use seem to result in the same behavior.
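
For concreteness, here is roughly the shape of what I have been experimenting with in proxy.conf (addresses and secrets are placeholders, and hs2 is defined like hs1):

```
home_server hs1 {
        type = auth
        ipaddr = 192.0.2.10           # placeholder
        port = 1812
        secret = testing123           # placeholder
        response_window = 5           # no reply in 5s -> server goes "zombie"
        zombie_period = 40            # zombie for 40s without a reply -> "dead"
        status_check = status-server  # probe zombies with Status-Server
}

home_server_pool failover_pool {
        type = fail-over              # try servers in order, not round-robin
        home_server = hs1
        home_server = hs2
}

realm example.com {
        auth_pool = failover_pool
}
```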

Okay, so I obviously do not understand how I can tweak response_window
and zombie_period to make sure that requests that can be serviced by
many possible RADIUS home servers do not return an Access-Reject when
one of those home servers does not respond.

 i.e. you want NO request to fail processing when a home server fails.

 This is extremely difficult to do.  Any naive approach that has quick
failover can have other negative side-effects.  (Additional network
traffic, system load, duplicate processing of requests, etc.)

I guess I do not see those as negatives. That is exactly what I want to happen. RADIUS network traffic is tiny. The system load created by sending multiple requests to a home server, or to a bunch of home servers, is minimal. I do not see how this adds any more load than the alternative, where the proxy sends back an Access-Reject which, in the best case, causes the end-client to re-authenticate and generate yet another request. In the worst case, the client accepts the reject as confirmation that the account cannot be authorized and presents the wrong result to the end-user (whether that is a guy sitting on the end of a dial-up line or a piece of system software trying to determine whether an account is valid). All you are doing is pushing the retry logic from a machine that knows there are multiple possible home servers to a machine that does not, via a response that says, effectively, "Do not retry. Your request is invalid."

Your argument that the RADIUS server cannot handle a retry does not hold water for me, but regardless, I can envision configurations where you would want to minimize all processing by the RADIUS proxy itself (most machines now have far more processing power than a simple RADIUS proxy can consume, so that is not a common need anymore). I wish the option were available. There seem to be knobs for a lot of other things.

The client sends a request to the proxy.  If a home server does not
respond within a short period of time to the request, a second home
server is chosen.  If the second home server does not respond to the
same request, then a third is chosen. This continues until all possible home servers are exhausted. At that point, an Access-Reject packet is sent back to the client. Otherwise, the response from the home server
is sent back to the client.

Doing that requires source code mods, because that quick fail-over can
have negative side-effects.  i.e. The server does NOT support
configurations that can negatively affect its performance.

See my note above for why the work to be done by the server is no more and no less than just returning a reject once the timeout is hit. You are either going to be processing more retries to the home server or more retries from the NAS. Either way, you are going to increase your load.

 On top of that, the "try all possible home servers" is impossible.
There is ALSO a 30 second lifetime for the request.  After 30 seconds,
the NAS has given up, so failing over to another home server is useless.

 On top of that, the NAS will only retry 3-6 times.  So if you have 19
home servers, at *best* it would fail over to 3-6 of them, before the
request is marked "timeout".

Okay, AT BEST you get 3-6 different home servers in a 30-second period. Right now, AT BEST I get 1. Which method is more resilient? Which method results in no false rejections being returned to the NAS? The worst that can happen is that the NAS gets no response, which is exactly what would happen if the NAS queried that one home server directly. The proxy can even be smart about it and only retry a different home server when the NAS retransmits (which I believe it already does), so if the NAS stops retransmitting because it has given up, so does the proxy. But please, let the NAS give up first.

The proxy does not know how many times the NAS will retry. I have my NASes configured to retry for up to 60 seconds, once every 2 seconds. They will retry 30 times. It is more important to me that authentication requests succeed, even if they succeed slowly. It sounds to me like freeradius is making assumptions about how NASes should work, and as a result reducing the flexibility it provides.

 I sincerely hope you see now that the situation is rather more
complicated than the simple "try all home servers" statement.

How do I configure that?  It doesn't seem to matter what I set
response_window or zombie_period to, once the first home server fails to
respond, an Access-Reject (or nothing if I configure a post-proxy
handler) is returned to the client. My client's not going to retry the
request if he gets an Access-Reject, so I need the proxy to retry it.

That last sentence is nonsense. Once the client gets an Access-Reject for *any* reason, it is impossible for the proxy to "retry" that request.

*sigh* Exactly. Once the client gets an Access-Reject, the NAS has told the client that the request is invalid. An end-user querying the NAS gets an error message. A piece of system software querying the NAS gets notified that the account is not valid. The implication is that a retry is futile, even though the account is not actually invalid. The account is perfectly valid. The proxy just gave up too soon (and by too soon, I mean "before it tried more than one of its home servers"). I want the proxy to retry the request to a different home server precisely to prevent the NAS (and thus the client) from getting an Access-Reject when it does not have to. This is typically how load-balancers with failover capability work. They try their best to make sure individual requests succeed when they can.

If you want the proxy to fail over, send it more than ONE request at a
time (like a normal proxying system), and do NOT configure the "do not
respond" policy.

So my NAS now has to send two separate requests for the same authentication, and pick the one that does not come back with an Access-Reject? Which NAS does that? Or are you saying that my end-client has to not accept the fact that he was rejected and keep retrying until he either a) gets an accept or b) gets rejected so many times he accepts it as gospel? Either way, it makes no sense. Either way, the proxy is creating a retry loop.

Again, I am not arguing that the proxy will not fail over. It will for subsequent requests. What a fail-over solution will typically do, though, is fail over even for a given single request, so that all requests are handled as resiliently as possible. In other words, a NAS does not need to see a single failed request from the proxy for the proxy to trigger a failover.

 The proxy WILL fail over, but due to the imperfect nature of the
universe, some requests MAY time out and get rejected.  With a better
detection algorithm, the number of failures might get smaller than it is
today, but it is IMPOSSIBLE to get the number down to zero.

To a NAS, there is a big difference between a timeout and a reject. If it does not get a response, a NAS will typically handle the client differently than if it gets an explicit rejection. Right now, a timeout event from the home server results in an explicit rejection (unless I configure it not to send that reject). It IS possible to get the number down to zero, because I have used RADIUS software that does it. The only time it should ever be non-zero is if all home servers that can possibly be tried in a given window (which might not be all of them, but is most likely going to be more than one of them) fail to respond. Like I said, I am trying to migrate to freeradius for some other features. I have used two other proprietary RADIUS server software packages that implement this behavior.

 No.  RADIUS doesn't work like that.  No amount of magic on the proxy
will cause the NAS to retry forever (which is the only way to have the
proxy cycle through all home servers for one request). If you configure the NAS to retry forever, then all you will do is push network failures
off to some other part of the network.

Right. Precisely. I want to push the network failure handling to the proxy, which has the knowledge that there are multiple points of failure. The NAS does not know that there are 20 possible servers to respond to it. All it knows is that there is 1 RADIUS server it can talk to (the proxy), and if the proxy says the request was rejected, the request is considered rejected. The end-client certainly does not know what can fail. The proxy knows that there are 20 servers. When it decides to fail one server out, it KNOWS a) that the proxied request was not rejected, it just was not responded to by the home server, and b) that it can try that request on another home server before telling the NAS that the request is rejected. Of course, the request has not actually been rejected, since no home server has responded one way or the other yet, and until the proxy responds, the NAS will not know either way.

I also understand that Access-Challenge can complicate the proxying, but that is solvable as well with standard state tracking.

 This is how IP connectivity works: Networks are imperfect.  There is
absolutely nothing you can do about that.

I know that networks are imperfect. The answer to that imperfection is to retry, not to give up. When you tell a NAS that the request has been rejected when, in fact, it has not, you are not effectively retrying. You are saying, "Do not retry. You actually got this failed result."

But look, I have gone through the code. Ivan's right that there is no way to get the behavior I want in freeradius without either a module (I am not sure this is even possible to accomplish via a module, because proxying is not handled via a module) or by hacking the code to change how proxy no-responses are handled. It just frustrates me that you challenge the value of this. For people like me who use freeradius not to serve dial gear but as a robust authentication platform for on-network services, where sending a false rejection to a client is an SLA issue, having a proxy that can robustly and transparently handle transient network failures is very valuable. With that, we do not have to reprogram or replace NAS software (some of which we cannot control) to handle those kinds of transient network failures for us.

Philip
-
List info/subscribe/unsubscribe? See http://www.freeradius.org/list/users.html
