Re: FreeRADIUS crashing on Solaris 8
[EMAIL PROTECTED] (Rainer Clasen) wrote:
> Even with the changes from the radiusd.c you sent me, this goto is still triggered.

Sigh. I think it's the threading problems. The server still uses a few functions which aren't thread-safe, and they should be made thread-safe, e.g. gmtime(), etc.

I've made a number of changes to the code in CVS. It should be *more* thread-safe than it is right now. I don't know if it's perfect yet. I'll have to audit the rest of the code, and that can take a while.

Alan DeKok.

- List info/subscribe/unsubscribe? See http://www.freeradius.org/list/users.html
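To illustrate the kind of change being discussed (this is not the actual CVS commit), here is a minimal sketch of swapping gmtime() for the POSIX gmtime_r(). gmtime() returns a pointer to static storage shared by every thread, while gmtime_r() fills in a struct tm supplied by the caller. The helper name format_utc and its return convention are assumptions made for the example.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L   /* ask for gmtime_r() on strict compilers */
#endif

#include <stddef.h>
#include <time.h>

/*
 * Hypothetical helper, not FreeRADIUS code: format a timestamp as UTC.
 * The struct tm lives on the caller's stack, so two threads calling
 * this at the same time never share a buffer.
 * Returns 0 on success, -1 on failure.
 */
int format_utc(time_t when, char *buf, size_t buflen)
{
    struct tm tm_buf;

    if (gmtime_r(&when, &tm_buf) == NULL)
        return -1;

    /* strftime() returns 0 when the result does not fit in buf. */
    if (strftime(buf, buflen, "%Y-%m-%d %H:%M:%S", &tm_buf) == 0)
        return -1;

    return 0;
}
```

A caller would pass its own buffer, for example a 32-byte char array, and check the return code before using the string.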
Re: FreeRADIUS crashing on Solaris 8
While non-optimal, would a mutex lock around non-thread-safe functions be a viable workaround? It at least allowed a program I've written to function safely ..

Penned by Alan DeKok on Mon, Feb 25, 2002 at 05:44:48PM -0500, we have:
| [EMAIL PROTECTED] (Rainer Clasen) wrote:
| > Even with the changes from the radiusd.c you sent me, this goto is still
| > triggered.
|
| sigh I think it's the threading problems. The server still uses
| a few functions which aren't thread-safe, and they should be made
| thread safe.
|
| e.g. gmtime(), etc.
|
| I've made a number of changes to the code in CVS. It should be
| *more* thread-safe than what it is right now. I don't know if it's
| perfect yet. I'll have to audit the rest of the code, and that can
| take a while.
|
| Alan DeKok.

--
Todd Fries .. [EMAIL PROTECTED]
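The workaround suggested here could look roughly like the following sketch, assuming POSIX threads. The wrapper name locked_gmtime and the lock name are hypothetical; nothing in the thread says FreeRADIUS adopted this approach.

```c
#include <pthread.h>
#include <time.h>

/* One process-wide lock guarding every call to the non-reentrant gmtime(). */
static pthread_mutex_t gmtime_lock = PTHREAD_MUTEX_INITIALIZER;

/*
 * Hypothetical wrapper: call gmtime() while holding the lock and copy
 * the result out of the shared static buffer before releasing it.
 * Returns 0 on success, -1 if gmtime() fails.
 */
int locked_gmtime(time_t when, struct tm *out)
{
    struct tm *shared;
    int rcode = -1;

    pthread_mutex_lock(&gmtime_lock);
    shared = gmtime(&when);
    if (shared != NULL) {
        *out = *shared;   /* copy while the lock is still held */
        rcode = 0;
    }
    pthread_mutex_unlock(&gmtime_lock);

    return rcode;
}
```

The lock only helps if every caller in the process goes through the wrapper, including callers of related functions such as localtime() that may share the same static buffer. Auditing for that is the work the next reply compares with switching to the thread-safe functions.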
Re: FreeRADIUS crashing on Solaris 8
Todd T. Fries [EMAIL PROTECTED] wrote:
> While non-optimal, would a mutex lock around non threadsafe functions be a viable workaround? It at least allowed a program I've written to function safely ..

That's about as much work as fixing the code to use the thread-safe functions instead of the non-thread-safe functions.

Hmm... one approach to fixing the problem would be to edit all of the modules to set the 'non thread safe' flag. If that makes the problem less significant on Solaris, then it's definitely the thread code causing the problems.

Alan DeKok.
Re: FreeRADIUS crashing on Solaris 8
Alan DeKok wrote:
> Hmm... I would suggest going to src/main/radiusd.c, function refresh_request(). Look for: [...] and add:
>
>     if (request->reply && (request->reply->code != 0)) goto setup_timeout;

I've added this to the version already running with your fix from yesterday. Now I'm waiting for results. I'll keep you informed.

I suppose I'll find some time on the weekend to migrate my logging patches to the latest CVS.

Rainer
--
KeyID=759975BD fingerprint=887A 4BE3 6AB7 EE3C 4AE0 B0E1 0556 E25A 7599 75BD
Re: FreeRADIUS crashing on Solaris 8
[EMAIL PROTECTED] (Rainer Clasen) wrote:
> I've taken the changes from CVS and applied them to my patched version (only logging enhancements). proxy.c:1.52-1.53, radiusd.c:1.238-1.239. I got the following backtrace:
> ...
> (gdb) p *request
> $1 = {magic = 16909060, packet = 0x0, proxy = 0x0, reply = 0x0,

And that's the magic number saying that src/main/util.c, function request_free(), was called on the request. That is, SOMETHING deleted the request while it was still alive. That can ONLY result from a call to request_free() or rl_delete().

If you add log messages to log the request number before any call to request_free() or rl_delete() (only in radiusd.c), then you can at least tell *which* call resulted in it deleting an active request. That will help a lot in tracking down the problem.

Another suggestion is to start the server, and then add an entry in 'radiusd.conf', which is 'debug_level = 2'. Send a HUP signal to the server, and you will get all of the debugging logs going to the log file, and it will still run in threaded mode. There will be a *lot* of log messages, though.

> Again, the proxysecret belongs to a server marked dead immediately before the crash.

That shouldn't be a problem...

> But the secret belongs to a NAS usually not used by users of this realm. There was no matching entry (realm + NAS) in the logfile.

That's a problem.

> In all cases it was due to non-auth requests.

That's another piece of the puzzle, which is important. Accounting requests are deleted immediately after a response is sent to the NAS, as there can be no duplicate accounting requests. It may be deleting the request too soon...

Another suggestion is to go to radiusd.c, around line 2377. It says "Cleaning up request %d ID %d with timestamp %08lx". Change the 'if' condition before that so it does NOT check for PW_ACCOUNTING_REQUEST. If that makes the difference, then it's narrowed down.

I'll take another look at the code to see what the heck is going on.

Alan DeKok.
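The two debugging ideas in this message, a magic value stamped on release and a log line before every release, can be sketched as follows. The value 16909060 from the gdb output is 0x01020304 in hexadecimal. Everything else here (the demo_ names, the "live" marker value, and keeping the memory around after release) is an assumption made for the example and is not the code in src/main/util.c.

```c
#include <stdio.h>
#include <stdlib.h>

#define DEMO_MAGIC_LIVE  0xdeadbeefu  /* hypothetical "in use" marker      */
#define DEMO_MAGIC_FREED 0x01020304u  /* 16909060, the value gdb reported  */

typedef struct demo_request {
    unsigned int magic;
    int number;
} demo_request;

/* Allocate a request and stamp it as live. Returns NULL on failure. */
demo_request *demo_request_alloc(int number)
{
    demo_request *r = malloc(sizeof(*r));

    if (r == NULL)
        return NULL;
    r->magic = DEMO_MAGIC_LIVE;
    r->number = number;
    return r;
}

/* Returns 1 while the request has not been released, 0 afterwards. */
int demo_request_is_live(const demo_request *r)
{
    return r->magic == DEMO_MAGIC_LIVE;
}

/*
 * Log which call site releases which request, then stamp the request.
 * The memory is deliberately not passed to free() here, so the stamp
 * can be inspected afterwards without touching freed memory.
 */
void demo_request_release(demo_request *r, const char *caller)
{
    fprintf(stderr, "%s: releasing request %d\n", caller, r->number);
    r->magic = DEMO_MAGIC_FREED;
}
```

With one log line per call site, the last "releasing request N" entry before a crash on request N identifies the call that deleted a request which was still in use.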
Re: FreeRADIUS crashing on Solaris 8
Hello first of all!

Rainer Clasen wrote:
> Alan DeKok wrote:
> > Hmm... I would suggest going to src/main/radiusd.c, function refresh_request(). Look for: [...] and add:
> >
> >     if (request->reply && (request->reply->code != 0)) goto setup_timeout;
>
> I've added this to the version already running with your fix from yesterday. Now I'm waiting for results. I'll keep you informed.

OK, the new goto is triggered quite often. But the daemon still dies.

Rainer
--
KeyID=759975BD fingerprint=887A 4BE3 6AB7 EE3C 4AE0 B0E1 0556 E25A 7599 75BD
Re: FreeRADIUS crashing on Solaris 8
Hello first of all!

Rainer Clasen wrote:
> Rainer Clasen wrote:
> > Alan DeKok wrote:
> > > Hmm... I would suggest going to src/main/radiusd.c, function refresh_request(). Look for: [...] and add:
> > >
> > >     if (request->reply && (request->reply->code != 0)) goto setup_timeout;
> >
> > I've added this to the version already running with your fix from yesterday. Now I'm waiting for results. I'll keep you informed.
>
> OK, the new goto is triggered quite often. But the daemon still dies.

Even with the changes from the radiusd.c you sent me, this goto is still triggered.

Rainer
--
KeyID=759975BD fingerprint=887A 4BE3 6AB7 EE3C 4AE0 B0E1 0556 E25A 7599 75BD
Re: FreeRADIUS crashing on Solaris 8
[EMAIL PROTECTED] (Rainer Clasen) wrote:
> Mitchell, Michael wrote:
> > I'm currently testing freeradius-snapshot-20020114 (configured as a proxy only) on Solaris 8 and running into a problem.
>
> I'm seeing similar crashes with my patched 2002-02-11 version on Solaris 7.

Hmm... I'm starting to think that this may *not* be a thread problem.

> simul_count = 0, simul_mpp = 0, finished = 1, options = 0

The request has been marked 'finished', so NOTHING should be using any entries of the request. There should be NO threads or anything else which is working with the request.

> Hmm, why is request->proxy == NULL?

Because the request is finished, and many fields can be cleaned up.

> It is malloced in proxy_send() and never reset. rad_send(), which is called a few lines before the crash, doesn't seem to modify it either (well, actually it can't, as it doesn't get a pointer to the pointer)

Once an active request has been marked finished, then any logic about what's supposed to happen goes out the window.

The only thing I can think to do is to mark up the request as alive, while a thread is processing it. Then, add assertions that request->finished isn't marked '1', while the thread is alive.

> I've seen the suggestion of running radiusd with -s. What side effects do I have to expect?

It will probably be slower. But it shouldn't be too serious in the short term.

Alan DeKok.
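The "mark it alive and assert" idea could be sketched like this. The struct, the in_use field, and the demo_ function names are hypothetical; only the 'finished' flag appears in the gdb output quoted in the thread.

```c
#include <assert.h>

typedef struct demo_request {
    int finished;   /* 1 once the request is done, as in the gdb dump  */
    int in_use;     /* hypothetical: 1 while a worker thread owns it   */
} demo_request;

/* A worker thread takes ownership; a finished request must never be picked up. */
void demo_request_begin(demo_request *r)
{
    assert(!r->finished);
    r->in_use = 1;
}

/* Sprinkle this through the processing path to catch a premature 'finished'. */
void demo_request_check(const demo_request *r)
{
    assert(r->in_use);
    assert(!r->finished);
}

/* The owning thread is done; only now may the request be marked finished. */
void demo_request_end(demo_request *r)
{
    assert(r->in_use);
    r->in_use = 0;
    r->finished = 1;
}
```

In a threaded server, reads and writes of these flags from different threads would need their own synchronization. The sketch only shows where the assertions would sit, so that a crash report points at the first place the invariant broke rather than at a later NULL dereference.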
Re: FreeRADIUS crashing on Solaris 8
Mitchell, Michael wrote:
> I'm currently testing freeradius-snapshot-20020114 (configured as a proxy only) on Solaris 8 and running into a problem.

I'm seeing similar crashes with my patched 2002-02-11 version on Solaris 7.

#0 0x19510 in proxy_send (request=0x1b13250) at proxy.c:398
398 request->proxy->timestamp = request->timestamp - (delaypair ? delaypair->lvalue : 0);
(gdb) p delaypair
$1 = (VALUE_PAIR *) 0x3c7244b9
(gdb) p *request
$2 = {magic = 16909060, packet = 0x0, proxy = 0x0, reply = 0x0, proxy_reply = 0x0, config_items = 0x0, username = 0x2047108, password = 0x0, secret = "x", '\000' <repeats 22 times>, child_pid = 0, timestamp = 1014121657, number = 61812, proxysecret = "xxx\000\000x\000x", proxy_is_replicate = 0, proxy_try_count = 7, proxy_next_try = 1014121662, simul_max = 0, simul_count = 0, simul_mpp = 0, finished = 1, options = 0, container = 0x41b20}
(gdb) bt
#0 0x19510 in proxy_send (request=0x1b13250) at proxy.c:398
#1 0x15a24 in rad_respond (request=0x1b13250, fun=0x17838 <rad_accounting>) at radiusd.c:1521
#2 0x1fe60 in request_handler_thread (arg=0xa7bc0) at threads.c:169

Hmm, why is request->proxy == NULL? It is malloced in proxy_send() and never reset. rad_send(), which is called a few lines before the crash, doesn't seem to modify it either (well, actually it can't, as it doesn't get a pointer to the pointer).

#0 0x19530 in proxy_send (request=0xa00218) at proxy.c:403
403 if ( mainconfig.log_proxy_nonauth || (request->packet->code == PW_AUTHENTICATION_REQUEST)) {
(gdb) bt
#0 0x19530 in proxy_send (request=0xa00218) at proxy.c:403
#1 0x15a24 in rad_respond (request=0xa00218, fun=0x17838 <rad_accounting>) at radiusd.c:1521
#2 0x1fe60 in request_handler_thread (arg=0x4b8130) at threads.c:169
(gdb) p mainconfig
$1 = {log_auth = 1, log_auth_badpass = 1, log_auth_goodpass = 1, do_usercollide = 0, log_proxy = 1, log_proxy_retransmit = 1, log_proxy_nonauth = 0, do_lower_user = 0x9d208 "no", do_lower_pass = 0x9d218 "no", do_nospace_user = 0x9d228 "no", do_nospace_pass = 0x9d238 "no", nospace_time = 0x0}
(gdb) p *request
$2 = {magic = 0, packet = 0x0, proxy = 0x129030, reply = 0x0, proxy_reply = 0x9fec48, config_items = 0x0, username = 0x0, password = 0x0, secret = "xxx", '\000' <repeats 24 times>, child_pid = 0, timestamp = 1014136041, number = 25277, proxysecret = "xxx\000xxx\000\000\000\000\000", proxy_is_replicate = 0, proxy_try_count = 7, proxy_next_try = 1014136046, simul_max = 0, simul_count = 0, simul_mpp = 0, finished = 1, options = 10486288, container = 0x41790}

OK, this happens in code I've added. But shouldn't request->packet always be non-NULL? And it was already used successfully in proxy_send a few lines before.

I've seen the suggestion of running radiusd with -s. What side effects do I have to expect?

Rainer
--
KeyID=759975BD fingerprint=887A 4BE3 6AB7 EE3C 4AE0 B0E1 0556 E25A 7599 75BD
Re: FreeRADIUS crashing on Solaris 8
Rainer Clasen wrote:
> In both cases the only server for a realm was marked dead immediately before the crash.

I forgot to mention: this server has the secret which is in request->proxysecret when the daemon dies. So it seems to be a packet related to the same server which was marked dead.

Rainer
--
KeyID=759975BD fingerprint=887A 4BE3 6AB7 EE3C 4AE0 B0E1 0556 E25A 7599 75BD
Re: FreeRADIUS crashing on Solaris 8
[EMAIL PROTECTED] (Rainer Clasen) wrote:
> In both cases the only server for a realm was marked dead immediately before the crash.

On further examination, the code in rad_respond() does NOT check for errors returned from proxy_send(). So if the request is marked to be proxied, and the realm is dead, then something wrong happens.

I'll commit a bug fix now. Grab the CVS snapshot from tonight. If this fixes your problem, I think we should release 0.5 ASAP.

Alan DeKok.
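The class of bug described here, a caller that ignores an error return and then uses state the callee never set up, can be sketched as below. The demo_ functions, the return-code convention, and the realm_is_alive parameter are all assumptions for the example. They are not the real proxy_send() or rad_respond() signatures, and this is not the fix that was committed.

```c
#include <stddef.h>

typedef struct demo_proxy_packet {
    long timestamp;
} demo_proxy_packet;

typedef struct demo_request {
    long timestamp;
    demo_proxy_packet *proxy;    /* stays NULL unless proxying succeeds */
    demo_proxy_packet storage;   /* backing store, to avoid malloc here */
} demo_request;

/*
 * Hypothetical stand-in for the proxy step.
 * Returns 1 if the request was proxied, -1 on error (e.g. realm dead).
 */
int demo_proxy_send(demo_request *r, int realm_is_alive)
{
    if (!realm_is_alive)
        return -1;               /* r->proxy is left untouched */

    r->proxy = &r->storage;
    r->proxy->timestamp = r->timestamp;
    return 1;
}

/*
 * Hypothetical caller: checks the return code before touching r->proxy.
 * Returns 0 on success, -1 if the request could not be proxied.
 */
int demo_respond(demo_request *r, int realm_is_alive)
{
    int rcode = demo_proxy_send(r, realm_is_alive);

    if (rcode < 0)
        return -1;               /* bail out instead of using r->proxy */

    return (r->proxy->timestamp == r->timestamp) ? 0 : -1;
}
```

Without the rcode check, the last line of demo_respond() would dereference a NULL r->proxy whenever the realm is dead, which resembles the proxy = 0x0 state seen in the backtraces earlier in the thread.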
Re: FreeRADIUS crashing on Solaris 8
Mitchell, Michael [EMAIL PROTECTED] wrote:
> Setting cleanup_delay to 0 seems to have helped to alleviate the problem, but it does not prevent it. The server tends to run longer before crashing - we're into the minutes now rather than seconds - but the problem is definitely still there.

Damn. It's probably some weird race condition...

> I saw an email in the archives from Alan that from memory said the server should not be considered stable running non-threaded. Is this still the case, or has this code been cleaned up now? It was a reasonably old email I think.

If you run it in '-s' mode, it should be fine.

Alan DeKok.
RE: FreeRADIUS crashing on Solaris 8
Hi Randy,

Thanks for your help.

Setting cleanup_delay to 0 seems to have helped to alleviate the problem, but it does not prevent it. The server tends to run longer before crashing - we're into the minutes now rather than seconds - but the problem is definitely still there.

I saw an email in the archives from Alan that from memory said the server should not be considered stable running non-threaded. Is this still the case, or has this code been cleaned up now? It was a reasonably old email I think.

Thanks,
Michael

-----Original Message-----
From: Randy Moore [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, 16 January 2002 11:15
To: [EMAIL PROTECTED]
Subject: Re: FreeRADIUS crashing on Solaris 8

Hi,

In radiusd.conf, try setting 'cleanup_delay' to 0 or even -1. It seems to be related to some different segfaults I've been seeing. Mine have to do with CHAP requests and SQL authentication.

At 07:47 AM 1/16/2002 +0800, you wrote:
> Hi,
>
> I have the exact same problem (almost the same gdb trace) running on Linux too. It seems that proxy.c is causing the problem. I set up two clients (using radclient) to send the acct records to the radius daemon, which then proxied them to another server. With 1 client everything is fine. Once the next client kicks in, the daemon crashes.
>
> /hh
>
> On Wed, 16 Jan 2002, Mitchell, Michael wrote:
> > Hello list,
> >
> > I'm currently testing freeradius-snapshot-20020114 (configured as a proxy only) on Solaris 8 and running into a problem.
> >
> > radiusd will run for a short (seemingly random) period of time (anywhere from say 10 seconds to 30 seconds), happily processing requests, until it simply dies with a signal 9 (SIGKILL??) and core dumps.
> >
> > The problem also seems to relate to the load put on the server. At 4 or 5 requests per second it will start to exhibit the crashing behaviour within about 30 seconds. At 2 or 3 requests per second it takes several minutes for the problem to appear.
> >
> > The problem has also so far appeared only for accounting requests. Maybe there is a timing issue somewhere, since accounting requests take that much longer to process as my proxy has to wait for a response to come back from the second radius server?
> >
> > Using gdb, it appears that radiusd is crashing in at least a few different places, which is not very helpful, and kind of suggests it may not be an actual bug in FreeRADIUS? Here are three back traces that I captured:
> >
> > #0 0x188b4 in proxy_send (request=0x9cb18) at proxy.c:317
> > 317 request->proxy->timestamp = request->timestamp;
> > (gdb) bt
> > #0 0x188b4 in proxy_send (request=0x9cb18) at proxy.c:317
> > #1 0x15480 in rad_respond (request=0x9cb18, fun=0x170a0 <rad_accounting>) at radiusd.c:1527
> > #2 0x1ecf8 in request_handler_thread (arg=0x98110) at threads.c:169
> > ----------
> > #0 0xff141da4 in t_delete () from /usr/lib/libc.so.1
> > (gdb) bt
> > #0 0xff141da4 in t_delete () from /usr/lib/libc.so.1
> > #1 0xff141998 in realfree () from /usr/lib/libc.so.1
> > #2 0xff14226c in cleanfree () from /usr/lib/libc.so.1
> > #3 0xff1413a0 in _malloc_unlocked () from /usr/lib/libc.so.1
> > #4 0xff141294 in malloc () from /usr/lib/libc.so.1
> > #5 0x22538 in rad_decode (packet=0xa00f8, original=0xa3b68, secret=0x98dec "gloople") at radius.c:1060
> > #6 0x15208 in rad_respond (request=0x98da0, fun=0x170a0 <rad_accounting>) at radiusd.c:1437
> > #7 0x1ecf8 in request_handler_thread (arg=0x982f0) at threads.c:169
> > ----------
> > #0 0x23220 in pairfind (first=0x190, attr=41) at valuepair.c:97
> > 97 first = first->next;
> > (gdb) bt
> > #0 0x23220 in pairfind (first=0x190, attr=41) at valuepair.c:97
> > #1 0x1888c in proxy_send (request=0x9d728) at proxy.c:312
> > #2 0x15480 in rad_respond (request=0x9d728, fun=0x170a0 <rad_accounting>) at radiusd.c:1527
> > #3 0x1ecf8 in request_handler_thread (arg=0xa70f0) at threads.c:169
> >
> > This appears to point back to the threading, but whether it is a Solaris issue or a FreeRADIUS issue I'm not really sure.
> >
> > The log files don't appear (to me) to give a definitive answer to what is happening here, except that at the time of the crash, I'm getting incomplete attribute logging such as:
> >
> > Thread 2 handling request 167, (17 handled so far)
> > Proxy-State = 0x313639
> > Sending Accounting-Response of id 169 to 203.108.109.27:62729
> > Finished request 167
> > Going to the next request
> > Thread 2 waiting to be assigned a request
> > NAS-IP-Address = 203.108.109.27
> > = 1
> > = Async
> > = Start
> > = 123
> > Proxy-State = 169
> > = UNKNOWN-TYPE
> >
> > When I run the server with the -s option