On Jan 13, 2012, at 7:29 AM, Nick Mathewson wrote:

> On Fri, Jan 13, 2012 at 7:47 AM, Ralph Castain <r...@open-mpi.org> wrote:
>> I've been digging further into this, and I believe I have much of it
>> resolved now. However, I have encountered a problem that appears to be
>> something in libevent itself.
>>
>> I configured libevent with debug enabled, and turned it on at execution -
>> and was barraged by:
>>
>> [warn] select: Invalid argument
>>
>> Digging further into the reason, I found that the message comes from the
>> following code in select_dispatch (file select.c):
>
> Weird that you're using select.c; nearly any other backend would be faster.

It's on a Mac, so select is the option and speed isn't really an issue. We forcibly configure it there for OMPI purposes. :-/

 * Default to select() on OS X and poll() everywhere else because
 * various parts of OMPI / ORTE use libevent with pty's. pty's
 * *only* work with select on OS X (tested on Tiger and Leopard);
 * we *know* that both select and poll work with pty's everywhere
 * else we care about (other mechanisms such as epoll *may* work
 * with pty's -- we have not tested comprehensively with newer
 * versions of Linux, etc.). So the safe thing to do is:
 *
 * - On OS X, default to using "select" only
 * - Everywhere else, default to using "poll" only (because poll
 *   is more scalable than select)

>
>>
>> res = select(nfds, sop->event_readset_out,
>>     sop->event_writeset_out, NULL, tv);
>>
>> EVBASE_ACQUIRE_LOCK(base, th_base_lock);
>>
>> check_selectop(sop);
>>
>> if (res == -1) {
>>     if (errno != EINTR) {
>>         event_warn("select");
>>         return (-1);
>>     }
>>
>>     return (0);
>> }
>>
>> The timeout value being supplied to select_dispatch is being corrupted after
>> the first time thru the routine - it comes into the routine the first time
>> as {0, 0}, but is an illegal value thereafter. Resetting the timeout to the
>> original value resolves the problem.
>
> What kind of illegal value are you seeing,

1326467251, 774650

> coming from where?

I'm not sure who calls "select_dispatch" - the value is passed into it.

> Are you
> using the common_timeout code?

This is just flowing thru from a call to event_loop - I'm not sure of the progression that takes us down to select_dispatch.

> What are you doing to "reset the
> timeout" ?

Just hacked things to save the value from the first call into the function, then replace it if there is a problem:

    static struct timeval rhctv;
    static int rhcfirst=1;
    static int rhccnt=0;
    static int rhcretry=0;

    static int
    select_dispatch(struct event_base *base, struct timeval *tv)
    {
        int res=0, i, j, nfds;
        struct selectop *sop = base->evbase;

        /* Remember the timeout passed in on the very first call */
        if (1 == rhcfirst) {
            fprintf(stderr, "ORIGINAL TV %d sec %d usec\n",
                    (int)tv->tv_sec, (int)tv->tv_usec);
            rhctv.tv_sec = tv->tv_sec;
            rhctv.tv_usec = tv->tv_usec;
            rhcfirst = 0;
        }
        rhccnt++;
        rhcretry = 0;

        check_selectop(sop);
        if (sop->resize_out_sets) {
            fd_set *readset_out=NULL, *writeset_out=NULL;
            size_t sz = sop->event_fdsz;
            if (!(readset_out = mm_realloc(sop->event_readset_out, sz)))
                return (-1);
            sop->event_readset_out = readset_out;
            if (!(writeset_out = mm_realloc(sop->event_writeset_out, sz))) {
                /* We don't free readset_out here, since it was
                 * already successfully reallocated. The next time
                 * we call select_dispatch, the realloc will be a
                 * no-op. */
                return (-1);
            }
            sop->event_writeset_out = writeset_out;
            sop->resize_out_sets = 0;
        }

        memcpy(sop->event_readset_out, sop->event_readset_in,
               sop->event_fdsz);
        memcpy(sop->event_writeset_out, sop->event_writeset_in,
               sop->event_fdsz);

        nfds = sop->event_fds+1;

    retry:
        EVBASE_RELEASE_LOCK(base, th_base_lock);

        res = select(nfds, sop->event_readset_out,
                     sop->event_writeset_out, NULL, tv);

        EVBASE_ACQUIRE_LOCK(base, th_base_lock);

        check_selectop(sop);

        if (res == -1) {
            if (errno != EINTR) {
                event_warn("select");
                /* Casts added so the arguments match the %d conversions */
                fprintf(stderr, "TV OUT OF SPEC AT CNT %d: value %d:%d\n",
                        rhccnt, (int)tv->tv_sec, (int)tv->tv_usec);
                /* Put back the saved value and retry once */
                tv->tv_sec = rhctv.tv_sec;
                tv->tv_usec = rhctv.tv_usec;
                if (0 == rhcretry) {
                    rhcretry = 1;
                    goto retry;
                } else {
                    exit(0);
                }
                return (-1);
            }

            return (0);
        }
    ...

Retrying select with the corrected value always succeeds. It's clearly being overwritten somewhere, but I don't know enough of libevent's internal call sequence to figure out where/why.
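
For anyone who wants to poke at this outside libevent, here is a minimal standalone sketch of the same idea in a less hacky form: check the timeout before handing it to select(), and call select() on a private copy so the call itself can never alter the caller's struct timeval (POSIX allows select() to modify the timeout object). The helper names and SKETCH_MAX_TIMEOUT_SEC are hypothetical and are not libevent APIs; the 31-day bound is only the minimum that POSIX requires implementations to support, not a limit confirmed for OS X. It does not explain where the value gets overwritten. As a side observation, 1326467251 looks like wall-clock seconds since the epoch (mid-January 2012) rather than an interval, but that is a guess about the cause.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/select.h>
#include <sys/time.h>

/* Hypothetical upper bound for this sketch: 31 days, the minimum maximum
 * timeout POSIX requires select() to support. Not a libevent constant. */
#define SKETCH_MAX_TIMEOUT_SEC (31L * 24L * 60L * 60L)

/* Return 1 if tv looks like a usable relative timeout, 0 otherwise.
 * NULL is accepted because select() treats it as "block indefinitely". */
static int timeval_is_sane(const struct timeval *tv)
{
    if (tv == NULL)
        return 1;
    if (tv->tv_sec < 0 || tv->tv_sec > SKETCH_MAX_TIMEOUT_SEC)
        return 0;
    if (tv->tv_usec < 0 || tv->tv_usec >= 1000000)
        return 0;
    return 1;
}

/* Call select() with a private copy of the timeout, so the caller's
 * struct timeval is left untouched. A timeout that fails the sanity
 * check is rejected with EINVAL before select() is called at all. */
static int select_with_copied_timeout(int nfds, fd_set *rd, fd_set *wr,
                                      const struct timeval *tv)
{
    struct timeval copy;
    struct timeval *arg = NULL;

    assert(nfds >= 0);

    if (!timeval_is_sane(tv)) {
        errno = EINVAL;
        return -1;
    }
    if (tv != NULL) {
        copy = *tv;      /* select() may modify this copy only */
        arg = &copy;
    }
    return select(nfds, rd, wr, NULL, arg);
}
```

Dropping a check like timeval_is_sane() at the top of select_dispatch, and at each caller up the chain, would at least narrow down which layer first sees the bad value.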

Note that this comes after loops through that event create/activate sequence we were discussing. I'm going to try and see if a minimal reproducer can be created based on that code.

>
> --
> Nick
> ***********************************************************************
> To unsubscribe, send an e-mail to majord...@freehaven.net with
> unsubscribe libevent-users in the body.