Richard,

I spent some more time on this today. I am still under the impression
that there is a race between F_SETOWN/F_SETSIG and a counter overflow
with notification.
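The ordering that would rule out such a race can be sketched as follows. This is only a minimal sketch under stated assumptions, not the actual test program: a pipe stands in for the perfmon context fd, start_monitoring() is a hypothetical placeholder for pfm_start(), run_workers() and the counters are invented names, and F_SETOWN targets getpid() purely for illustration. Each worker finishes its fcntl() setup, unmasks SIGIO, and then waits on a barrier, so no thread can start monitoring until every thread is set up.

```c
#define _GNU_SOURCE             /* F_SETSIG is a Linux extension */
#include <assert.h>
#include <fcntl.h>
#include <pthread.h>
#include <signal.h>
#include <stdatomic.h>
#include <unistd.h>

#define MAX_WORKERS 16

static pthread_barrier_t setup_barrier; /* opens once every worker is set up */
static atomic_int setups_done;          /* workers whose fcntl() setup succeeded */
static atomic_int early_starts;         /* starts seen before all setups finished */
static int nworkers;

static void sigio_handler(int sig, siginfo_t *info, void *ctx)
{
    (void)sig; (void)info; (void)ctx;   /* a real handler would read info->si_fd */
}

/* Hypothetical stand-in for pfm_start()/pfm_self_start(). */
static void start_monitoring(void)
{
    if (atomic_load(&setups_done) != nworkers)
        atomic_fetch_add(&early_starts, 1);
}

static void *worker(void *arg)
{
    int fd = *(int *)arg;               /* stand-in for the perfmon context fd */
    sigset_t sigio;
    int flags = fcntl(fd, F_GETFL);

    /* Step 1: complete all notification setup for this thread's fd.
     * getpid() is an assumption; the real program may name a thread. */
    if (flags != -1 &&
        fcntl(fd, F_SETFL, flags | O_ASYNC) != -1 &&
        fcntl(fd, F_SETOWN, getpid()) != -1 &&
        fcntl(fd, F_SETSIG, SIGIO) != -1)
        atomic_fetch_add(&setups_done, 1);

    /* Step 2: unmask SIGIO in this thread only (the master keeps it blocked). */
    sigemptyset(&sigio);
    sigaddset(&sigio, SIGIO);
    pthread_sigmask(SIG_UNBLOCK, &sigio, NULL);

    /* Step 3: wait until every worker has reached this point. */
    pthread_barrier_wait(&setup_barrier);

    /* Step 4: only now may any thread begin generating notifications. */
    start_monitoring();
    return NULL;
}

/* Returns the number of premature starts (0 expected), or -1 on error. */
int run_workers(int n)
{
    pthread_t tid[MAX_WORKERS];
    int pipes[MAX_WORKERS][2];
    struct sigaction sa = {0};
    sigset_t sigio;
    int i;

    assert(n > 0 && n <= MAX_WORKERS);
    nworkers = n;
    atomic_store(&setups_done, 0);
    atomic_store(&early_starts, 0);

    /* Install the handler before any fd can raise SIGIO. */
    sa.sa_sigaction = sigio_handler;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGIO, &sa, NULL) != 0)
        return -1;

    /* Block SIGIO in the master thread; workers inherit the mask. */
    sigemptyset(&sigio);
    sigaddset(&sigio, SIGIO);
    pthread_sigmask(SIG_BLOCK, &sigio, NULL);

    if (pthread_barrier_init(&setup_barrier, NULL, (unsigned)n) != 0)
        return -1;
    for (i = 0; i < n; i++)
        if (pipe(pipes[i]) != 0)
            return -1;
    for (i = 0; i < n; i++)
        if (pthread_create(&tid[i], NULL, worker, &pipes[i][0]) != 0)
            return -1;
    for (i = 0; i < n; i++)
        pthread_join(tid[i], NULL);
    for (i = 0; i < n; i++) {
        close(pipes[i][0]);             /* read end first, then write end */
        close(pipes[i][1]);
    }
    pthread_barrier_destroy(&setup_barrier);
    return atomic_load(&setups_done) == n ? atomic_load(&early_starts) : -1;
}
```

With this structure, run_workers(4) should report zero premature starts. If the real program still receives the wrong fd under the same ordering, that would suggest the cause lies somewhere other than the setup order.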
Is there a way you could modify your program such that no thread calls
pfm_start() before all setups are done? Then I suspect the problem will
disappear.

On Mon, Apr 30, 2007 at 05:21:47PM -0400, Richard C Bilson wrote:
> Stephane,
>
> > I think the core issue is that you have a race condition in your
> > program between the various worker threads. The race has to do with
> > dosignal(). I noticed that you get the wrong fd almost instantly when
> > you hit the problem. I think this may be due to the fact that you have
> > a race between one thread starting monitoring and generating samples
> > and thus notification vs. another thread coming online, i.e., starting
> > to execute dosignal(). I am not 100% sure this is the problem because the
> > monitoring thread has set its F_SETSIG, so it should be the only one
> > receiving the signal, yet I have not verified the logic in the kernel.
> > It may be that if a thread has not yet set its F_SETSIG, then it may be
> > chosen first.
>
> If this is correct, then the meaning of this combination of F_SETSIG
> with F_SETOWN is evidently "send SIGIO to this thread, or to any other
> thread that hasn't called F_SETSIG on a different fd." Describing the
> problem as a "race" in my program assumes this behavior. However, this
> doesn't seem to me to be obvious or particularly useful behavior. I'd
> go so far as to call it a bug.
>
> It's easy enough to test this hypothesis by setting up the handler and
> unmasking SIGIO after the calls to fcntl, but before the call to
> pfm_self_start. This ought to remove the "race" that concerns you. I
> just tried this, and it still fails.
>
> > I simply added a big sleep between dosignal() and the
> > beginning of active monitoring.
>
> I'm not sure that a "big" sleep makes a difference, since the threads
> are created in quick succession and likely sleep concurrently.
>
> What would you expect the results of such an experiment to be? I see it
> fail less often, but it still fails.
>
> > The other thing I did to the program is that I explicitly blocked SIGIO
> > in the master thread.
>
> This doesn't make an obvious difference on my machine, but it's a good
> idea.
>
> Thanks for your continuing investigation.
> _______________________________________________
> perfmon mailing list
> [email protected]
> http://www.hpl.hp.com/hosted/linux/mail-archives/perfmon/

--
-Stephane
