Dear everyone, I'm facing a problem somewhere between G++ and GNU STL, LinuxThreads, signals and setitimer - or maybe just my "delete while traversing a set" algorithm.
Can't say exactly where the problem is - I have tested all those elements individually on past occasions, but they don't seem to work together in this case :-) It might indeed be a multi-way clash, or it might be a trivial coding error. I have a for(;;) loop, iterating over a set<>, deleting elements during the iteration. That iteration runs in a "consumer" thread. Elements are added to the set<> in a "producer" thread. Both these threads run along timers, implemented using setitimer(), and synchronize using posix mutex calls wrapped around the critical sections touching the set<>. In addition, on program startup and shutdown, the program main thread uses SIGALRM (and a bool latch) to invoke shutdown in the two worker threads. The problem is, that when I invoke shutdown via a SIGTERM or SIGINT to the program main thread, the loop iteration goes awry in an interesting way - one loop cycle runs to the closing bracket, and then the thread starts to eat 100% of CPU time without starting another loop cycle, but also without breaking the loop. As if, while evaluating the loop condition, the comparison operator of the set<>::iterator got trapped in an endless loop... I've discovered the problem while working on a somewhat complex piece of software, involving several producer/consumer/timer threads - but I've managed to narrow down the problem to a simpler example. You can check out the "small" demonstration example here: http://sweb.cz/Frantisek.Rysanek/cc_test.tgz The key part of the code can be summarized in an even shorter form - take a look at the code snippet below. Please note especially the for(;;) loop that attempts to delete items of a set<> while iterating. // <SNIPPET> class timer_job { public: struct timeval time_of_enq; int check(); // returns a "1" if timed out, returns a "0" if OK } set<timer_job*> jobs; void check_jobs() { set<timer_job*>::iterator i, deletion_candidate; bool was_first; for (i = jobs.begin(); i != jobs.end(); i++) { // Check each job in our queue/set, // to see if by any chance it has timed out deletion_candidate = i; if (i == jobs.begin()) was_first = true; else was_first = false; if ((*i)->check() > 0) // if job has timed out { // do not decrement past the array base if (! was_first) i--; delete (*deletion_candidate); // kick off, you jobs.erase(deletion_candidate); // reset ptr to the new array base if (was_first) i = jobs.begin(); // else OK, go ahead from the previous element, // incrementing at loop end } // else OK, this job is still pending } } // </SNIPPET> If I've already answered my question (can't delete while iterating), please suggest a workaround. In my practical scenario, I do need to keep my data in a set<>, rather than e.g. vector<> or queue<>, because i need to check for unique occurrence when inserting the set elements, and the tree rebalancing doesn't hurt that bad. In the past, when on one occasion I used GLIB gtrees, I used to stuff my elected "deletion candidates" into a scratchpad vector<> during the traversal, and deleted from there afterwards... In the real-world software piece that I'm working on, the loop goes nuts, in the same way, even without any signals flying or threads terminating. I've tried using sleep() and usleep() for the timing, which didn't make a difference. Blocking SIGALRM for the critical sections doesn't make a difference either. If I add debugging instrumentation (syslog calls) to any of the threads, it often does change behavior a bit - the flaw occurs less frequently, it happens in another similar thread, two threads go nuts at the same time etc. (At timing periods in the hundreds of milliseconds, syslog calls tend to flood syslog and load the system in general.) If I comment out the whole for(;;) loop, the example code runs and terminates just fine. Both the example and my real-world code misbehave consistently if compiled by GCC 2.96 on Linux 2.2 (RH 6.2) or GCC 3.x on Linux 2.4 (RH 8.0, MDK 9.1), and on various hardware platforms: Intel Pentium MMX @ 166 MHz, Cyrix/NS/AMD Geode GX1 @ 233 MHz, Intel P4 @ 1.7 GHz. I sure have further ideas of what else to try, but maybe someone will point out my mistake straight away, before I take the time to try all the dead ends :-) Specifically I'd be interested to know if some of the components that I'm (ab)using are potentially thread- and signal-unsafe: GNU STL, SGI STL, GNU libstdc++, GNU G++, LinuxThreads, setitimer()... Please note that I'm calling setitimer() from several threads in parallel, and there does seem to be a per-thread timer, at least in LinuxThreads, not sure about NPTL. I'd crosspost to comp.lang.c++, but I'd hate to become yet another off-topic poster in that very focused and useful newsgroup. Thanks for any ideas Frank Rysanek _______________________________________________ Help-gplusplus mailing list Help-gplusplus@gnu.org http://lists.gnu.org/mailman/listinfo/help-gplusplus