I believe that I can reproduce this, give or take. If I use this Click config:

InfiniteSource
    -> q1 :: ThreadSafeQueue(CAPACITY 10)
    -> rs :: RatedUnqueue(RATE 1000)
    -> q :: ThreadSafeQueue(CAPACITY 10000)
    -> uq :: Unqueue
    -> c :: Counter
    -> d :: Discard;
Script(wait 3, print $(c.count) $(uq.scheduled) $(q.highwater_length),
    write c.reset_counts, loop);
StaticThreadSched(rs 0, uq 1);

and then run multiple instances of click, I get very weird results. I have an 8-core machine, but even running a total of 4 instances of click with this config causes it to get stuck for up to 3 seconds at a time, which corresponds to the "wait 3" in the config (if I change that wait, it gets stuck for longer). If I run with --threads=1, I never see the behavior, even with 20+ click processes running. I believe that I also see the same behavior with a normal Queue instead of ThreadSafeQueue.

I won't have much more time to spend on this for a while, though... The last few bugs along these lines were quite hairy and required adding memory barrier instructions (look at commits 8f521a2d and 01f456ac for an idea). This bug could show up just because the new code is faster, so we are now more likely to hit an existing race condition, or because a new race condition was added. If the same test procedure that worked for me works for you, perhaps you can track it down to one commit that caused this regression?

Cliff

bug:~/co/click/build2$ ./userlevel/click -p 1234 --threads=2 ./foo.click
3020 false 19
2999 false 19
3000 false 19
3000 false 19
3000 false 19
3001 false 19
3000 false 19
-> now I run 3 more processes in another terminal
651 true 2326
5326 false 2328
3001 false 2328
3000 false 2328
868 true 2328
5127 false 2328
3001 false 2328
3000 false 2328
312 true 2688
2885 true 2803
5803 false 2805
214 true 2805
5766 false 2805
3001 false 2805
3001 false 2805

On Wed, Feb 9, 2011 at 5:22 AM, Beyers Cronje <[email protected]> wrote:
> Hi Eddie,
>
> Some info that I'm sure will help debugging.
> I configured Click with --enable-schedule-debugging=extra and also
> enabled NOTIFIERQUEUE_DEBUG.
>
> It seems Unqueue gets stuck in the pending list. See the output below:
>
> read q.length
> 1000
>
> read q.notifier_state
> notifier on
> task 0x19387b0 [uq :: Unqueue] scheduled
>
> read uq.scheduled
> true /* but pending */
>
> read uq.notifier
> empty.0/1:1*
>
> Unqueue stays in this scheduled-but-pending state for an undetermined
> period of time, up to minutes sometimes.
>
> Any idea where I can start on fixing this bug?
>
> Beyers
>
> On Wed, Feb 9, 2011 at 2:52 AM, Beyers Cronje <[email protected]> wrote:
>
> > Update - Strange, after typing my previous email I checked again, and all
> > of a sudden Unqueue was pulling packets again. Not sure if my breaking
> > into Click with gdb kick-started it again :) or if it's an intermittent
> > issue.
> >
> > On Wed, Feb 9, 2011 at 2:41 AM, Beyers Cronje <[email protected]> wrote:
> >
> >> Hi Eddie,
> >>
> >> Since running this merge I've been experiencing issues with usermode
> >> multithreading. I'm using commit 9419098acbdc20837e37f3033c40661809431f8d.
> >> I do believe the issues are related to the changes of this merge, as I
> >> was running the same config on pre-coreperformance merge code without
> >> any issues.
> >>
> >> *Simplified config used:*
> >>
> >> fd::FromDAG(/dev/dag0) -> cl1::Classifier(12/0800) -> MarkIPHeader(14) ->
> >> ipc1::IPClassifier(udp port 1646);
> >> sfp::SFP();
> >> q::ThreadSafeQueue;
> >> log::Logger(sfp, 1800);
> >>
> >> ipc1 -> RadAccounting -> q;
> >> q -> uq::Unqueue -> aupcc::Counter -> sfp;
> >>
> >> StaticThreadSched(fd 0, uq 1, log 2);
> >>
> >> *Problem description:*
> >> There are 3 threads in the config: FromDAG and Unqueue do what their
> >> names indicate, and Logger has a timer that schedules its task once
> >> every 30 min.
> >> Everything runs fine initially, and then somewhere along the line
> >> Unqueue stops pulling packets from the queue, which leads to a constant
> >> queue overflow. When this happens I can see FromDAG is still working, as
> >> q.drops increases constantly, and Logger also fires away every 30 min.
> >> Note the push rate from FromDAG is quite high, but what RadAccounting
> >> pushes into the queue is very low, ~100 pps, which means the queue is
> >> empty most of the time and Unqueue is not scheduled.
> >>
> >> Unfortunately I didn't configure debug scheduling, which would've
> >> helped. I did notice Unqueue.scheduled always returned true, even though
> >> it never actually ran. Not sure if the following will help; gdb shows
> >> the following:
> >>
> >> (gdb) info threads
> >>   3 Thread 0x7fa422559700 (LWP 27205)  0x000000392cedb0b3 in poll () from /lib64/libc.so.6
> >>   2 Thread 0x7fa421d58700 (LWP 27206)  0x000000392cedcee3 in select () from /lib64/libc.so.6
> >> * 1 Thread 0x7fa432cfb740 (LWP 27197)  FromDAG::process_packet (this=0x1afc570, erf_record=0x7fa423258bf0, rlen=<value optimized out>) at ../elements/local/fromdag.cc:193
> >> (gdb) thread 2
> >> [Switching to thread 2 (Thread 0x7fa421d58700 (LWP 27206))]#0  0x000000392cedb0b3 in poll () from /lib64/libc.so.6
> >> (gdb) bt
> >> #0  0x000000392cedb0b3 in poll () from /lib64/libc.so.6
> >> #1  0x000000000058bc36 in Master::run_selects_poll (this=0x1afacc0, thread=0x1afb8b0, more_tasks=false) at ../lib/master.cc:862
> >> #2  0x000000000058c4ec in Master::run_selects (this=0x1afacc0, thread=0x1afb8b0) at ../lib/master.cc:1050
> >> #3  0x000000000057ddf1 in run_os (this=0x1afb8b0) at ../lib/routerthread.cc:447
> >> #4  RouterThread::driver (this=0x1afb8b0) at ../lib/routerthread.cc:568
> >> #5  0x0000000000556dc9 in thread_driver (user_data=<value optimized out>) at click.cc:414
> >> #6  0x000000392d206d5b in start_thread () from /lib64/libpthread.so.0
> >> #7  0x000000392cee4aad in clone () from /lib64/libc.so.6
> >> (gdb) thread 3
> >> [Switching to thread 3 (Thread 0x7fa422559700 (LWP 27205))]#0  0x000000392cedcee3 in select () from /lib64/libc.so.6
> >> (gdb) bt
> >> #0  0x000000392cedcee3 in select () from /lib64/libc.so.6
> >> #1  0x000000000058c4bf in Master::run_selects (this=0x1afacc0, thread=0x1afb7e0) at ../lib/master.cc:1015
> >> #2  0x000000000057ddf1 in run_os (this=0x1afb7e0) at ../lib/routerthread.cc:447
> >> #3  RouterThread::driver (this=0x1afb7e0) at ../lib/routerthread.cc:568
> >> #4  0x0000000000556dc9 in thread_driver (user_data=<value optimized out>) at click.cc:414
> >> #5  0x000000392d206d5b in start_thread () from /lib64/libpthread.so.0
> >> #6  0x000000392cee4aad in clone () from /lib64/libc.so.6
> >> (gdb) thread 1
> >> [Switching to thread 1 (Thread 0x7fa432cfb740 (LWP 27197))]#0  0x000000392d20ebfd in nanosleep () from /lib64/libpthread.so.0
> >> (gdb) bt
> >> #0  FromDAG::process_packet (this=0x1afc570, erf_record=0x7fa423258bf0, rlen=<value optimized out>) at ../elements/local/fromdag.cc:193
> >> #1  0x00000000004d6402 in FromDAG::run_task (this=0x1afc570) at ../elements/local/fromdag.cc:150
> >> #2  0x000000000057dbe6 in fire (this=0x1afb710) at ../include/click/task.hh:612
> >> #3  run_tasks (this=0x1afb710) at ../lib/routerthread.cc:410
> >> #4  RouterThread::driver (this=0x1afb710) at ../lib/routerthread.cc:600
> >> #5  0x0000000000558513 in main (argc=<value optimized out>, argv=<value optimized out>) at click.cc:639
> >>
> >> One thing to note: for various reasons I'm doing something very
> >> un-Click-like with FromDAG, where I allow it to block for up to 10 ms.
> >> For my specific requirements this is not a problem, but it's worth
> >> noting in case it affects the way the new task handling operates.
> >>
> >> Beyers
> >>
> >> On Sun, Feb 6, 2011 at 1:23 AM, Eddie Kohler <[email protected]> wrote:
> >>
> >>> Hi all,
> >>>
> >>> This is just a note to say that I've merged the "coreperformance" branch
> >>> with master. There are several changes that may speed up particularly
> >>> simple configurations, and that (more importantly) may make it easier to
> >>> experiment with different multithreading setups. (For instance I
> >>> believe switching a task from one thread to another is faster now.) Let
> >>> me know if you experience any problems.
> >>>
> >>> Eddie
> >>> _______________________________________________
> >>> click mailing list
> >>> [email protected]
> >>> https://amsterdam.lcs.mit.edu/mailman/listinfo/click
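
The memory-barrier fixes Cliff points to (commits 8f521a2d and 01f456ac) suggest the general shape of these bugs. The following is only an illustrative sketch, not Click's actual code: one thread writes task state and then raises a "scheduled" flag, and the reading thread must pair that release store with an acquire load, or it can observe the flag without the state behind it.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical model of a task wakeup handshake (not Click's real code).
// The scheduling thread publishes work, then raises a flag; the worker
// thread must observe the published work once it sees the flag.
struct Task {
    int work = 0;                       // stands in for the pending-list link
    std::atomic<bool> scheduled{false}; // stands in for the schedule flag
};

void schedule(Task &t) {
    t.work = 42;
    // Release store: everything written above becomes visible to any
    // thread that later observes scheduled == true with acquire ordering.
    t.scheduled.store(true, std::memory_order_release);
}

int run_worker(Task &t) {
    // Acquire load pairs with the release store above. With weaker
    // ordering the worker could see the flag but stale task state --
    // a task that looks scheduled yet does not run correctly.
    while (!t.scheduled.load(std::memory_order_acquire))
        std::this_thread::yield();
    return t.work;
}

int main() {
    Task t;
    std::thread worker([&] { assert(run_worker(t) == 42); });
    schedule(t);
    worker.join();
    return 0;
}
```

On x86 the reordering window is narrow, which fits the symptom of an intermittent stall that only appears under load with multiple processes.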
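
Beyers's "true /* but pending */" reading suggests where the task is sitting. As a purely hypothetical model (the names and structure below are invented for illustration and are not taken from Click's sources): a task scheduled from a foreign thread is first parked on the owner thread's pending list and only joins the run queue when the owner drains that list; if the wakeup that should accompany parking is missed, the task stays scheduled-but-pending until something else (such as a timer) wakes the thread, which would match stalls on the order of the Script's "wait 3".

```cpp
#include <atomic>
#include <cassert>
#include <mutex>
#include <vector>

// Hypothetical sketch of a cross-thread pending list (illustrative only,
// not Click's implementation): remote threads park task ids here, and the
// owner thread drains the list at the top of its driver loop.
struct PendingList {
    std::vector<int> parked;      // tasks parked by remote threads
    std::atomic<bool> any{false}; // "something is pending" marker
    std::mutex lock;

    void park(int task) {
        { std::lock_guard<std::mutex> g(lock); parked.push_back(task); }
        any.store(true, std::memory_order_release);
        // A real scheduler would also wake the owner thread here. If
        // that wakeup is lost, parked tasks sit "scheduled /* but
        // pending */" until the next timer tick wakes the thread.
    }

    // Called by the owner thread; returns the tasks now runnable.
    std::vector<int> drain() {
        std::vector<int> out;
        if (any.exchange(false, std::memory_order_acquire)) {
            std::lock_guard<std::mutex> g(lock);
            out.swap(parked);
        }
        return out;
    }
};
```

Under this model the bug hunt would focus on the handoff between park() and the owner's wakeup check, which is exactly where a missing barrier or a check/sleep race would hide.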
