Hi Eddie,

I've reverted to the pre-coreperformance code on the "semi-production" server for now, and will see if I can replicate the issue on my dev server. Hopefully I can use Cliff's test case to replicate and troubleshoot. Will keep you posted. In all honesty, though, I haven't wrapped my head around memory barriers and the like, so I'm not sure how far I'll get on my own :(
Cliff, care to host an online advanced multithreading course? :)

Beyers

On Thu, Feb 10, 2011 at 4:15 PM, Eddie Kohler <[email protected]> wrote:
> I'm very sorry about this regression... Unfortunately due to travel it's
> going to be hard to look at this until the weekend. I would rather not yet
> revert coreperformance. Can you handle this situation?
>
> Eddie
>
> On 02/09/2011 05:22 AM, Beyers Cronje wrote:
>> Hi Eddie,
>>
>> Some info that I'm sure will help debugging. I configured click with
>> --enable-schedule-debugging=extra and also enabled NOTIFIERQUEUE_DEBUG.
>>
>> It seems Unqueue gets stuck in the pending list. See the output below:
>>
>> read q.length
>> 1000
>>
>> read q.notifier_state
>> notifier on
>> task 0x19387b0 [uq :: Unqueue] scheduled
>>
>> read uq.scheduled
>> true /* but pending */
>>
>> read uq.notifier
>> empty.0/1:1*
>>
>> Unqueue stays in this scheduled-but-pending state for an undetermined
>> period of time, up to minutes at times.
>>
>> Any idea where I can start on fixing this bug?
>>
>> Beyers
>>
>> On Wed, Feb 9, 2011 at 2:52 AM, Beyers Cronje <[email protected]> wrote:
>>
>> Update - Strange, after typing my previous email I checked again, and all
>> of a sudden Unqueue was pulling packets again. Not sure if my breaking
>> into Click with gdb kick-started it again :) or if it's an intermittent
>> issue.
>>
>> On Wed, Feb 9, 2011 at 2:41 AM, Beyers Cronje <[email protected]> wrote:
>>
>> Hi Eddie,
>>
>> Since running this merge I've been experiencing issues with usermode
>> multithreading. I'm using commit 9419098acbdc20837e37f3033c40661809431f8d.
>> I do believe the issues are related to the changes in this merge, as I
>> was running the same config on pre-coreperformance merge code without
>> any issues.
>>
>> *Simplified Config used:*
>>
>> fd::FromDAG(/dev/dag0) -> cl1::Classifier(12/0800) -> MarkIPHeader(14)
>>     -> ipc1::IPClassifier(udp port 1646);
>> sfp::SFP();
>> q::ThreadSafeQueue;
>> log::Logger(sfp, 1800);
>>
>> ipc1 -> RadAccounting -> q;
>> q -> uq::Unqueue -> aupcc::Counter -> sfp;
>>
>> StaticThreadSched(fd 0, uq 1, log 2);
>>
>> *Problem Description:*
>> The config uses three threads: FromDAG and Unqueue do what their names
>> indicate, and Logger has a timer that schedules its task once every
>> 30 min. Everything runs fine initially, and then somewhere along the
>> line Unqueue stops pulling packets from the queue, which leads to a
>> constant queue overflow. When this happens I can see FromDAG is still
>> working, as q.drops increases constantly, and Logger also fires away
>> every 30 min. Note the push rate from FromDAG is quite high, but what
>> RadAccounting pushes into the queue is very low (~100 pps), which means
>> the queue is empty most of the time and Unqueue is not scheduled.
>>
>> Unfortunately I didn't configure debug scheduling, which would have
>> helped. I did notice Unqueue.scheduled always returned true, even
>> though it never actually ran.
>> Not sure if the following will help; gdb shows the following:
>>
>> (gdb) info threads
>>   3 Thread 0x7fa422559700 (LWP 27205)  0x000000392cedb0b3 in poll ()
>>      from /lib64/libc.so.6
>>   2 Thread 0x7fa421d58700 (LWP 27206)  0x000000392cedcee3 in select ()
>>      from /lib64/libc.so.6
>> * 1 Thread 0x7fa432cfb740 (LWP 27197)  FromDAG::process_packet
>>      (this=0x1afc570, erf_record=0x7fa423258bf0, rlen=<value optimized out>)
>>      at ../elements/local/fromdag.cc:193
>> (gdb) thread 2
>> [Switching to thread 2 (Thread 0x7fa421d58700 (LWP 27206))]#0
>> 0x000000392cedb0b3 in poll () from /lib64/libc.so.6
>> (gdb) bt
>> #0  0x000000392cedb0b3 in poll () from /lib64/libc.so.6
>> #1  0x000000000058bc36 in Master::run_selects_poll (this=0x1afacc0,
>>     thread=0x1afb8b0, more_tasks=false) at ../lib/master.cc:862
>> #2  0x000000000058c4ec in Master::run_selects (this=0x1afacc0,
>>     thread=0x1afb8b0) at ../lib/master.cc:1050
>> #3  0x000000000057ddf1 in run_os (this=0x1afb8b0) at ../lib/routerthread.cc:447
>> #4  RouterThread::driver (this=0x1afb8b0) at ../lib/routerthread.cc:568
>> #5  0x0000000000556dc9 in thread_driver (user_data=<value optimized out>)
>>     at click.cc:414
>> #6  0x000000392d206d5b in start_thread () from /lib64/libpthread.so.0
>> #7  0x000000392cee4aad in clone () from /lib64/libc.so.6
>> (gdb) thread 3
>> [Switching to thread 3 (Thread 0x7fa422559700 (LWP 27205))]#0
>> 0x000000392cedcee3 in select () from /lib64/libc.so.6
>> (gdb) bt
>> #0  0x000000392cedcee3 in select () from /lib64/libc.so.6
>> #1  0x000000000058c4bf in Master::run_selects (this=0x1afacc0,
>>     thread=0x1afb7e0) at ../lib/master.cc:1015
>> #2  0x000000000057ddf1 in run_os (this=0x1afb7e0) at ../lib/routerthread.cc:447
>> #3  RouterThread::driver (this=0x1afb7e0) at ../lib/routerthread.cc:568
>> #4  0x0000000000556dc9 in thread_driver (user_data=<value optimized out>)
>>     at click.cc:414
>> #5  0x000000392d206d5b in start_thread () from /lib64/libpthread.so.0
>> #6  0x000000392cee4aad in clone () from /lib64/libc.so.6
>> (gdb) thread 1
>> [Switching to thread 1 (Thread 0x7fa432cfb740 (LWP 27197))]#0
>> 0x000000392d20ebfd in nanosleep () from /lib64/libpthread.so.0
>> (gdb) bt
>> #0  FromDAG::process_packet (this=0x1afc570, erf_record=0x7fa423258bf0,
>>     rlen=<value optimized out>) at ../elements/local/fromdag.cc:193
>> #1  0x00000000004d6402 in FromDAG::run_task (this=0x1afc570)
>>     at ../elements/local/fromdag.cc:150
>> #2  0x000000000057dbe6 in fire (this=0x1afb710) at ../include/click/task.hh:612
>> #3  run_tasks (this=0x1afb710) at ../lib/routerthread.cc:410
>> #4  RouterThread::driver (this=0x1afb710) at ../lib/routerthread.cc:600
>> #5  0x0000000000558513 in main (argc=<value optimized out>,
>>     argv=<value optimized out>) at click.cc:639
>>
>> One thing to note: for various reasons I'm doing something very
>> un-Click-like with FromDAG, where I allow it to block for up to 10 ms.
>> For my specific requirements this is not a problem, but it's worth
>> noting in case it affects the way the new task handling operates.
>>
>> Beyers
>>
>> On Sun, Feb 6, 2011 at 1:23 AM, Eddie Kohler <[email protected]> wrote:
>>
>> Hi all,
>>
>> This is just a note to say that I've merged the "coreperformance"
>> branch with master. There are several changes that may speed up
>> particularly simple configurations, and that (more importantly) may
>> make it easier to experiment with different multithreading setups.
>> (For instance, I believe switching a task from one thread to another
>> is faster now.) Let me know if you experience any problems.
>>
>> Eddie

_______________________________________________
click mailing list
[email protected]
https://amsterdam.lcs.mit.edu/mailman/listinfo/click
