Thanks for the heads up, I will update.

On Wed, Oct 20, 2010 at 8:16 AM, Cliff Frey <[email protected]> wrote:
> Beyers,
>
> Eddie and I discovered and fixed a few other (very evil) memory barrier
> bugs, so you should probably update.
>
> Cliff
>
> On Fri, Oct 15, 2010 at 7:40 AM, Eddie Kohler <[email protected]> wrote:
>
>> Great, glad it's fixed!
>> E
>>
>> On 10/15/2010 06:35 AM, Beyers Cronje wrote:
>>
>>> Hi Eddie,
>>>
>>> I'm happy to report that I've been unable to reproduce the bug using the
>>> latest commit. So as far as I can tell this fixed the issue I was having.
>>> Eddie, Cliff, you guys rock, thanks so much for your assistance on this.
>>>
>>> Kind regards
>>> Beyers Cronje
>>>
>>> On Fri, Oct 15, 2010 at 6:24 AM, Eddie Kohler <[email protected]> wrote:
>>>
>>> Beyers,
>>>
>>> I believe I've found the problem... or A problem. What does this
>>> commit do for you?
>>>
>>> http://github.com/kohler/click/commit/96471bf1c0873447de007969f5a1e49b1a947071
>>>
>>> Eddie
>>>
>>> On 10/13/2010 05:01 PM, Beyers Cronje wrote:
>>>
>>> Hi Eddie,
>>>
>>> Thank you for your response and for looking into this. I keep forgetting
>>> about your excellent Script element; during my tests I used a much more
>>> rudimentary method of verifying the issue :) I used a Counter element
>>> before and after the queue and manually read the counters' rate and
>>> count handlers. After a while I noticed that both counters' rates dropped
>>> to 0 and both counters' counts stopped increasing, *permanently*. From
>>> what I can see the tasks NEVER re-schedule themselves.
>>>
>>> The only way I found to reschedule them was to manually issue a reset on
>>> the queue, which seems to re-activate the notifier signals and in turn
>>> schedule the tasks.
>>>
>>> One thing I forgot to mention: I set SLEEPINESS_TRIGGER to 1, which seems
>>> to trigger the condition quicker. Also, strangely, if I set the first
>>> InfiniteSource LIMIT to 1000 instead of -1 (as per the original config
>>> posted) it also seems to trigger the condition much quicker; in fact I
>>> couldn't reproduce the issue with both InfiniteSource elements set to -1
>>> during a 10min test run, whereas with LIMIT 1000 it always happens within
>>> 10min of testing time.
>>>
>>> I'm out of the office now, but will test with both your script and
>>> Cliff's script tomorrow and post back my results.
>>>
>>> Beyers
>>>
>>> On Wed, Oct 13, 2010 at 5:06 PM, Eddie Kohler <[email protected]> wrote:
>>>
>>> Hi Beyers!
>>>
>>> I am not sure that I agree with your analysis about the queue
>>> getting "stuck."
>>>
>>> Here is the configuration I used, a tweak of yours:
>>>
>>> //////////////////////////////////////////////////////////////
>>>
>>> is1::InfiniteSource(DATA \<00 00 c0 ae 67 ef 00 00 00 00 00 00 08 00>,
>>>     LIMIT -1, STOP false) -> q1::FullNoteQueue -> uq1::Unqueue -> Discard;
>>>
>>> is2::InfiniteSource(DATA \<00 00 c0 ae 67 ef 00 00 00 00 00 00 08 00>,
>>>     LIMIT -1, STOP false) -> q2::FullNoteQueue -> uq2::Unqueue -> Discard;
>>>
>>> StaticThreadSched(is1 0, uq1 1, is2 2, uq2 3);
>>>
>>> Script(TYPE ACTIVE,
>>>     init x 0,
>>>     set x $(add $x 1),
>>>     goto foo $(or $(is1.scheduled) $(uq1.scheduled)),
>>>     print "$x: is1.scheduled false, uq1.scheduled false, q1.length $(q1.length)",
>>>     label foo,
>>>     goto bar $(or $(is2.scheduled) $(uq2.scheduled)),
>>>     print "$x: is2.scheduled false, uq2.scheduled false, q2.length $(q2.length)",
>>>     label bar,
>>>     wait 0.5s,
>>>     loop);
>>>
>>> //////////////////////////////////////////////////////////////
>>>
>>> This should print a message whenever neither source nor sink is scheduled.
>>>
>>> I also tried this with "FullNoteQueue" replaced with "ThreadSafeQueue."
>>>
>>> If a queue was truly getting 'stuck,' we would expect to see something like
>>>
>>> 745: is1.scheduled false, uq1.scheduled false
>>> 746: is1.scheduled false, uq1.scheduled false
>>> 747: is1.scheduled false, uq1.scheduled false
>>> 748: is1.scheduled false, uq1.scheduled false
>>> 749: is1.scheduled false, uq1.scheduled false
>>> ...
>>>
>>> with no gaps, because something would go unscheduled *and stay that way*.
>>>
>>> But that's not what I see. I see this:
>>>
>>> 1: is1.scheduled false, uq1.scheduled false, q1.length 17
>>> 1: is2.scheduled false, uq2.scheduled false, q2.length 0
>>> 20: is1.scheduled false, uq1.scheduled false, q1.length 14
>>> 22: is1.scheduled false, uq1.scheduled false, q1.length 0
>>> 28: is1.scheduled false, uq1.scheduled false, q1.length 14
>>> 41: is1.scheduled false, uq1.scheduled false, q1.length 0
>>> 47: is2.scheduled false, uq2.scheduled false, q2.length 0
>>> 55: is2.scheduled false, uq2.scheduled false, q2.length 0
>>> 56: is2.scheduled false, uq2.scheduled false, q2.length 0
>>> 76: is2.scheduled false, uq2.scheduled false, q2.length 0
>>>
>>> *** Now this, on the other hand, I would expect. In a multithreaded
>>> environment, notification will cause some delay between the upstream
>>> & downstream sides of a queue. This is because the upstream side
>>> must wake up the downstream side, or vice versa. The cross-thread
>>> communication required for this can happen a bit slowly, leaving a
>>> moment (maybe more than a moment) where neither task appears to be
>>> scheduled. (In fact, at least one of the tasks will be "pre-scheduled.")
>>>
>>> So my questions for you are these: Did you really mean "stuck"?
>>> Have you verified that the tasks NEVER re-schedule themselves? Do
>>> you agree with my analysis? Or are you saying that the "pauses"
>>> are too long, even though the tasks do reschedule themselves?
>>>
>>> Eddie
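
(An illustrative aside: the handoff described above can be modeled in a few
dozen lines of C++. The sketch below is hypothetical -- ToyTask, poll(), and
the flag names are stand-ins, not Click's Task or Notifier API -- but it shows
how a wakeup can sit "pending" on one thread while neither task reports itself
as scheduled, which is the kind of transient gap the Script above catches.)

//////////////////////////////////////////////////////////////
// Toy model of a cross-thread queue handoff. Illustrative only.
#include <atomic>
#include <chrono>
#include <cstdio>
#include <thread>

struct ToyTask {
    std::atomic<bool> scheduled{true};
    std::atomic<bool> pending{false};     // wakeup posted, not yet picked up
    void wake()  { pending = true; }      // called from the other thread
    void sleep() { scheduled = false; }   // called from the owning thread
    bool poll() {                         // owning thread's scheduler loop
        if (!scheduled.load() && pending.exchange(false))
            scheduled = true;
        return scheduled.load();
    }
};

std::atomic<int>  len{0};                 // queue occupancy, capacity 1000
std::atomic<bool> run{true};
ToyTask src_task, sink_task;

int main() {
    std::thread src([] {                  // push side (like is1 -> q1)
        while (run) {
            if (!src_task.poll()) { std::this_thread::yield(); continue; }
            if (len.load() >= 1000) { src_task.sleep(); continue; }
            len.fetch_add(1);
            if (!sink_task.scheduled.load())
                sink_task.wake();         // cross-thread wakeup is not instant
        }
    });
    std::thread sink([] {                 // pull side (like q1 -> uq1)
        while (run) {
            if (!sink_task.poll()) { std::this_thread::yield(); continue; }
            if (len.load() == 0) { sink_task.sleep(); continue; }
            len.fetch_sub(1);
            if (!src_task.scheduled.load())
                src_task.wake();
        }
    });
    std::thread monitor([] {              // plays the role of the Script above
        for (int i = 0; i < 20; ++i) {
            if (!src_task.scheduled.load() && !sink_task.scheduled.load())
                std::printf("%d: neither side scheduled, len %d\n", i, len.load());
            std::this_thread::sleep_for(std::chrono::milliseconds(500));
        }
        run = false;
    });
    monitor.join(); src.join(); sink.join();
}
//////////////////////////////////////////////////////////////

(The monitor samples every 500 ms, like the Script, so it only occasionally
lands inside such a window, and in this model the window always closes once
the sleeping side's thread polls its pending flag.)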
>>>
>>> On 10/12/2010 05:23 PM, Beyers Cronje wrote:
>>>
>>> Hi Eddie, list,
>>>
>>> I've come across a multithreading bug in queue signal handling which I can
>>> reproduce fairly consistently with userlevel multithreaded Click. The
>>> symptoms are that either the upstream or the downstream task of the queue
>>> is unscheduled even though the notifier signal is active; this occurs when
>>> the queue becomes either full or empty. To illustrate, here is the config
>>> I use and some debug handler outputs:
>>>
>>> //******************** Config ************************
>>> is1::InfiniteSource(DATA \<00 00 c0 ae 67 ef 00 00 00 00 00 00 08 00>,
>>>     LIMIT 1000, STOP false) -> ThreadSafeQueue -> uq1::Unqueue -> Discard;
>>> is2::InfiniteSource(DATA \<00 00 c0 ae 67 ef 00 00 00 00 00 00 08 00>,
>>>     LIMIT -1, STOP false) -> q::ThreadSafeQueue -> uq2::Unqueue -> Discard;
>>>
>>> StaticThreadSched(is1 0, uq1 1, is2 2, uq2 3);
>>>
>>> //******************** Debug Handler Output when upstream push task is stuck ************************
>>> read q.length
>>> 200 Read handler 'q.length' OK
>>> DATA 1
>>> 0
>>>
>>> read q.fullnote_state
>>> 200 Read handler 'q.fullnote_state' OK
>>> DATA 131
>>> empty notifier off
>>> task 0x25b8830 [uq2 :: Unqueue] unscheduled
>>> full notifier on
>>> task 0x25b8350 [is2 :: InfiniteSource] unscheduled
>>>
>>> //******************** Debug Handler Output when downstream pull task is stuck ************************
>>> read q.length
>>> 200 Read handler 'q.length' OK
>>> DATA 4
>>> 1000
>>>
>>> read q.fullnote_state
>>> 200 Read handler 'q.fullnote_state' OK
>>> DATA 131
>>> empty notifier on
>>> task 0x1c6f830 [uq2 :: Unqueue] unscheduled
>>> full notifier off
>>> task 0x1c6f350 [is2 :: InfiniteSource] unscheduled
>>> //*****************************************************************
>>>
>>> Clearly the notifier states are correct, but somehow the relevant task is
>>> not rescheduled. The above config uses ThreadSafeQueue, but I verified that
>>> the same issue occurs when using FullNoteQueue.
>>>
>>> The obvious places to look are ActiveNotifier::set_active and
>>> FullNoteQueue::push_success/push_failure/pull_success/pull_failure,
>>> but so far I haven't spotted anything wrong with the relevant code;
>>> clearly I'm overlooking something.
>>>
>>> Have you or anyone else on the list got any suggestions?
>>>
>>> If it helps, I'm running Click source from a couple of weeks back, on a
>>> default 64-bit Fedora Core 13 kernel with preemption enabled
>>> (2.6.33.3-85.fc13.x86_64), Intel Dual-Core CPU.
>>> I start userlevel Click with the following command:
>>> 'click --threads=4 conf/threadtest.click -p 777'
>>>
>>> Cheers
>>>
>>> Beyers
>
> _______________________________________________
> click mailing list
> [email protected]
> https://amsterdam.lcs.mit.edu/mailman/listinfo/click
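
(A closing illustration, connecting the original report to the memory-barrier
fixes mentioned at the top of the thread: the classic way a sleep/wakeup
protocol such as a queue notifier loses a wakeup is store-load reordering
between "I am going to sleep" and the recheck of the queue. The sketch below is
a generic model with hypothetical names (items, consumer_asleep), not Click's
FullNoteQueue code or the actual fix. As written, with C++'s default
sequentially consistent atomics, the recheck at (2) and the test at (B) keep it
safe; if either ordering is weakened -- the effect a missing memory barrier can
have -- both sides can sleep forever, which is the shape of the permanent stall
reported above.)

//////////////////////////////////////////////////////////////
// Generic lost-wakeup sketch. Illustrative only, not Click code.
#include <atomic>
#include <cstdio>
#include <thread>

std::atomic<long> items{0};               // stands in for queue occupancy
std::atomic<bool> consumer_asleep{false}; // stands in for "pull task unscheduled"
const long N = 1000000;

void producer() {                         // push side
    for (long i = 0; i < N; ++i) {
        items.fetch_add(1);               // (A) enqueue a packet
        // (B) wake the puller only if it announced it is asleep. If this load
        // could be performed before the store at (A) became visible to the
        // consumer (store-load reordering, which a missing barrier permits),
        // the producer could skip a wakeup it needed to send.
        if (consumer_asleep.load())
            consumer_asleep.store(false); // wake: reschedule the pull side
    }
}

void consumer() {                         // pull side
    long pulled = 0;
    while (pulled < N) {
        if (items.load() > 0) { items.fetch_sub(1); ++pulled; continue; }
        consumer_asleep.store(true);      // (1) announce we are going to sleep
        // (2) recheck after announcing sleep. This closes the race only if the
        // store at (1) is visible before this load. If (1) and (2) can be
        // reordered, the producer at (B) may read "not asleep" while this
        // recheck reads "empty", and both sides stall for good.
        if (items.load() > 0) { consumer_asleep.store(false); continue; }
        while (consumer_asleep.load())    // sleep until the producer wakes us
            std::this_thread::yield();
    }
    std::printf("pulled %ld packets\n", pulled);
}

int main() {
    std::thread c(consumer), p(producer);
    p.join();
    c.join();
}
//////////////////////////////////////////////////////////////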
