Guys,

Thank you so much for finding this issue!

Kevin, I've checked in your patch. But I wonder whether something else would be better. I forget, now, why the Linux version of RouterThread uses a non-recursive spinlock_t rather than a recursive Spinlock. Could someone who's got a performance testing setup check out whether the system is appreciably slower with a Spinlock? Something like the attached (untested)?

Eddie


[EMAIL PROTECTED] wrote:
Perfect Joonwoo!

commit 381b5f8ff5fe2a1d7f26ac39b5591b99c4365bae (Write handlers become
EXCLUSIVE by default) made all write handlers exclusive by default. When
click-uninstall writes to /click/config to clear the configuration it is
done with a now exclusive write handler (requiring all threads to be
locked). The action of the write handler then attempts to lock all tasks
again routerthread.cc:633. The write handler action is attempting to grab
a spinlock it already holds -> Deadlock.

The attached patch simply marks the write handler as NONEXCLUSIVE. The
proper fix would be to remove lock_tasks() from
RouterThread::unschedule_router_tasks,  if you can guarantee all routes
into the function have already locked the tasks.

Thanks!
Kevin



Hi all,

I bisected tree for this issue, and found following commit is the
first bad commit.

 commit a7dd5c93a51b3b5b8cd4a177ef379492c27859bf
 click-combine documentation: Mention that RouterLink is fake.

Any idea?

Thanks,
Joonwoo

2008/9/3  <[EMAIL PROTECTED]>:
Hi Eddie,

Sorry for the wait. Today I pulled the latest sources and rebuilt a
vanilla click kernel on a 64-bit 4 core Xeon machine (with the click
e1000
driver installed). The locking seems to be working fine, but
'click-uninstall' is resulting in a lockup. See the attached dumps
(click
was started with click-install -t 3 in both cases). I still see the
lockup
with -t 1.

Click uninstalls without problem with our version from a couple of weeks
back.

Unrelated, but I also saw:
../include/click/timestamp.hh: In constructor
'Timestamp::Timestamp(double)':
../include/click/timestamp.hh:938: warning: converting to 'int64_t' from
'double'
all over when building userlevel click.

Sorry I don't have the time to dive into this further right now.

Thanks!
Kevin




Kevin,

Is there any chance you'd be able to try the newer Spinlock code any
time
soon?

Eddie


[EMAIL PROTECTED] wrote:
Hi,

I think there is a concurrency issue with Spinlocks in linuxmodule
multi-threaded click (running a 2.6.19.2 patched kernel, the
e1000-NAPI
driver and today's trunk). I've put together a temporary patch, but
there
might be further issues. Sorry, I don't have the time to investigate
more.


Problems:
The series of events are:
-Thread A is running on CPU 0 and thread B is running on CPU 1.
-A acquires the Spinlock and executes code in the critical section.
-Linux schedules Thread A to run on CPU 1 and thread B to run on CPU
0.
-B acquires the Spinlock (works because CPU 0 is the owner of the
lock,
not the thread)
-A and B are both operating in the critical section
-(if you're lucky) A releases the Spinlock and generates a "releasing
someone else's lock" message
-(if you're unlucky) Oops

Attached is a test element and a configuration to elicit the behavior.
I
could only replicate the problem when using a polling input source
under
heavy load.

After fixing the above problem I was still seeing two threads inside
the
CS. For some reason the atomic_uint32_t::swap was not working
atomically
as expected. Changing this to atomic_uint32::compare_and_swap solved
the
problem. We're running x86_64 quad core Xeon CPUs.


Solution:
In the patch attached I added a new function click_current_thread()
which
returns a thread_info pointer. Inside of Spinlock I replace
click_current_processor() uses with click_current_thread(). This way
even
if the thread is assigned to another core it will still have the same
thread_info pointer.

The problem with this solution is that it does not work as an index
into
an array nicely and cannot simply replace all uses of
click_current_processor(). There may be other places in the code which
make the same incorrect use of click_current_processor(), but a better
approach will be needed if the value is used to index into an array
(such
as in ReadWriteLock).


Thanks!
Kevin Springborn


------------------------------------------------------------------------

_______________________________________________
click mailing list
[email protected]
https://amsterdam.lcs.mit.edu/mailman/listinfo/click
_______________________________________________
click mailing list
[email protected]
https://amsterdam.lcs.mit.edu/mailman/listinfo/click


diff --git a/include/click/routerthread.hh b/include/click/routerthread.hh
index 48a5487..0ec576e 100644
--- a/include/click/routerthread.hh
+++ b/include/click/routerthread.hh
@@ -99,13 +99,11 @@ class RouterThread
 
 #if CLICK_LINUXMODULE
     struct task_struct *_linux_task;
-    spinlock_t _lock;
-    atomic_uint32_t _task_lock_waiting;
 #elif HAVE_MULTITHREAD
     click_processor_t _running_processor;
+#endif
     Spinlock _lock;
     atomic_uint32_t _task_lock_waiting;
-#endif
     
     uint32_t _any_pending;
 
@@ -283,7 +281,7 @@ RouterThread::lock_tasks()
 #if CLICK_LINUXMODULE
     if (unlikely(current != _linux_task)) {
 	_task_lock_waiting++;
-	spin_lock(&_lock);
+	_lock.acquire();
 	_task_lock_waiting--;
     }
 #elif HAVE_MULTITHREAD
@@ -299,7 +297,7 @@ RouterThread::attempt_lock_tasks()
 #if CLICK_LINUXMODULE
     if (likely(current == _linux_task))
 	return true;
-    return spin_trylock(&_lock);
+    return _lock.attempt();
 #elif HAVE_MULTITHREAD
     return _lock.attempt();
 #else
@@ -312,7 +310,7 @@ RouterThread::unlock_tasks()
 {
 #if CLICK_LINUXMODULE
     if (unlikely(current != _linux_task))
-	spin_unlock(&_lock);
+	_lock.release();
 #elif HAVE_MULTITHREAD
     _lock.release();
 #endif
diff --git a/lib/routerthread.cc b/lib/routerthread.cc
index 6164d9b..53b5bbe 100644
--- a/lib/routerthread.cc
+++ b/lib/routerthread.cc
@@ -85,7 +85,6 @@ RouterThread::RouterThread(Master *m, int id)
 #if CLICK_LINUXMODULE
     _linux_task = 0;
     _task_lock_waiting = 0;
-    spin_lock_init(&_lock);
 #elif HAVE_MULTITHREAD
     _running_processor = click_invalid_processor();
     _task_lock_waiting = 0;
@@ -146,7 +145,7 @@ RouterThread::driver_lock_tasks()
 #if CLICK_LINUXMODULE
     for (int i = 0; _task_lock_waiting > 0 && i < 10; i++)
 	schedule();
-    spin_lock(&_lock);
+    _lock.acquire();
 #elif HAVE_MULTITHREAD
 # if CLICK_USERLEVEL
     for (int i = 0; _task_lock_waiting > 0 && i < 10; i++) {
@@ -161,9 +160,7 @@ RouterThread::driver_lock_tasks()
 inline void
 RouterThread::driver_unlock_tasks()
 {
-#if CLICK_LINUXMODULE
-    spin_unlock(&_lock);
-#elif HAVE_MULTITHREAD
+#if CLICK_LINUXMODULE || HAVE_MULTITHREAD
     _lock.release();
 #endif
 }
@@ -605,7 +602,7 @@ RouterThread::driver_once()
 #if CLICK_LINUXMODULE
     // this task is running the driver
     _linux_task = current;
-    spin_lock(&_lock);
+    _lock.acquire();
 #elif HAVE_MULTITHREAD
     _running_processor = click_current_processor();
     _lock.acquire();
@@ -619,7 +616,7 @@ RouterThread::driver_once()
     splx(s);
 #endif
 #if CLICK_LINUXMODULE
-    spin_unlock(&_lock);
+    _lock.release();
     _linux_task = 0;
 #elif HAVE_MULTITHREAD
     _lock.release();
_______________________________________________
click mailing list
[email protected]
https://amsterdam.lcs.mit.edu/mailman/listinfo/click

Reply via email to