Guys,
Thank you so much for finding this issue!
Kevin, I've checked in your patch. But I wonder whether something else would
be better. I forget, now, why the Linux version of RouterThread uses a
non-recursive spinlock_t rather than a recursive Spinlock. Could someone
who's got a performance testing setup check out whether the system is
appreciably slower with a Spinlock? Something like the attached (untested)?
Eddie
[EMAIL PROTECTED] wrote:
Perfect Joonwoo!
commit 381b5f8ff5fe2a1d7f26ac39b5591b99c4365bae (Write handlers become
EXCLUSIVE by default) made all write handlers exclusive by default. When
click-uninstall writes to /click/config to clear the configuration it is
done with a now exclusive write handler (requiring all threads to be
locked). The action of the write handler then attempts to lock all tasks
again routerthread.cc:633. The write handler action is attempting to grab
a spinlock it already holds -> Deadlock.
The attached patch simply marks the write handler as NONEXCLUSIVE. The
proper fix would be to remove lock_tasks() from
RouterThread::unschedule_router_tasks, if you can guarantee all routes
into the function have already locked the tasks.
Thanks!
Kevin
Hi all,
I bisected tree for this issue, and found following commit is the
first bad commit.
commit a7dd5c93a51b3b5b8cd4a177ef379492c27859bf
click-combine documentation: Mention that RouterLink is fake.
Any idea?
Thanks,
Joonwoo
2008/9/3 <[EMAIL PROTECTED]>:
Hi Eddie,
Sorry for the wait. Today I pulled the latest sources and rebuilt a
vanilla click kernel on a 64-bit 4 core Xeon machine (with the click
e1000
driver installed). The locking seems to be working fine, but
'click-uninstall' is resulting in a lockup. See the attached dumps
(click
was started with click-install -t 3 in both cases). I still see the
lockup
with -t 1.
Click uninstalls without problem with our version from a couple of weeks
back.
Unrelated, but I also saw:
../include/click/timestamp.hh: In constructor
'Timestamp::Timestamp(double)':
../include/click/timestamp.hh:938: warning: converting to 'int64_t' from
'double'
all over when building userlevel click.
Sorry I don't have the time to dive into this further right now.
Thanks!
Kevin
Kevin,
Is there any chance you'd be able to try the newer Spinlock code any
time
soon?
Eddie
[EMAIL PROTECTED] wrote:
Hi,
I think there is a concurrency issue with Spinlocks in linuxmodule
multi-threaded click (running a 2.6.19.2 patched kernel, the
e1000-NAPI
driver and today's trunk). I've put together a temporary patch, but
there
might be further issues. Sorry, I don't have the time to investigate
more.
Problems:
The series of events are:
-Thread A is running on CPU 0 and thread B is running on CPU 1.
-A acquires the Spinlock and executes code in the critical section.
-Linux schedules Thread A to run on CPU 1 and thread B to run on CPU
0.
-B acquires the Spinlock (works because CPU 0 is the owner of the
lock,
not the thread)
-A and B are both operating in the critical section
-(if you're lucky) A releases the Spinlock and generates a "releasing
someone else's lock" message
-(if you're unlucky) Oops
Attached is a test element and a configuration to elicit the behavior.
I
could only replicate the problem when using a polling input source
under
heavy load.
After fixing the above problem I was still seeing two threads inside
the
CS. For some reason the atomic_uint32_t::swap was not working
atomically
as expected. Changing this to atomic_uint32::compare_and_swap solved
the
problem. We're running x86_64 quad core Xeon CPUs.
Solution:
In the patch attached I added a new function click_current_thread()
which
returns a thread_info pointer. Inside of Spinlock I replace
click_current_processor() uses with click_current_thread(). This way
even
if the thread is assigned to another core it will still have the same
thread_info pointer.
The problem with this solution is that it does not work as an index
into
an array nicely and cannot simply replace all uses of
click_current_processor(). There may be other places in the code which
make the same incorrect use of click_current_processor(), but a better
approach will be needed if the value is used to index into an array
(such
as in ReadWriteLock).
Thanks!
Kevin Springborn
------------------------------------------------------------------------
_______________________________________________
click mailing list
[email protected]
https://amsterdam.lcs.mit.edu/mailman/listinfo/click
_______________________________________________
click mailing list
[email protected]
https://amsterdam.lcs.mit.edu/mailman/listinfo/click
diff --git a/include/click/routerthread.hh b/include/click/routerthread.hh
index 48a5487..0ec576e 100644
--- a/include/click/routerthread.hh
+++ b/include/click/routerthread.hh
@@ -99,13 +99,11 @@ class RouterThread
#if CLICK_LINUXMODULE
struct task_struct *_linux_task;
- spinlock_t _lock;
- atomic_uint32_t _task_lock_waiting;
#elif HAVE_MULTITHREAD
click_processor_t _running_processor;
+#endif
Spinlock _lock;
atomic_uint32_t _task_lock_waiting;
-#endif
uint32_t _any_pending;
@@ -283,7 +281,7 @@ RouterThread::lock_tasks()
#if CLICK_LINUXMODULE
if (unlikely(current != _linux_task)) {
_task_lock_waiting++;
- spin_lock(&_lock);
+ _lock.acquire();
_task_lock_waiting--;
}
#elif HAVE_MULTITHREAD
@@ -299,7 +297,7 @@ RouterThread::attempt_lock_tasks()
#if CLICK_LINUXMODULE
if (likely(current == _linux_task))
return true;
- return spin_trylock(&_lock);
+ return _lock.attempt();
#elif HAVE_MULTITHREAD
return _lock.attempt();
#else
@@ -312,7 +310,7 @@ RouterThread::unlock_tasks()
{
#if CLICK_LINUXMODULE
if (unlikely(current != _linux_task))
- spin_unlock(&_lock);
+ _lock.release();
#elif HAVE_MULTITHREAD
_lock.release();
#endif
diff --git a/lib/routerthread.cc b/lib/routerthread.cc
index 6164d9b..53b5bbe 100644
--- a/lib/routerthread.cc
+++ b/lib/routerthread.cc
@@ -85,7 +85,6 @@ RouterThread::RouterThread(Master *m, int id)
#if CLICK_LINUXMODULE
_linux_task = 0;
_task_lock_waiting = 0;
- spin_lock_init(&_lock);
#elif HAVE_MULTITHREAD
_running_processor = click_invalid_processor();
_task_lock_waiting = 0;
@@ -146,7 +145,7 @@ RouterThread::driver_lock_tasks()
#if CLICK_LINUXMODULE
for (int i = 0; _task_lock_waiting > 0 && i < 10; i++)
schedule();
- spin_lock(&_lock);
+ _lock.acquire();
#elif HAVE_MULTITHREAD
# if CLICK_USERLEVEL
for (int i = 0; _task_lock_waiting > 0 && i < 10; i++) {
@@ -161,9 +160,7 @@ RouterThread::driver_lock_tasks()
inline void
RouterThread::driver_unlock_tasks()
{
-#if CLICK_LINUXMODULE
- spin_unlock(&_lock);
-#elif HAVE_MULTITHREAD
+#if CLICK_LINUXMODULE || HAVE_MULTITHREAD
_lock.release();
#endif
}
@@ -605,7 +602,7 @@ RouterThread::driver_once()
#if CLICK_LINUXMODULE
// this task is running the driver
_linux_task = current;
- spin_lock(&_lock);
+ _lock.acquire();
#elif HAVE_MULTITHREAD
_running_processor = click_current_processor();
_lock.acquire();
@@ -619,7 +616,7 @@ RouterThread::driver_once()
splx(s);
#endif
#if CLICK_LINUXMODULE
- spin_unlock(&_lock);
+ _lock.release();
_linux_task = 0;
#elif HAVE_MULTITHREAD
_lock.release();
_______________________________________________
click mailing list
[email protected]
https://amsterdam.lcs.mit.edu/mailman/listinfo/click