Thanks for your answer, Xavier.
Let me reply to some points below and ask some more questions.
I hope this discussion remains interesting and relevant to the
other subscribers on this list.
>
> > > I have a simple Java program where 2 threads spin in a tight loop each
> > > grabbing the same lock and releasing it. This is on Linux x86 and has been
> > > tested using GCJ 2.95.1, Blackdown JDK 1.1.7v3 (native threads), and
> > > IBM JDK 1.1.8 (native threads). Note that I am on an SMP system (IBM
> > > Netfinity Dual PIII Xeon).
> > >
> > > When the lock is uncontended, performance is fine: about 2,000 loop
> > > iterations per millisecond. But with more than one thread trying to
> > > grab the lock, performance decreases considerably: down to 25 or 30
> > > iters/millisecond!
>
> I'm not surprised. Given the 1-to-1 implementation model for
> LinuxThread, each time a thread needs to suspend (e.g. because it's
> waiting on a locked mutex), it needs to go through the kernel and
> trigger a context switch. Context switch in the Linux kernel is
> pretty efficient compared with other Unix kernels, but still on the
> order of 20 to 40 microseconds.
Clearly, in the uniprocessor case, LinuxThreads is handicapped by
having to ask the kernel to switch between threads, because of the
1-to-1 implementation model. I fully agree with that.
However, I think the case we're looking at here may be slightly different.
In our case, one thread (A) is running on one CPU, and another thread (B)
is running on another. Thread A holds a lock, thread B wants to acquire it.
Once thread A unlocks the lock, thread B could just go ahead and run
on the other CPU. In this case, *neither* CPU would have to context
switch. It would simply be a system call executed on each CPU.
Am I missing anything here?
>
> User-level threads or two-level threads fare better in these
> circumstances, but are less efficient for doing I/O. Also, Linux
> lacks the kernel support required for proper two-level scheduling.
May I ask what you mean by two-level scheduling?
Do you mean a hybrid threading scheme similar to Solaris's, where
user-level threads are scheduled within the process on top of a smaller
number of kernel-scheduled threads (a.k.a. LWPs)?
Such a model would eliminate the caveat that every thread switch would
incur a full kernel context switch.
Or do you have something else in mind here?
>
> > I believe that this problem may be caused by the lack of proper support
> > for user-level threading in the linux kernel.
> > One thing I'm wondering about is Linux's support for SYSV-style IPC
> > (semop etc.). Those appear to be kernel-supported semaphores. Maybe
> > the problem is not in the lack of kernel support, but could be fixed
> > by using SysV semaphores instead? It would definitely be worth a try.
>
> Using SysV semaphores as a suspend/resume mechanism for threads would
> be no more efficient than the current signal-based implementation:
> you'd still incur a kernel context switch for each suspend and resume
> operation.
Let me point out, though, that your suspend operation currently
incurs two system calls (sigprocmask and sigsuspend); this seems more
expensive than a single semaphore_lock() would be, if it existed in usable form.
Finally, I believe that a certain portion of the cost doesn't necessarily
stem from the context switch itself; it stems from the fact
that on top of the context switch you're going through the signal
delivery path. This adds unnecessary overhead, especially in the case
described above where the two threads run on different CPUs. Couldn't
such overhead be avoided by having a dedicated lock/unlock system call?
>
> Using SysV semaphores as an alternative to mutexes is even worse:
> you'd pay the cost of a system call on each mutex operation, not just
> on those that contend for the mutex.
Is this really true?
Why wouldn't an implementation that uses test-and-set at user level,
and falls back on the kernel's lock/unlock construct only under
contention, work here just as it does in the signal-based implementation?
That said, I completely agree with the other caveats of
SysV IPC. The reason I brought it up was to ask what we
would gain from kernel-supported mutexes, if they could be
provided in a usable form.
>
> The last thing I want to say is that regardless of the thread library,
> high contention on mutexes will result in bad performance for your
> application, especially on multiprocessors (because contention means
> you're not taking advantage of all available parallelism). Thus,
> properly written multi-threaded applications have very low mutex
> contention, and consequently all thread libraries emphasize fast mutex
> operations when there is no contention and don't worry too much about
> performance in the other cases. You need to benchmark real
> applications, not just silly micro-benchmarks, before coming to
> conclusions.
>
I strongly agree with you that reducing contention is a must. However,
let me also point out that we did not encounter the problems described
above in micro-benchmarks, and we are not trying to draw conclusions
from them: we encountered them in real applications (Kaffe, and whatever
Matt was doing with gcj).
So, to a certain extent, the question becomes: what support do we want to
provide for applications that were not designed for SMP, but which we would
like to run on an SMP machine nevertheless?
Structuring programs to reduce contention requires effort, and in some
cases it is very hard. [1] It involves partitioning resources,
which is not always possible. Don't get me wrong: I am aware of
implementation techniques that can reduce contention, such as per-thread
memory allocation arenas, which reduce contention for the memory allocator
lock. I am also willing to tolerate some penalty when somebody
writes an application with a bad locking strategy. On the other hand,
there is already a lot of code with bad locking strategies out there;
take the Java class libraries as an example. [2]
Therefore, I don't believe it is unreasonable for us to think about an
SMP lock implementation that optimizes the handling of contended locks
so that the penalty is less harsh.
- Godmar
[1]
http://gatekeeper.dec.com/pub/DEC/SRC/technical-notes/abstracts/src-tn-1997-016.html
(This paper identified a severe contention point in the Alpha/NT kernel,
back when NT still ran on Alpha.)
[2]
http://www.research.digital.com/SRC/mercator/papers/Java99/final.html