Hi,

Matt Welsh wrote:
> 
> 
> Hi folks,
> 
> I have a simple Java program where 2 threads spin in a tight loop each 
> grabbing the same lock and releasing it. This is on Linux x86 and has been
> tested using GCJ 2.95.1, Blackdown JDK 1.1.7v3 (native threads), and 
> IBM JDK 1.1.8 (native threads). Note that I am on an SMP system (IBM
> Netfinity Dual PIII Xeon). 
> 
> When the lock is uncontended, performance is fine: about 2,000 loop 
> iterations per millisecond. But with more than one thread trying to 
> grab the lock, performance decreases considerably: down to 25 or 30 
> iters/millisecond! 
> 

I believe that this problem may be caused by the lack of proper
support for user-level threading in the Linux kernel.

Consider what linux-threads has to go through for a contended lock
(this code is from linux-threads 0.71):

int pthread_mutex_lock(pthread_mutex_t * mutex)
{
  pthread_descr self;
  while(1) {
    acquire(&mutex->m_spinlock);
    switch(mutex->m_kind) {
    case PTHREAD_MUTEX_FAST_NP:
      if (mutex->m_count == 0) {
        /* Uncontended fast path: take the lock and return */
        mutex->m_count = 1;
        release(&mutex->m_spinlock);
        return 0;
      }
      self = thread_self();
      break;
      /* ... other mutex kinds elided ... */
    }
//////// CONTENDED CASE ////////////
    /* Suspend ourselves, then try again */
    enqueue(&mutex->m_waiting, self);
    release(&mutex->m_spinlock);
    suspend(self); /* This is not a cancellation point */
  }
}

acquire is implemented essentially as

  while (!test_and_set()) __sched_yield();

(spin, yielding the CPU, until the test-and-set succeeds), which is
usually uncontended and *not* the source of the slowdown.

Now, suspend is implemented like so:

static inline void suspend(pthread_descr self)
{
  sigset_t mask;

  sigprocmask(SIG_SETMASK, NULL, &mask); /* Get current signal mask */
  sigdelset(&mask, PTHREAD_SIG_RESTART); /* Unblock the restart signal */
  do {
    self->p_signal = 0;
    sigsuspend(&mask);                   /* Wait for signal */
  } while (self->p_signal != PTHREAD_SIG_RESTART);
}

In other words, every contended case implies that the two processes
exchange user-level signals.   (linux-threads uses SIGUSR1/SIGUSR2.)
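
The wake-up side is the mirror image.  Paraphrasing the corresponding
unlock path from the same source (from memory; the non-FAST mutex kinds
and error checks are elided, and dequeue is the counterpart of the
enqueue above):

static inline void restart(pthread_descr th)
{
  kill(th->p_pid, PTHREAD_SIG_RESTART); /* make the waiter's sigsuspend() return */
}

int pthread_mutex_unlock(pthread_mutex_t * mutex)
{
  pthread_descr th;

  acquire(&mutex->m_spinlock);
  mutex->m_count = 0;
  th = dequeue(&mutex->m_waiting);      /* oldest suspended waiter, if any */
  release(&mutex->m_spinlock);
  if (th != NULL)
    restart(th);                        /* signal delivery + reschedule */
  return 0;
}

So every contended lock handoff costs a kill()/sigsuspend() round trip
through the kernel's signal machinery, plus the attendant context
switches.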

What kind of performance can one expect with such a scheme?

On my dual PII/350 machine, running Matt's program, I see ca. 600
iters/ms in the contended case vs. 2000 iters/ms uncontended.

If I limit each thread to 50 measurements, I see the following
total runtimes as reported by time:
1 thread case:
2.470u 0.000s 0:02.47 100.0%    0+0k 0+0io 99pf+0w
2 thread case:
5.780u 6.540s 0:07.99 154.1%    0+0k 0+0io 99pf+0w

Note the 6.5 seconds spent in system code, plus the additional 0.8
seconds or so of user time beyond the ~4.9 seconds one would expect
for twice the work (presumably spent in signal-handling-related code).
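
For anyone who wants to poke at this without a JVM, the measurement
loop boils down to something like the following C program (my
reconstruction, not Matt's code; thread and iteration counts are
arbitrary; link with -lpthread):

#include <pthread.h>
#include <stdio.h>
#include <sys/time.h>

#define NTHREADS 2        /* 1 = uncontended, 2 = contended */
#define ITERS    1000000

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg)
{
  int i;
  (void)arg;
  for (i = 0; i < ITERS; i++) {
    pthread_mutex_lock(&lock);    /* grab the shared lock ... */
    pthread_mutex_unlock(&lock);  /* ... and release it immediately */
  }
  return NULL;
}

int main(void)
{
  pthread_t th[NTHREADS];
  struct timeval t0, t1;
  double ms;
  int i;

  gettimeofday(&t0, NULL);
  for (i = 0; i < NTHREADS; i++)
    pthread_create(&th[i], NULL, worker, NULL);
  for (i = 0; i < NTHREADS; i++)
    pthread_join(th[i], NULL);
  gettimeofday(&t1, NULL);

  ms = (t1.tv_sec - t0.tv_sec) * 1000.0
     + (t1.tv_usec - t0.tv_usec) / 1000.0;
  printf("%.0f iters/ms\n", (double)ITERS * NTHREADS / ms);
  return 0;
}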

One thing I'm wondering about is Linux's support for SysV-style IPC
(semop etc.).  Those appear to be kernel-supported semaphores.  Maybe
the needed kernel support is there after all, and the problem could be
fixed by having the thread library use SysV semaphores instead?  It
would definitely be worth a try.
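
A mutex built on a SysV semaphore would look roughly like this (a
sketch only, error handling elided; the kernel blocks and wakes waiters
directly, with no signal round trip):

#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/sem.h>

/* Per the standard, the application must define this itself. */
union semun { int val; struct semid_ds *buf; unsigned short *array; };

static int sem;

static void sem_mutex_init(void)
{
  union semun arg;
  sem = semget(IPC_PRIVATE, 1, IPC_CREAT | 0600);
  arg.val = 1;                      /* 1 == unlocked */
  semctl(sem, 0, SETVAL, arg);
}

static void sem_mutex_lock(void)
{
  struct sembuf op = { 0, -1, 0 };  /* P: decrement, kernel blocks at 0 */
  semop(sem, &op, 1);
}

static void sem_mutex_unlock(void)
{
  struct sembuf op = { 0, 1, 0 };   /* V: increment, kernel wakes a waiter */
  semop(sem, &op, 1);
}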

On the other hand, aren't SysV semaphores optional when you configure
your kernel?  If so, there may be reason not to hardcode reliance on
them in a glibc binary distribution.  glibc could always test for them
at run time, though.
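
Something like the following one-time probe would do, assuming a
kernel built without SysV IPC fails the call with ENOSYS:

#include <errno.h>
#include <sys/ipc.h>
#include <sys/sem.h>

/* Returns nonzero if the running kernel supports SysV semaphores. */
static int have_sysv_sem(void)
{
  int id = semget(IPC_PRIVATE, 1, IPC_CREAT | 0600);
  if (id == -1)
    return errno != ENOSYS;   /* other errors mean IPC exists but failed */
  semctl(id, 0, IPC_RMID, 0); /* remove the probe semaphore again */
  return 1;
}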

I believe Xavier Leroy and/or the glibc people may be the right folks
to ask.  I've cc'd Xavier on this mail; maybe he has some insight as
to whether using semop would be an option.

I should mention that this problem also affects the native-threads
version of kaffe, where we are seeing similar slowdowns for highly
contended locks.

        - Godmar

