It is wrong to release the mutex (unlock) before the wait. The wait is atomically releasing the lock, wait on the condition to be signalled than acquire it back for the current thread before returning.
_____ From: Andrija Petrovic [mailto:[EMAIL PROTECTED] Sent: 16 April 2007 18:19 To: devel Devel Subject: Re: Race condition in list_consume() Perhaps I'm missing the point here, but I'm not sure why B should lock() the list and then wait for somebody else to alter the list through gwlist_remove_producer(). It sound perfectly reasonable to me to lock() and later unlock() the list in gwlist_remove_producer(). Furthermore, putting unlock as the first line in gwlist_consume does not sound logical to me. So, I see no need to change anything in gwlib. AFAIK, the whole problem arises from problematic usage of gwlib by B. Correct me if I'm wrong, but, perhaps... B does: lock(list); /* atomic lock */ list->single_operation_lock->owner = -1; unlock(list); /*let the list be altered from some other thread, and then wait for that alteration*/ pthread_cond_wait(&list->nonempty, &list->single_operation_lock->mutex); There is, however, the basic question behind the whole problem: why is list->single_operation_lock->owner set to -1? If that change was the point of lock()ing and unlock()ing the list, perhaps a change in software design could help... Andrija Petrovic Andreas Fink wrote: We found a severe bug in gwlib. We have the following scenario: A calls debug("xxx",0,"xxxx") which does : gwlist_add_producer(writers); and continues but doesnt reach yet this line: gwlist_remove_producer(writers); at this point the list "writers" is empty but has writers->num_producers=1 B does: lock(list); /* atomic lock */ list->single_operation_lock->owner = -1; pthread_cond_wait(&list->nonempty, &list->single_operation_lock->mutex); so it waits until A is calling gwlist_remove_producer() and wait until A completes. Now A is calling this: void gwlist_remove_producer(List *list) { lock(list); gw_assert(list->num_producers > 0); --list->num_producers; pthread_cond_broadcast(&list->nonempty); unlock(list); } and gets locked up because the list's atomic lock is locked by B. C now has a new debug message and gets stopped at gwlist_produce(). In other words, every process who wants to write to debug log gets stuck. Now there is different solutions to this. Our approach would be to do in gwlist_consume() to do this: unlock(list); pthread_cond_wait(&list->nonempty, &list->single_operation_lock->mutex); lock(list); Any other ideas? maybe no atomic lock around gwlist_remove_producer() ? Andreas Fink Fink Consulting GmbH Global Networks Schweiz AG BebbiCell AG --------------------------------------------------------------- Tel: +41-61-6666330 Fax: +41-61-6666331 Mobile: +41-79-2457333 Address: Clarastrasse 3, 4058 Basel, Switzerland E-Mail: [EMAIL PROTECTED] www.finkconsulting.com www.global-networks.ch www.bebbicell.ch --------------------------------------------------------------- ICQ: 8239353 MSN: [EMAIL PROTECTED] AIM: smsrelay Skype: andreasfink Yahoo: finkconsulting SMS: +41792457333
