On Wed, 8 Jan 2014 19:51:59 +0100, David Sterba wrote:
On Wed, Jan 08, 2014 at 02:25:02PM +0800, Qu Wenruo wrote:
But according to the dmesg, which is expired now though, I made the
following assumption:
There is a possibility that out of order execution made the "work->wq =
wq" sentence executed behind the "queue_work()" call,
and the "normal_work_helper" is executed in another CPU instantly, which
caused the problem.
This could be the reason, though I still don't see how exactly this
happens.
------
static inline void __btrfs_queue_work(struct __btrfs_workqueue *wq,
struct btrfs_work *work)
{
unsigned long flags;
work->wq = wq;
/* There is no memory barrier ensure the work->wq = wq is executed
before queue_work */
thresh_queue_hook(wq);
if (work->ordered_func) {
spin_lock_irqsave(&wq->list_lock, flags);
list_add_tail(&work->ordered_list, &wq->ordered_list);
spin_unlock_irqrestore(&wq->list_lock, flags);
}
queue_work(wq->normal_wq, &work->normal_work);
}
------
The fix is as easy as just adding a smp_wmb() behind the "work->wq = wq"
sentence, but since I failed to reproduce the bug,
I can't confirm whether the fix is good or not.
I know it's impolite but would you please first try the smp_wmb() first to
confirm the fix?
If itis convenient for you,would you please give some feedback about the
smp_wmb() fix ?
Also incase that smp_wmb() doesn't work, it would be very nice of you to
also try smp_mb() fix.
The smp_wmb barriers have to pair with smp_rmb, which means that
normal_work_helper() has to call smp_rmb at the beginning.
What still puzzles me is why would the barriers are needed at all,
although your example code shows the effects of a delayed store.
Down the call stack, the barrier takes place:
queue_work
queue_work_on(WORK_CPU_UNBOUND)
__queue_work
insert_work
smp_mb
There is no barrier before the work is being processed in
workqueue.c::process_one_work(), but there maybe an implied one.
david
It is right that before processing the work, there is already an
smp_mb() in queue_work.
So the problem may not be the memory barrier things.
It looks like I have to dig deeper even I could not reproduce the bug. :(
BTW, since the original dmesg is expired, would josef upload the dmesg
again?
Also, did david success to reproduce the bug?
Thanks
Qu
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to [email protected]
More majordomo info at http://vger.kernel.org/majordomo-info.html