On Mon, Aug 12, 2019 at 12:38:55PM +0100, Filipe Manana wrote:
> On Tue, Aug 6, 2019 at 6:48 PM Omar Sandoval <osan...@osandov.com> wrote:
> >
> > From: Omar Sandoval <osan...@fb.com>
> >
> > We hit the following very strange deadlock on a system with Btrfs on a
> > loop device backed by another Btrfs filesystem:
> >
> > 1. The top (loop device) filesystem queues an async_cow work item from
> >    cow_file_range_async(). We'll call this work X.
> > 2. Worker thread A starts work X (normal_work_helper()).
> > 3. Worker thread A executes the ordered work for the top filesystem
> >    (run_ordered_work()).
> > 4. Worker thread A finishes the ordered work for work X and frees X
> >    (work->ordered_free()).
> > 5. Worker thread A executes another ordered work and gets blocked on I/O
> >    to the bottom filesystem (still in run_ordered_work()).
> > 6. Meanwhile, the bottom filesystem allocates and queues an async_cow
> >    work item which happens to be the recently-freed X.
> > 7. The workqueue code sees that X is already being executed by worker
> >    thread A, so it schedules X to be executed _after_ worker thread A
> >    finishes (see the find_worker_executing_work() call in
> >    process_one_work()).
> >
> > Now, the top filesystem is waiting for I/O on the bottom filesystem, but
> > the bottom filesystem is waiting for the top filesystem to finish, so we
> > deadlock.
> >
> > This happens because we are breaking the workqueue assumption that a
> > work item cannot be recycled while it still depends on other work. Fix
> > it by waiting to free the work item until we are done with all of the
> > related ordered work.
> >
> > P.S.:
> >
> > One might ask why the workqueue code doesn't try to detect a recycled
> > work item. It actually does try by checking whether the work item has
> > the same work function (find_worker_executing_work()), but in our case
> > the function is the same. This is the only key that the workqueue code
> > has available to compare, short of adding an additional, layer-violating
> > "custom key". Considering that we're the only ones that have ever hit
> > this, we should just play by the rules.
> >
> > Unfortunately, we haven't been able to create a minimal reproducer other
> > than our full container setup using a compress-force=zstd filesystem on
> > top of another compress-force=zstd filesystem.
> >
> > Suggested-by: Tejun Heo <t...@kernel.org>
> > Signed-off-by: Omar Sandoval <osan...@fb.com>
> 
> Reviewed-by: Filipe Manana <fdman...@suse.com>
> 
> Looks good to me, thanks.
> Another variant of the problem Liu fixed back in 2014 (commit
> 9e0af23764344f7f1b68e4eefbe7dc865018b63d).

Good point. I think we can actually get rid of those unique helpers with
this fix. I'll send some followup cleanups.
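
For readers following the quoted steps 4 to 7, here is a minimal user-space sketch of the address-reuse collision and of why deferring the free avoids it. This is not the kernel code or the actual patch: struct work, struct worker, the two-slot pool, looks_like_running() and simulate() are all hypothetical stand-ins, and the pool exists only to make address reuse deterministic. The check is modelled on the description above (same work item address plus same work function).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins, not the kernel's work_struct or worker. */
typedef void (*work_func_t)(void);

struct work {
    work_func_t func;
    bool in_use;
};

struct worker {
    struct work *current_work;  /* address of the item being executed */
    work_func_t current_func;   /* function of the item being executed */
};

/* Two-slot pool that always returns the lowest free slot, so a freed
 * item is handed out again immediately (assumption for the demo). */
static struct work pool[2];

static struct work *work_alloc(work_func_t func)
{
    for (size_t i = 0; i < 2; i++) {
        if (!pool[i].in_use) {
            pool[i].in_use = true;
            pool[i].func = func;
            return &pool[i];
        }
    }
    return NULL;
}

static void work_free(struct work *w)
{
    w->in_use = false;
}

/* Modelled on the check described above: an item is treated as already
 * running if both its address and its function match. */
static bool looks_like_running(const struct worker *wk, const struct work *w)
{
    return wk->current_work == w && wk->current_func == w->func;
}

static void async_cow(void) { }

/* Returns true if the bottom filesystem's new item is mistaken for the
 * item that worker A is still recorded as executing. */
static bool simulate(bool defer_free)
{
    pool[0].in_use = pool[1].in_use = false;

    struct work *x = work_alloc(async_cow);           /* step 1 */
    assert(x != NULL);
    struct worker a = { .current_work = x, .current_func = x->func };

    if (!defer_free)
        work_free(x);      /* step 4: freed while A keeps running */

    struct work *y = work_alloc(async_cow);           /* step 6 */
    assert(y != NULL);
    bool collided = looks_like_running(&a, y);        /* step 7 */

    if (defer_free)
        work_free(x);      /* freed only after the ordered work is done */
    work_free(y);
    return collided;
}
```

With the early free, simulate(false) reports a collision because the new item reuses X's address and has the same function; with the deferred free, simulate(true) does not, because X's slot is still occupied when the new item is allocated.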
