On Fri, Feb 02, 2024 at 12:11:09PM -0300, Fabiano Rosas wrote:
> Cédric Le Goater <c...@redhat.com> writes:
> 
> > On 2/2/24 15:42, Fabiano Rosas wrote:
> >> Cédric Le Goater <c...@redhat.com> writes:
> >> 
> >>> In case of error, close_return_path_on_source() can perform a shutdown
> >>> to exit the return-path thread.  However, in migrate_fd_cleanup(),
> >>> 'to_dst_file' is closed before calling close_return_path_on_source()
> >>> and the shutdown fails, leaving the source and destination waiting for
> >>> an event to occur.
> >> 
> >> At close_return_path_on_source, qemu_file_shutdown() and checking
> >> ms->to_dst_file are done under the qemu_file_lock, so how could
> >> migrate_fd_cleanup() have cleared the pointer but the ms->to_dst_file
> >> check have passed?
> >
> > This is not a locking issue, it's much simpler. migrate_fd_cleanup()
> > clears the ms->to_dst_file pointer and closes the QEMUFile and then
> > calls close_return_path_on_source() which then tries to use resources
> > which are not available anymore.
> 
> I'm missing something here. Which resources? I assume you're talking
> about this:
> 
>     WITH_QEMU_LOCK_GUARD(&ms->qemu_file_lock) {
>         if (ms->to_dst_file && ms->rp_state.from_dst_file &&
>             qemu_file_get_error(ms->to_dst_file)) {
>             qemu_file_shutdown(ms->rp_state.from_dst_file);
>         }
>     }
> 
> How do we get past the 'if (ms->to_dst_file)'?

We don't; migrate_fd_cleanup() will release ms->to_dst_file, then call
close_return_path_on_source(), found that to_dst_file==NULL and then skip
the shutdown().

One other option might be that we do close_return_path_on_source() before
the chunk of releasing to_dst_file.

This "two qemufiles share the same ioc" issue had bitten us before IIRC,
and the only concern of that workaround is we keep postponing resolution of
the real issue, then we keep getting bitten by it..

Maybe we can wait a few days to see if Dan can join the conversation and if
we can reach a consensus on a complete solution.  Otherwise I think we can
still work this around, but maybe that'll require a comment block
explaining the bits after such movement.

Thanks,

-- 
Peter Xu


Reply via email to