Hello Peter,

Today, close_return_path_on_source() can perform a shutdown to exit
the return-path thread if an error occured. However, migrate_fd_cleanup()
does cleanups too early and the shutdown in close_return_path_on_source()
fails, leaving the source and destination waiting for an event to occur.

This little series tries to fix that. Comments welcome !

One thing I do agree is that relying on qemu_file_get_error(to_dst_file) in
close_return_path_on_source() is weird: IMHO we have better way to detect
"whether the migration has error" now, which is migrate_has_error().

ok. migrate_has_error() looks safe to use in that case. It works fine
with all the prereq VFIO cleanups (that I didn't send yet) and errors
in the setup of dirty tracking are reported correctly to the migration
core subsystem.

For this specific issue, I think one long standing issue that might be
relevant is we have two QEMUFile (from_dst_file, to_dst_file) that share
the same QIOChannel now.  Logically the two QEMUFile should be able to be
managed separately, say, close() of to_dst_file shouldn't affect the other.

However I don't think it's the case now, as qemu_fclose(to_dst_file) will
do qio_channel_close() already, which means there will be a side effect to
the other QEMUFile that its backing IOC is already closed.

Is this the issue we're facing?

Yes. The socket is closed before calling close_return_path_on_source()
and ms->rp_state.from_dst_file becomes invalid, the shutdown silently
fails (we should maybe report error in qemu_file_shutdown()) and the
return-path thread does not exits.

IOW, the close() of to_dst_file will not
properly kick the other thread who is blocked at reading from_dst_file,
while the shutdown() will kick it out?

Yes, that's how I understand the comment :

    /*
     * If this is a normal exit then the destination will send a SHUT
     * and the rp_thread will exit, however if there's an error we
     * need to cause it to exit. shutdown(2), if we have it, will
     * cause it to unblock if it's stuck waiting for the destination.
     */

If so, not sure whether we can somehow relay the real qio_channel_close()
to until the last user releases it? IOW, conditionally close() the channel> in 
qio_channel_finalize(), if the channel is still open?  Would that make
sense?
It's the first time that I look at this code :/ I can't tell. Here is
the closing section :

        qemu_mutex_unlock(&s->qemu_file_lock);
        /*
         * Close the file handle without the lock to make sure the
         * critical section won't block for long.
         */
        migration_ioc_unregister_yank_from_file(tmp);
        qemu_fclose(tmp);
    }


Thanks,

C.


Reply via email to