On 10/11/23 1:36 PM, Nathan Lynch wrote:
Haren Myneni <ha...@linux.ibm.com> writes:
The hypervisor returns a migration failure if all VAS windows are not
closed. During the pre-migration stage, vas_migration_handler() sets
the migration_in_progress flag and closes all windows from the list.
The allocate VAS window routine checks the migration flag, sets up
the window, and then adds it to the list. So there is a possibility
of the migration handler missing a window that is still in the
process of being set up.

t1: Allocate and open VAS       t2: Migration event
      window

lock vas_pseries_mutex
If migration_in_progress set
    unlock vas_pseries_mutex
    return
open window HCALL
unlock vas_pseries_mutex
Modify window HCALL             lock vas_pseries_mutex
setup window                    migration_in_progress=true
                                Closes all windows from
                                the list
                                unlock vas_pseries_mutex
lock vas_pseries_mutex          return
if nr_closed_windows == 0
    // No DLPAR CPU or migration
    add to the list
    unlock vas_pseries_mutex
    return
unlock vas_pseries_mutex
Close VAS window
// due to DLPAR CPU or migration
return -EBUSY

Could the path t1 takes simply hold the mutex for the duration of
its execution instead of dropping and reacquiring it in the middle?

Here's the relevant code from vas_allocate_window():

        mutex_lock(&vas_pseries_mutex);
        if (migration_in_progress)
                rc = -EBUSY;
        else
                rc = allocate_setup_window(txwin, (u64 *)&domain[0],
                                   cop_feat_caps->win_type);
        mutex_unlock(&vas_pseries_mutex);
        if (rc)
                goto out;

        rc = h_modify_vas_window(txwin);
        if (!rc)
                rc = get_vas_user_win_ref(&txwin->vas_win.task_ref);
        if (rc)
                goto out_free;

        txwin->win_type = cop_feat_caps->win_type;
        mutex_lock(&vas_pseries_mutex);
        if (!caps->nr_close_wins) {
                list_add(&txwin->win_list, &caps->list);
                caps->nr_open_windows++;
                mutex_unlock(&vas_pseries_mutex);
                vas_user_win_add_mm_context(&txwin->vas_win.task_ref);
                return &txwin->vas_win;
        }
        mutex_unlock(&vas_pseries_mutex);

Is there something about h_modify_vas_window() or get_vas_user_win_ref()
that requires temporarily dropping the lock?
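
For illustration, a rough sketch of what I mean (untested; the
out_unlock and out_free_unlock labels are illustrative, not from the
current code). With the lock held throughout, the migration_in_progress
check and the list insertion are atomic with respect to the migration
handler:

        mutex_lock(&vas_pseries_mutex);
        if (migration_in_progress) {
                rc = -EBUSY;
                goto out_unlock;
        }

        rc = allocate_setup_window(txwin, (u64 *)&domain[0],
                                   cop_feat_caps->win_type);
        if (rc)
                goto out_unlock;

        rc = h_modify_vas_window(txwin);
        if (!rc)
                rc = get_vas_user_win_ref(&txwin->vas_win.task_ref);
        if (rc)
                goto out_free_unlock;

        txwin->win_type = cop_feat_caps->win_type;
        list_add(&txwin->win_list, &caps->list);
        caps->nr_open_windows++;
        mutex_unlock(&vas_pseries_mutex);

        vas_user_win_add_mm_context(&txwin->vas_win.task_ref);
        return &txwin->vas_win;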


Thanks Nathan for your comments.

vas_pseries_mutex protects window ID and IRQ allocation between the
allocate and free window HCALLs, as well as the window list. Generally
we try not to use a mutex around HCALLs, but we need this mutex with
these HCALLs.

We could call h_modify_vas_window() or get_vas_user_win_ref() within
the mutex context, but it is not needed.

Hmm. I contend that it would fix your bug in a simpler way that
eliminates the race instead of coping with it by adding more state and
complicating the locking model. With your change, readers of the
migration_in_progress flag check it under the mutex, but the writer
updates it outside of the mutex, which seems strange and unlikely to be
correct.

The migration thread is the only writer that changes the
migration_in_progress flag. The setting of this flag is moved outside
of the mutex in this patch. The window open path is the only reader of
this flag, and it reads the flag while holding the mutex.

Reason for moving the setting outside of the mutex:

Suppose many threads call window open and wait on the mutex, and then
the migration thread starts. In this case the migration thread has to
wait on the mutex until all of the window open threads complete their
open window HCALLs in the hypervisor, and then it has to close all of
these windows immediately. If the flag is set outside of the mutex,
the later open window threads can exit from this function quickly
without opening windows.

Reason for keeping the migration_in_progress check inside the mutex
section even with the above change (setting outside of the mutex):

If the reader threads checked this flag before holding the mutex and
then waited on the mutex, they would end up opening windows which will
be closed anyway by the migration thread. With the check inside the
mutex, the later open threads can return with -EBUSY quickly.
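
Roughly, the ordering in this patch (pseudocode only, not the actual
patch hunks):

        /* Migration thread -- the only writer */
        migration_in_progress = true;  /* set before taking the mutex */
        mutex_lock(&vas_pseries_mutex);
        /* close all windows currently on the list */
        mutex_unlock(&vas_pseries_mutex);

        /* Window open path -- the only reader */
        mutex_lock(&vas_pseries_mutex);
        if (migration_in_progress) {
                mutex_unlock(&vas_pseries_mutex);
                return -EBUSY; /* exit quickly, no window opened */
        }
        /* ... open window HCALL ... */
        mutex_unlock(&vas_pseries_mutex);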



Also, the free HCALL can take longer depending on pending NX requests,
since the hypervisor waits for all requests to be completed before
closing the window.

Applications can issue window open / free calls at the same time, which
can cause mutex contention, especially on large systems where hundreds
of credits are available. Another example: the migration event can wait
longer (or time out) trying to get this mutex if many threads issue
open/free window calls. Hence h_modify_vas_window() (modify HCALL) and
get_vas_user_win_ref() are called outside of the mutex.

OK. I believe you're referring to this code, which can run under the lock:

static long hcall_return_busy_check(long rc)
{
        /* Check if we are stalled for some time */
        if (H_IS_LONG_BUSY(rc)) {
                msleep(get_longbusy_msecs(rc));
                rc = H_BUSY;
        } else if (rc == H_BUSY) {
                cond_resched();
        }

        return rc;
}

...

static int h_deallocate_vas_window(u64 winid)
{
        long rc;

        do {
                rc = plpar_hcall_norets(H_DEALLOCATE_VAS_WINDOW, winid);

                rc = hcall_return_busy_check(rc);
        } while (rc == H_BUSY);

...

[ with similar loops in the window allocate and modify functions ]

If get_longbusy_msecs() typically returns low values (<20), then you
should prefer usleep_range() over msleep() to avoid over-sleeping. For
example, msleep(1) can schedule away for much longer than 1ms -- often
more like 20ms.
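
As a sketch (untested), the long-busy path could look something like
this:

        if (H_IS_LONG_BUSY(rc)) {
                unsigned int ms = get_longbusy_msecs(rc);

                /*
                 * usleep_range() uses hrtimers, so short sleeps are
                 * not rounded up to whole jiffies the way msleep()
                 * rounds them; the range gives the scheduler slack
                 * to coalesce wakeups.
                 */
                usleep_range(ms * USEC_PER_MSEC,
                             (ms + 1) * USEC_PER_MSEC);
                rc = H_BUSY;
        } else if (rc == H_BUSY) {
                cond_resched();
        }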

If mutex hold times have been a problem in this code, then it's probably
worth improving the way it handles the busy/retry statuses under the
lock.


It is not just that the mutex hold time gets longer under contention;
a large mutex section also affects application performance. An
application thread can make several window open and close calls, in
the worst case one pair for each NX request, and someone may run a
similar workload on BML for performance reasons (if windows are not
distributed across all VAS engines). In these kinds of scenarios, when
multiple threads are opening windows, a large mutex section could
become a performance bottleneck. So, to have flexibility, I created
two sections: one for window allocation (ID and IRQ) and the other to
protect the window list.

Thanks
Haren
