I see two issues:

1.)    The synchronous calls won't execute the progress function once the they 
decide to sleep, so FI_PROGRESS_MANUAL does not work. This is absolutely true 
in util_cq, but not completely so in util_cntr since it sleeps in 50 ms 
intervals, but the 50 ms would be an unacceptable latency in any real use. 
Using the cntr wait loop as an example:

do {
                cntr->progress(cntr);
                if (threshold <= ofi_atomic_get64(&cntr->cnt))
                        return FI_SUCCESS;

                if (errcnt != ofi_atomic_get64(&cntr->err))
                        return -FI_EAVAIL;

                if (ofi_adjust_timeout(endtime, &timeout))
                        return -FI_ETIMEDOUT;

                /*
                 * Temporary work-around to avoid a thread hanging in underlying
                 * epoll_wait called from fi_wait. This can happen if one thread
                 * updates the counter, another thread reads it (thereby 
resetting
                 * cntr signal fd) and the current thread is about to wait. The
                 * current thread would never wake up and doesn't know the 
counter
                 * has been updated. Fix it by checking counter state every now
                 * and then instead of waiting for a longer period. This does
                 * have the overhead of threads waking up unnecessarily.
                 */
                 */
                timeout_quantum = (timeout < 0 ? OFI_TIMEOUT_QUANTUM_MS :
                                   MIN(OFI_TIMEOUT_QUANTUM_MS, timeout));

                ret = fi_wait(&cntr->wait->wait_fid, timeout_quantum);
     } while (!ret || (ret == -FI_ETIMEDOUT &&
              (timeout < 0 || timeout_quantum < timeout)));

Setting the timeout_quantum to 0 in the case cntr->domain->data_progress == 
FI_PROGRESS_MANUAL seems to be a quick fix, but it might be wiser to allow a 
wait function to be specified as part of the util_cntr/cq and let it deal with 
this.


2.)    ofi_cq_read() only calls the progress function if the cq is empty or the 
count is zero.


        if (ofi_cirque_isempty(cq->cirq) || !count) {
                cq->cq_fastlock_release(&cq->cq_lock);
                cq->progress(cq);
                cq->cq_fastlock_acquire(&cq->cq_lock);
                if (ofi_cirque_isempty(cq->cirq)) {
                        i = -FI_EAGAIN;
                        goto out;
                }
        }

While I actually like idea of trying to reduce gratuitous progress calls,  this 
caused some code I had to break. I was doing a fi_recv(FI_PEEK) and getting 
ENOMSG returned immediately in the cq, because the queue was never empty, the 
progress function was never called and my loop hung indefinitely. I believe the 
progress function must be called unconditionally.

The util_cntr code does not have this issue, progress is called unconditionally 
in ofi_cntr_read().


John Byrne







_______________________________________________
ofiwg mailing list
[email protected]
https://lists.openfabrics.org/mailman/listinfo/ofiwg

Reply via email to