On 9/9/23 08:28, Tim Blechmann wrote:
>>>>>> when we cannot get a tx buffer (`get_a_tx_buf`) `rpmsg_upref_sleepers`
>>>>>> enables tx-complete interrupt.
>>>>>> however if the interrupt is executed after `get_a_tx_buf` and before
>>>>>> `rpmsg_upref_sleepers` we may mis the tx-complete interrupt and sleep
>>>>>> for the full 15 seconds.
>>>>>
>>>>>
>>>>> Is there any reason why your co-processor is unable to release the TX 
>>>>> RPMSG
>>>>> buffers for 15 seconds? If not, you should first determine the reason why
>>>>> it is
>>>>> stalled.
>>>>
>>>> Arnaud's concern is valid.  If the remote processor can't consume a buffer
>>>> within 15 seconds, something is probably wrong.
>>>>
>>>> That said, I believe your assesment of the situation is correct.  *If* the 
>>>> TX
>>>> callback is disabled and there is no buffer available, there is a window of
>>>> opportunity between calls to get_a_tx_buf() and rpmsg_upref_sleepers() for 
>>>> an
>>>> interrupt to arrive in function rpmsg_send_offchannel_raw().
>>>
>>> the remote processor certainly releases the tx buffer and according to my
>>> tracing the `vring_interrupt` fires immediately before `rpmsg_send` enters 
>>> the
>>> `rpmsg_upref_sleepers`.
>>
>>
>> If I well understood your point, the issue occur in following race condition
>>
>> - all the Tx buffers are used
>> - in rpmsg_send_offchannel_raw() function, we try to get a buffer using
>> get_a_tx_buf(vrp) that returns NULL
>> - rpmsg_xmit_done is called as a Tx buffer is released by the remote 
>> processor
>>    and now free
>> - in rpmsg_send_offchannel_raw() rpmsg_upref_sleepers is called
>>
>> At this point you are nothing happen until 15 second because rpmsg_xmit_done 
>> is
>> never called  again that would wake up the waitqueue to call get_a_tx_buf()
> 
> yes, this is the behaviour that i've been seeing:
> 
> virtio emits a few times:
> [  404.XXXXXX] No more buffers in queue
> and then
> [  404.588722] remoteproc remoteproc0: vq index 1 is interrupted
> [  404.597249] virtqueue interrupt with no work for 6d53f13a

Difficult to understand the scheduling with your trace. Could you enable ftrace 
to
observe it (mailbox interrupts,rpmsg and virtio functions)
> 
>> If yes what is not clear to me is that wait_event_interruptible_timeout() 
>> seems
>> to test the condition (so call get_a_tx_buf()) before entering in sleep[1]. A
>> free TX buffer should be found at this step.
>>
>> [1]https://elixir.bootlin.com/linux/latest/source/include/linux/wait.h#L534
> 
> hmm, interesting. but that would imply a timing issue where calling 
> get_a_tx_buf
> twice somehow makes a difference as opposed to calling it only once.
> 
> would the "interrupt with no work" point to a different cause for the issue 
> that
> we are seeing?

Look like you have not released the buffer on remote side so no buffer in vring
used list.

Do you use the zero copy (rpmsg_hold_rx_buffer) on remote processor?
If yes, please check that you have following fix:
https://github.com/OpenAMP/open-amp/commit/2b6a38a9c6890869960ed41d6a1dc9930c079e0d

> 
>>
>> Regards,
>> Arnaud
>>
>>>
>>> after applying this patch we haven't been able to reproduce the 15s timeout
>>> anymore, whereas before we could easily reproduce it with certain workloads.
>>>
>>>> 3) This patch gets applied when rc1 comes out so that it has 6 or 7 weeks 
>>>> to
>>>> soak.  No error are locks are reported due to this patch during that time.
>>>
>>> mentioning locks: i was a bit uncertain about a good way to implement the 
>>> retry,
>>> since both `rpmsg_upref_sleepers` and `get_a_tx_buf` both acquire the same
>>> mutex. i briefly considered to add `get_a_tx_buf` into 
>>> `rpmsg_upref_sleepers` to
>>> avoid locking the same mutex multiple times, though it adds a bit of 
>>> complexity
>>> to the implementation and harms readability a bit.
>>> are there any recommendations on this topic or are (likely non-contended) 
>>> locks
>>> not expensive enough to justify the added complexity?
>>>
>>> thanks,
>>> tim
>>>
>>>
>>>>
>>>>>
>>>>> Regards,
>>>>> Arnaud
>>>>>
>>>>>>
>>>>>> in this case, so we re-try once before we really start to sleep
>>>>>>
>>>>>> Signed-off-by: Tim Blechmann <[email protected]>
>>>>>> ---
>>>>>>    drivers/rpmsg/virtio_rpmsg_bus.c | 24 +++++++++++++++---------
>>>>>>    1 file changed, 15 insertions(+), 9 deletions(-)
>>>>>>
>>>>>> diff --git a/drivers/rpmsg/virtio_rpmsg_bus.c
>>>>>> b/drivers/rpmsg/virtio_rpmsg_bus.c
>>>>>> index 905ac7910c98..2a9d42225e60 100644
>>>>>> --- a/drivers/rpmsg/virtio_rpmsg_bus.c
>>>>>> +++ b/drivers/rpmsg/virtio_rpmsg_bus.c
>>>>>> @@ -587,21 +587,27 @@ static int rpmsg_send_offchannel_raw(struct
>>>>>> rpmsg_device *rpdev,
>>>>>>          /* no free buffer ? wait for one (but bail after 15 seconds) */
>>>>>>        while (!msg) {
>>>>>>            /* enable "tx-complete" interrupts, if not already enabled */
>>>>>>            rpmsg_upref_sleepers(vrp);
>>>>>>    -        /*
>>>>>> -         * sleep until a free buffer is available or 15 secs elapse.
>>>>>> -         * the timeout period is not configurable because there's
>>>>>> -         * little point in asking drivers to specify that.
>>>>>> -         * if later this happens to be required, it'd be easy to add.
>>>>>> -         */
>>>>>> -        err = wait_event_interruptible_timeout(vrp->sendq,
>>>>>> -                    (msg = get_a_tx_buf(vrp)),
>>>>>> -                    msecs_to_jiffies(15000));
>>>>>> +        /* make sure to retry to grab tx buffer before we start waiting 
>>>>>> */
>>>>>> +        msg = get_a_tx_buf(vrp);
>>>>>> +        if (msg) {
>>>>>> +            err = 0;
>>>>>> +        } else {
>>>>>> +            /*
>>>>>> +             * sleep until a free buffer is available or 15 secs elapse.
>>>>>> +             * the timeout period is not configurable because there's
>>>>>> +             * little point in asking drivers to specify that.
>>>>>> +             * if later this happens to be required, it'd be easy to 
>>>>>> add.
>>>>>> +             */
>>>>>> +            err = wait_event_interruptible_timeout(vrp->sendq,
>>>>>> +                        (msg = get_a_tx_buf(vrp)),
>>>>>> +                        msecs_to_jiffies(15000));
>>>>>> +        }
>>>>>>              /* disable "tx-complete" interrupts if we're the last 
>>>>>> sleeper */
>>>>>>            rpmsg_downref_sleepers(vrp);
>>>>>>              /* timeout ? */
>>>>>>            if (!err) {
>>>>
>>>
>>

Reply via email to