On 9/9/23 08:28, Tim Blechmann wrote:
>>>>>> when we cannot get a tx buffer (`get_a_tx_buf`) `rpmsg_upref_sleepers`
>>>>>> enables tx-complete interrupt.
>>>>>> however if the interrupt is executed after `get_a_tx_buf` and before
>>>>>> `rpmsg_upref_sleepers` we may mis the tx-complete interrupt and sleep
>>>>>> for the full 15 seconds.
>>>>>
>>>>>
>>>>> Is there any reason why your co-processor is unable to release the TX
>>>>> RPMSG
>>>>> buffers for 15 seconds? If not, you should first determine the reason why
>>>>> it is
>>>>> stalled.
>>>>
>>>> Arnaud's concern is valid. If the remote processor can't consume a buffer
>>>> within 15 seconds, something is probably wrong.
>>>>
>>>> That said, I believe your assesment of the situation is correct. *If* the
>>>> TX
>>>> callback is disabled and there is no buffer available, there is a window of
>>>> opportunity between calls to get_a_tx_buf() and rpmsg_upref_sleepers() for
>>>> an
>>>> interrupt to arrive in function rpmsg_send_offchannel_raw().
>>>
>>> the remote processor certainly releases the tx buffer and according to my
>>> tracing the `vring_interrupt` fires immediately before `rpmsg_send` enters
>>> the
>>> `rpmsg_upref_sleepers`.
>>
>>
>> If I well understood your point, the issue occur in following race condition
>>
>> - all the Tx buffers are used
>> - in rpmsg_send_offchannel_raw() function, we try to get a buffer using
>> get_a_tx_buf(vrp) that returns NULL
>> - rpmsg_xmit_done is called as a Tx buffer is released by the remote
>> processor
>> and now free
>> - in rpmsg_send_offchannel_raw() rpmsg_upref_sleepers is called
>>
>> At this point you are nothing happen until 15 second because rpmsg_xmit_done
>> is
>> never called again that would wake up the waitqueue to call get_a_tx_buf()
>
> yes, this is the behaviour that i've been seeing:
>
> virtio emits a few times:
> [ 404.XXXXXX] No more buffers in queue
> and then
> [ 404.588722] remoteproc remoteproc0: vq index 1 is interrupted
> [ 404.597249] virtqueue interrupt with no work for 6d53f13a
Difficult to understand the scheduling with your trace. Could you enable ftrace
to
observe it (mailbox interrupts,rpmsg and virtio functions)
>
>> If yes what is not clear to me is that wait_event_interruptible_timeout()
>> seems
>> to test the condition (so call get_a_tx_buf()) before entering in sleep[1]. A
>> free TX buffer should be found at this step.
>>
>> [1]https://elixir.bootlin.com/linux/latest/source/include/linux/wait.h#L534
>
> hmm, interesting. but that would imply a timing issue where calling
> get_a_tx_buf
> twice somehow makes a difference as opposed to calling it only once.
>
> would the "interrupt with no work" point to a different cause for the issue
> that
> we are seeing?
Look like you have not released the buffer on remote side so no buffer in vring
used list.
Do you use the zero copy (rpmsg_hold_rx_buffer) on remote processor?
If yes, please check that you have following fix:
https://github.com/OpenAMP/open-amp/commit/2b6a38a9c6890869960ed41d6a1dc9930c079e0d
>
>>
>> Regards,
>> Arnaud
>>
>>>
>>> after applying this patch we haven't been able to reproduce the 15s timeout
>>> anymore, whereas before we could easily reproduce it with certain workloads.
>>>
>>>> 3) This patch gets applied when rc1 comes out so that it has 6 or 7 weeks
>>>> to
>>>> soak. No error are locks are reported due to this patch during that time.
>>>
>>> mentioning locks: i was a bit uncertain about a good way to implement the
>>> retry,
>>> since both `rpmsg_upref_sleepers` and `get_a_tx_buf` both acquire the same
>>> mutex. i briefly considered to add `get_a_tx_buf` into
>>> `rpmsg_upref_sleepers` to
>>> avoid locking the same mutex multiple times, though it adds a bit of
>>> complexity
>>> to the implementation and harms readability a bit.
>>> are there any recommendations on this topic or are (likely non-contended)
>>> locks
>>> not expensive enough to justify the added complexity?
>>>
>>> thanks,
>>> tim
>>>
>>>
>>>>
>>>>>
>>>>> Regards,
>>>>> Arnaud
>>>>>
>>>>>>
>>>>>> in this case, so we re-try once before we really start to sleep
>>>>>>
>>>>>> Signed-off-by: Tim Blechmann <[email protected]>
>>>>>> ---
>>>>>> drivers/rpmsg/virtio_rpmsg_bus.c | 24 +++++++++++++++---------
>>>>>> 1 file changed, 15 insertions(+), 9 deletions(-)
>>>>>>
>>>>>> diff --git a/drivers/rpmsg/virtio_rpmsg_bus.c
>>>>>> b/drivers/rpmsg/virtio_rpmsg_bus.c
>>>>>> index 905ac7910c98..2a9d42225e60 100644
>>>>>> --- a/drivers/rpmsg/virtio_rpmsg_bus.c
>>>>>> +++ b/drivers/rpmsg/virtio_rpmsg_bus.c
>>>>>> @@ -587,21 +587,27 @@ static int rpmsg_send_offchannel_raw(struct
>>>>>> rpmsg_device *rpdev,
>>>>>> /* no free buffer ? wait for one (but bail after 15 seconds) */
>>>>>> while (!msg) {
>>>>>> /* enable "tx-complete" interrupts, if not already enabled */
>>>>>> rpmsg_upref_sleepers(vrp);
>>>>>> - /*
>>>>>> - * sleep until a free buffer is available or 15 secs elapse.
>>>>>> - * the timeout period is not configurable because there's
>>>>>> - * little point in asking drivers to specify that.
>>>>>> - * if later this happens to be required, it'd be easy to add.
>>>>>> - */
>>>>>> - err = wait_event_interruptible_timeout(vrp->sendq,
>>>>>> - (msg = get_a_tx_buf(vrp)),
>>>>>> - msecs_to_jiffies(15000));
>>>>>> + /* make sure to retry to grab tx buffer before we start waiting
>>>>>> */
>>>>>> + msg = get_a_tx_buf(vrp);
>>>>>> + if (msg) {
>>>>>> + err = 0;
>>>>>> + } else {
>>>>>> + /*
>>>>>> + * sleep until a free buffer is available or 15 secs elapse.
>>>>>> + * the timeout period is not configurable because there's
>>>>>> + * little point in asking drivers to specify that.
>>>>>> + * if later this happens to be required, it'd be easy to
>>>>>> add.
>>>>>> + */
>>>>>> + err = wait_event_interruptible_timeout(vrp->sendq,
>>>>>> + (msg = get_a_tx_buf(vrp)),
>>>>>> + msecs_to_jiffies(15000));
>>>>>> + }
>>>>>> /* disable "tx-complete" interrupts if we're the last
>>>>>> sleeper */
>>>>>> rpmsg_downref_sleepers(vrp);
>>>>>> /* timeout ? */
>>>>>> if (!err) {
>>>>
>>>
>>