Since tx/rx share the same d_len/d_buf, you must do send/recv in one and
only thread(either by system work thread or driver dedicated thread) to
avoid the race condition you describe below.

On Wed, Jan 4, 2023 at 1:45 AM Carlos Sanchez
<carlossanc...@geotab.com.invalid> wrote:

> Hi all,
>
> I am observing an extrange behavior: under heavy-error CAN TX scenario (no
> acks so TX fails always), usually after the second call to write() my
> writes fail. This is expected as s32k1xx_flexcan has two TX mailboxes and
> from my understanding of the code there is no other buffering (on this
> specific CAN driver at least).
>
> However, if I enable CAN errors, and depending on runtime sync conditions
> (basically, if I put a breakpoint on s32k1xx_txpoll) then all the writes
> after the first silently fail, without really trying to send anything (I
> see on ESR2 register than second TX mailbox does not really become active).
>
> After some debugging, I have seen that the CANWORK=LPWORK thread has
> scheduled calls to s32k1xx_error_work, overwrites d_buf/d_len to send the
> error frames in. But TX polling does not always happen in CANWORK thread:
> it does when it comes from s32k1xx_txdone_work, but not when it comes from
> s32k1xx_txavail_work, which, despite the name, is called directly on
> s32k1xx_txavail context (which is the application context). What is
> happening in my case, is txavail_work->...->devif_poll->can_poll is setting
> d_buf/d_len to the packed to be sent, but before s32k1xx_txpoll checks it
> (due to my breakpoint), s32k1xx_error kicks in, "steals" d_buf/d_len to
> setup the error frame and calls can_input. The frame which was set up by
> the polling sequence gets silently discarded.
>
> I have tried setting s32k1xx_txavail_work inside CANWORK, but this fails
> because can_sendmsg checks immediately for non-blocking writes; placing the
> polling in the work queue would cause all non-blocking writes to fail.
>
> Related to this, I also fail to see the arbitration between TX/RX for
> d_buf/d_len. From what I see, the same problem I am describing could happen
> by s32k1xx_receive "stealing" d_buf/d_len, same as s32k1xx_error is doing
> in my case. But this is only a thought, I have not observed it.
>
> A possible clean solution is to use another buffer, but it is complex and
> would mean losing the direct connection between the write and the HW TX
> (which might be useful in general and it is for my use case). A quicker
> solution would be for s32k1xx_error to lock the network, forcing it to wait
> until txavail_work is done. This would solve my case. My second concern is
> more difficult to solve as comments in the code explicitly say RX cannot be
> delayed to the work queue or CAN frames would be lost.
>
> Any ideas or anything I might be missing here?
>
> Thanks,
>
> Carlos
>
> --
>
> Carlos Sanchez (he, him, his)
> Geotab
>
> Embedded Systems Developer Team Lead | Europe
>
> Visit
>
> www.geotab.com
>
> Twitter <https://twitter.com/geotab> | Facebook
> <https://www.facebook.com/Geotab> | YouTube
> <https://www.youtube.com/user/MyGeotab> | LinkedIn
> <https://www.linkedin.com/company/geotab/>
>

Reply via email to