Sorry, "you must do..." may confuse you. What I mean is the CAN driver.

On Wed, Jan 4, 2023 at 2:30 AM Carlos Sanchez
<carlossanc...@geotab.com.invalid> wrote:

> Hi Xiang,
>
> Please note what I describe is not caused by my code using multiple
> threads, but is happening on Nuttx upstream. My code is single threaded,
> but s32k1xx_flexcan driver (and several other Socket CAN drivers as they
> all seem to be derived from the same code base) does some things on the
> thread I call write() from, and some other things on CANWORK work queue
> thread.
> My understanding is that net code is structured so work queue threads are
> used, generally, but in Socket CAN drivers this was "waived" to avoid data
> loss, causing the problem I describe.
>
> Thanks,
>
> Carlos
>
> On Tue, Jan 3, 2023 at 6:56 PM Xiang Xiao <xiaoxiang781...@gmail.com>
> wrote:
>
> > Since tx/rx share the same d_len/d_buf, you must do send/recv in one and
> > only thread(either by system work thread or driver dedicated thread) to
> > avoid the race condition you describe below.
> >
> > On Wed, Jan 4, 2023 at 1:45 AM Carlos Sanchez
> > <carlossanc...@geotab.com.invalid> wrote:
> >
> > > Hi all,
> > >
> > > I am observing an extrange behavior: under heavy-error CAN TX scenario
> > (no
> > > acks so TX fails always), usually after the second call to write() my
> > > writes fail. This is expected as s32k1xx_flexcan has two TX mailboxes
> and
> > > from my understanding of the code there is no other buffering (on this
> > > specific CAN driver at least).
> > >
> > > However, if I enable CAN errors, and depending on runtime sync
> conditions
> > > (basically, if I put a breakpoint on s32k1xx_txpoll) then all the
> writes
> > > after the first silently fail, without really trying to send anything
> (I
> > > see on ESR2 register than second TX mailbox does not really become
> > active).
> > >
> > > After some debugging, I have seen that the CANWORK=LPWORK thread has
> > > scheduled calls to s32k1xx_error_work, overwrites d_buf/d_len to send
> the
> > > error frames in. But TX polling does not always happen in CANWORK
> thread:
> > > it does when it comes from s32k1xx_txdone_work, but not when it comes
> > from
> > > s32k1xx_txavail_work, which, despite the name, is called directly on
> > > s32k1xx_txavail context (which is the application context). What is
> > > happening in my case, is txavail_work->...->devif_poll->can_poll is
> > setting
> > > d_buf/d_len to the packed to be sent, but before s32k1xx_txpoll checks
> it
> > > (due to my breakpoint), s32k1xx_error kicks in, "steals" d_buf/d_len to
> > > setup the error frame and calls can_input. The frame which was set up
> by
> > > the polling sequence gets silently discarded.
> > >
> > > I have tried setting s32k1xx_txavail_work inside CANWORK, but this
> fails
> > > because can_sendmsg checks immediately for non-blocking writes; placing
> > the
> > > polling in the work queue would cause all non-blocking writes to fail.
> > >
> > > Related to this, I also fail to see the arbitration between TX/RX for
> > > d_buf/d_len. From what I see, the same problem I am describing could
> > happen
> > > by s32k1xx_receive "stealing" d_buf/d_len, same as s32k1xx_error is
> doing
> > > in my case. But this is only a thought, I have not observed it.
> > >
> > > A possible clean solution is to use another buffer, but it is complex
> and
> > > would mean losing the direct connection between the write and the HW TX
> > > (which might be useful in general and it is for my use case). A quicker
> > > solution would be for s32k1xx_error to lock the network, forcing it to
> > wait
> > > until txavail_work is done. This would solve my case. My second concern
> > is
> > > more difficult to solve as comments in the code explicitly say RX
> cannot
> > be
> > > delayed to the work queue or CAN frames would be lost.
> > >
> > > Any ideas or anything I might be missing here?
> > >
> > > Thanks,
> > >
> > > Carlos
> > >
> > > --
> > >
> > > Carlos Sanchez (he, him, his)
> > > Geotab
> > >
> > > Embedded Systems Developer Team Lead | Europe
> > >
> > > Visit
> > >
> > > www.geotab.com
> > >
> > > Twitter <https://twitter.com/geotab> | Facebook
> > > <https://www.facebook.com/Geotab> | YouTube
> > > <https://www.youtube.com/user/MyGeotab> | LinkedIn
> > > <https://www.linkedin.com/company/geotab/>
> > >
> >
>
>
> --
>
> Carlos Sanchez (he, him, his)
> Geotab
>
> Embedded Systems Developer Team Lead | Europe
>
> Visit
>
> www.geotab.com
>
> Twitter <https://twitter.com/geotab> | Facebook
> <https://www.facebook.com/Geotab> | YouTube
> <https://www.youtube.com/user/MyGeotab> | LinkedIn
> <https://www.linkedin.com/company/geotab/>
>

Reply via email to