Related issue: https://github.com/apache/nuttx/issues/5142.
Not using a work queue to handle CAN RX is basically wrong. Unfortunately
all NXP SocketCAN drivers are affected.

wt., 3 sty 2023 o 19:38 Xiang Xiao <xiaoxiang781...@gmail.com> napisaƂ(a):

>   Sorry, "you must do..." may confuse you. What I mean is the CAN driver.
>
> On Wed, Jan 4, 2023 at 2:30 AM Carlos Sanchez
> <carlossanc...@geotab.com.invalid> wrote:
>
> > Hi Xiang,
> >
> > Please note what I describe is not caused by my code using multiple
> > threads, but is happening on Nuttx upstream. My code is single threaded,
> > but s32k1xx_flexcan driver (and several other Socket CAN drivers as they
> > all seem to be derived from the same code base) does some things on the
> > thread I call write() from, and some other things on CANWORK work queue
> > thread.
> > My understanding is that net code is structured so work queue threads are
> > used, generally, but in Socket CAN drivers this was "waived" to avoid
> data
> > loss, causing the problem I describe.
> >
> > Thanks,
> >
> > Carlos
> >
> > On Tue, Jan 3, 2023 at 6:56 PM Xiang Xiao <xiaoxiang781...@gmail.com>
> > wrote:
> >
> > > Since tx/rx share the same d_len/d_buf, you must do send/recv in one
> and
> > > only thread(either by system work thread or driver dedicated thread) to
> > > avoid the race condition you describe below.
> > >
> > > On Wed, Jan 4, 2023 at 1:45 AM Carlos Sanchez
> > > <carlossanc...@geotab.com.invalid> wrote:
> > >
> > > > Hi all,
> > > >
> > > > I am observing an extrange behavior: under heavy-error CAN TX
> scenario
> > > (no
> > > > acks so TX fails always), usually after the second call to write() my
> > > > writes fail. This is expected as s32k1xx_flexcan has two TX mailboxes
> > and
> > > > from my understanding of the code there is no other buffering (on
> this
> > > > specific CAN driver at least).
> > > >
> > > > However, if I enable CAN errors, and depending on runtime sync
> > conditions
> > > > (basically, if I put a breakpoint on s32k1xx_txpoll) then all the
> > writes
> > > > after the first silently fail, without really trying to send anything
> > (I
> > > > see on ESR2 register than second TX mailbox does not really become
> > > active).
> > > >
> > > > After some debugging, I have seen that the CANWORK=LPWORK thread has
> > > > scheduled calls to s32k1xx_error_work, overwrites d_buf/d_len to send
> > the
> > > > error frames in. But TX polling does not always happen in CANWORK
> > thread:
> > > > it does when it comes from s32k1xx_txdone_work, but not when it comes
> > > from
> > > > s32k1xx_txavail_work, which, despite the name, is called directly on
> > > > s32k1xx_txavail context (which is the application context). What is
> > > > happening in my case, is txavail_work->...->devif_poll->can_poll is
> > > setting
> > > > d_buf/d_len to the packed to be sent, but before s32k1xx_txpoll
> checks
> > it
> > > > (due to my breakpoint), s32k1xx_error kicks in, "steals" d_buf/d_len
> to
> > > > setup the error frame and calls can_input. The frame which was set up
> > by
> > > > the polling sequence gets silently discarded.
> > > >
> > > > I have tried setting s32k1xx_txavail_work inside CANWORK, but this
> > fails
> > > > because can_sendmsg checks immediately for non-blocking writes;
> placing
> > > the
> > > > polling in the work queue would cause all non-blocking writes to
> fail.
> > > >
> > > > Related to this, I also fail to see the arbitration between TX/RX for
> > > > d_buf/d_len. From what I see, the same problem I am describing could
> > > happen
> > > > by s32k1xx_receive "stealing" d_buf/d_len, same as s32k1xx_error is
> > doing
> > > > in my case. But this is only a thought, I have not observed it.
> > > >
> > > > A possible clean solution is to use another buffer, but it is complex
> > and
> > > > would mean losing the direct connection between the write and the HW
> TX
> > > > (which might be useful in general and it is for my use case). A
> quicker
> > > > solution would be for s32k1xx_error to lock the network, forcing it
> to
> > > wait
> > > > until txavail_work is done. This would solve my case. My second
> concern
> > > is
> > > > more difficult to solve as comments in the code explicitly say RX
> > cannot
> > > be
> > > > delayed to the work queue or CAN frames would be lost.
> > > >
> > > > Any ideas or anything I might be missing here?
> > > >
> > > > Thanks,
> > > >
> > > > Carlos
> > > >
> > > > --
> > > >
> > > > Carlos Sanchez (he, him, his)
> > > > Geotab
> > > >
> > > > Embedded Systems Developer Team Lead | Europe
> > > >
> > > > Visit
> > > >
> > > > www.geotab.com
> > > >
> > > > Twitter <https://twitter.com/geotab> | Facebook
> > > > <https://www.facebook.com/Geotab> | YouTube
> > > > <https://www.youtube.com/user/MyGeotab> | LinkedIn
> > > > <https://www.linkedin.com/company/geotab/>
> > > >
> > >
> >
> >
> > --
> >
> > Carlos Sanchez (he, him, his)
> > Geotab
> >
> > Embedded Systems Developer Team Lead | Europe
> >
> > Visit
> >
> > www.geotab.com
> >
> > Twitter <https://twitter.com/geotab> | Facebook
> > <https://www.facebook.com/Geotab> | YouTube
> > <https://www.youtube.com/user/MyGeotab> | LinkedIn
> > <https://www.linkedin.com/company/geotab/>
> >
>

Reply via email to