Hi all, I am observing a strange behavior: in a heavy-error CAN TX scenario (no ACKs, so TX always fails), my writes usually start failing after the second call to write(). This is expected, as s32k1xx_flexcan has two TX mailboxes and, from my understanding of the code, there is no other buffering (in this specific CAN driver at least).
However, if I enable CAN errors, and depending on runtime timing conditions (basically, if I put a breakpoint on s32k1xx_txpoll), then all the writes after the first fail silently, without really trying to send anything (I can see in the ESR2 register that the second TX mailbox never really becomes active).
After some debugging, I have seen that the CANWORK (= LPWORK) thread has scheduled calls to s32k1xx_error_work, which overwrites d_buf/d_len in order to pass the error frames up to the stack. But TX polling does not always happen in the CANWORK thread: it does when it comes from s32k1xx_txdone_work, but not when it comes from s32k1xx_txavail_work, which, despite the name, is called directly in the s32k1xx_txavail context (which is the application context).
What is happening in my case is that txavail_work->...->devif_poll->can_poll sets d_buf/d_len to the packet to be sent, but before s32k1xx_txpoll checks it (because of my breakpoint), s32k1xx_error kicks in, "steals" d_buf/d_len to set up the error frame and calls can_input. The frame that was set up by the polling sequence gets silently discarded.
I have tried running s32k1xx_txavail_work inside CANWORK, but this fails because can_sendmsg checks immediately in the case of non-blocking writes, so deferring the polling to the work queue would cause all non-blocking writes to fail.
Related to this, I also fail to see any arbitration between TX and RX for d_buf/d_len. From what I see, the same problem I am describing could be caused by s32k1xx_receive "stealing" d_buf/d_len, just as s32k1xx_error is doing in my case. But this is only a thought; I have not observed it.
A possible clean solution is to use another buffer, but that is complex and would mean losing the direct connection between the write and the HW TX (which might be useful in general, and is for my use case). A quicker solution would be for s32k1xx_error to lock the network, forcing it to wait until txavail_work is done. This would solve my case.
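To make the interleaving concrete, here is a minimal single-threaded model of it in plain C. It is only a sketch, not driver code: every `model_*` name and the `locked` flag are hypothetical stand-ins, and it assumes that the txavail path already holds the network lock while polling (which the quick fix above relies on). A real worker would block on the lock; the model represents that by refusing to run the error step while the lock is held and replaying it afterwards.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the shared d_buf/d_len pair of the device. */
struct model_dev
{
  const char *d_buf;   /* frame currently staged in the shared buffer */
  size_t      d_len;   /* 0 means nothing is staged */
  bool        locked;  /* models the network lock being held */
};

static const char g_app_frame[] = "APP";  /* frame from the application */
static const char g_err_frame[] = "ERR";  /* error frame from the driver */

/* Models can_poll(): the write path stages the application frame. */
static void model_poll_stage(struct model_dev *dev)
{
  dev->d_buf = g_app_frame;
  dev->d_len = sizeof(g_app_frame);
}

/* Models the error work: it reuses d_buf/d_len for the error frame.
 * With use_lock set it does not run while the lock is held (a real
 * worker would block instead). Returns true if it actually ran.
 */
static bool model_error_work(struct model_dev *dev, bool use_lock)
{
  assert(dev != NULL);

  if (use_lock && dev->locked)
    {
      return false;
    }

  dev->d_buf = g_err_frame;
  dev->d_len = sizeof(g_err_frame);

  /* The stack would consume the error frame here. Assumption: the
   * buffer is left empty afterwards; the staged frame is gone anyway.
   */
  dev->d_len = 0;
  return true;
}

/* Models the txpoll check: is the application frame still staged? */
static bool model_txpoll(const struct model_dev *dev)
{
  return dev->d_len > 0 && dev->d_buf == g_app_frame;
}

/* Replays the sequence: stage, error work fires, txpoll checks.
 * Returns true if the application frame reached the txpoll step.
 */
bool model_write_survives(bool use_lock)
{
  struct model_dev dev = { NULL, 0, false };
  bool deferred;
  bool sent;

  dev.locked = true;                    /* txavail path takes the lock */
  model_poll_stage(&dev);
  deferred = !model_error_work(&dev, use_lock);  /* fires mid-poll */
  sent = model_txpoll(&dev);
  dev.d_len = 0;                        /* txpoll consumed the buffer */
  dev.locked = false;                   /* txavail path releases it */

  if (deferred)
    {
      model_error_work(&dev, use_lock); /* error frame delivered late */
    }

  return sent;
}
```

Calling `model_write_survives(false)` replays the failure (the staged frame is gone by the time the txpoll step looks at it), while `model_write_survives(true)` lets the frame through and delivers the error frame once the lock is released. The model says nothing about the RX path, which cannot be deferred in the same way.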
My second concern is more difficult to solve, as comments in the code explicitly say that RX cannot be deferred to the work queue or CAN frames would be lost. Any ideas, or anything I might be missing here?
Thanks,
Carlos
--
Carlos Sanchez (he, him, his)
Geotab
Embedded Systems Developer Team Lead | Europe