Hi Miroslav, Sure. Say we're running as a single-port ordinary clock: `ptp4l -i eth0` and the system also has at least one other interface (eth1, ...) although ptp4l is not configured to do anything with, or care about them.
When eth0 goes into a FAULTY state, a 16 second (by default) timeout is armed, to schedule a clearing of the fault and trying again, so long as eth0 has link-up at that time. In the case I saw (there may be others) my eth0 went FAULTY because I am able to induce a lost tx-timestamp from the hw/driver, but I suppose maybe the NIC lost it, maybe the network stack dropped the outgoing skb for some reason, etc. Ordinarily, we get the logs about "timed out while polling for tx timestamp" and "increasing tx_timestamp_timeout may correct this issue, but it is likely caused by a driver bug". After 16 seconds the timeout expires, ptp4l logs "clearing fault on port 1" and the port resuscitates itself. This is fine and good and expected. However - if eth1 (or any other interface that is not eth0) is flapping (perhaps other status changes trigger this too?), then we are still processing a stream of netlink messages. At the bottom of the call stack, rtnl_link_status() correctly realises that they're not related to eth0 so the continue case hits and we never call the callback. My understanding is that this is effectively a NOP as far as the port's own state is concerned. But, after we come back up the stack to clock_poll() after not doing anything about the netlink message, we hit his part: /* Clear any fault after a little while. */ if (PS_FAULTY == port_state(p)) { clock_fault_timeout(p, 1); break; } Which is like a level-triggered check on the port state. Even though it was already Faulty, this check hits and the timeout gets reset. We must wait 16 more seconds, and hope not to get another RTNL message that resets the clock again. This is somewhat alluded to in https://github.com/richardcochran/linuxptp/commit/eec07ae24092517ba35e366b8629468fc9de66a4 which fixed a similar re-arming issue bewteen ports: "Arm the fault-clearing timer only when an event causes a port to change state to PS_FAULTY." But we do the arming for any event while the state *is* Faulty (not when changing *to* faulty), and I believe is only necessary on the transition. One possible fix is to note the previous state and only arm the timeout on transition to faulty, making it edge-triggered instead, as submitted for review. Perhaps other fixes might include port_event returning whether it was responsible for the current fault state or whether it wants a timeout to be [re]set, etc. Alternatively there might be a way to prevent RTNL messages for interfaces we dont care about arriving in the first place, I am not terribly familiar with this API. In summary: Previously: if eth0 goes FAULTY, it may *never* recover if totally unrelated eth1 is flapping and producing a stream of useless RTNL messages. With this patch (or an equivalent): if eth0 goes faulty, the fault clears and retries after 16 seconds no matter what eth1 might be doing. Thanks, David Logs: ptp4l[1561.820]: port 1: master tx announce timeout ptp4l[1561.882]: port 1: master sync timeout ptp4l[1562.883]: port 1: master sync timeout ptp4l[1562.893]: timed out while polling for tx timestamp ptp4l[1562.894]: increasing tx_timestamp_timeout may correct this issue, but it is likely caused by a driver bug ptp4l[1562.894]: port 1: send sync failed ptp4l[1562.894]: port 1: MASTER to FAULTY on FAULT_DETECTED (FT_UNSPECIFIED) ptp4l[1562.896]: waiting 2^{4} seconds to clear fault on port 1 ptp4l[1571.922]: port 1: received link status notification <snip> ptp4l[1619.391]: port 1: received link status notification ptp4l[1619.396]: waiting 2^{4} seconds to clear fault on port 1 ptp4l[1635.397]: clearing fault on port 1 .. ptp4l[1635.488]: port 1: FAULTY to LISTENING on INIT_COMPLETE ptp4l[1635.491]: port 1: received link status notification ptp4l[1635.494]: interface index 151 is up The missing tx-timestamp is expected, but waiting 72.5 seconds instead of 16 before clearing the fault is not; because link-status-notifications about unrelated links (confirmed via strace) reset the clock each time. Even then, it only cleared after 72s because I stopped flapping eth1 manually, I expect this would have continued indefinitely had I persisted. (it might be worth noting that in my case eth0 and eth1 are different NICs, different drivers, etc so I dont believe it to be a cross-pollination issue at the hw/driver layer) On Wed, 23 Nov 2022 at 18:55, Miroslav Lichvar <mlich...@redhat.com> wrote: > On Tue, Nov 22, 2022 at 08:59:48PM -0800, davidjm via Linuxptp-devel wrote: > > Set the timer only when an event causes the port to transition to the > > FAULTY state. Previously this was done based only on the port's current > > state, and events on that port (such as RTNL messages) cause it to > > repeatedly re-arm itself, potentially never clearing. > > Do you please provide an example where this makes a difference? > > -- > Miroslav Lichvar > >
_______________________________________________ Linuxptp-devel mailing list Linuxptp-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/linuxptp-devel