Hi Miroslav,

Sure. Say we're running as a single-port ordinary clock: `ptp4l -i eth0`,
and the system also has at least one other interface (eth1, ...), although
ptp4l is not configured to do anything with, or care about, them.

When eth0 goes into the FAULTY state, a 16-second (by default) timeout is
armed to schedule clearing the fault and trying again, provided eth0 has
link-up at that time.

In the case I saw (there may be others), my eth0 went FAULTY because I am
able to induce a lost tx-timestamp from the hw/driver; maybe the NIC lost
it, maybe the network stack dropped the outgoing skb for some reason, etc.

Ordinarily, we get the logs about "timed out while polling for tx timestamp"
and "increasing tx_timestamp_timeout may correct this issue, but it is
likely caused by a driver bug". After 16 seconds the timeout expires, ptp4l
logs "clearing fault on port 1", and the port resuscitates itself. This is
all fine and expected.

However - if eth1 (or any other interface that is not eth0) is flapping
(perhaps other status changes trigger this too?), then we are still
processing a stream of netlink messages. At the bottom of the call stack,
rtnl_link_status() correctly realises that they're not related to eth0, so
the continue case hits and we never call the callback. My understanding is
that this is effectively a NOP as far as the port's own state is concerned.
But, after we come back up the stack to clock_poll() without having done
anything about the netlink message, we hit this part:
    /* Clear any fault after a little while. */
    if (PS_FAULTY == port_state(p)) {
        clock_fault_timeout(p, 1);
        break;
    }
This is a level-triggered check on the port state: even though the port was
already FAULTY, the check hits and the timeout gets re-armed. We must wait
16 more seconds, and hope not to get another RTNL message that resets the
timer again.

This is somewhat alluded to in
https://github.com/richardcochran/linuxptp/commit/eec07ae24092517ba35e366b8629468fc9de66a4

which fixed a similar re-arming issue between ports: "Arm the
fault-clearing timer only when an event causes a port to change state to
PS_FAULTY." But we still do the arming for any event while the state *is*
FAULTY (not only when changing *to* FAULTY), and I believe it is only
necessary on the transition.

One possible fix is to note the previous state and only arm the timeout on
the transition to FAULTY, making it edge-triggered instead; this is what I
submitted for review.
Other fixes might include port_event() returning whether it was responsible
for the current fault state, or whether it wants a timeout to be [re]set,
etc. Alternatively, there might be a way to prevent RTNL messages for
interfaces we don't care about from arriving in the first place; I am not
terribly familiar with that API.

In summary:

Previously: if eth0 goes FAULTY, it may *never* recover if a totally
unrelated eth1 is flapping and producing a stream of irrelevant RTNL
messages.
With this patch (or an equivalent): if eth0 goes FAULTY, the fault clears
and the port retries after 16 seconds no matter what eth1 might be doing.

Thanks,
David

Logs:
ptp4l[1561.820]: port 1: master tx announce timeout
ptp4l[1561.882]: port 1: master sync timeout
ptp4l[1562.883]: port 1: master sync timeout
ptp4l[1562.893]: timed out while polling for tx timestamp
ptp4l[1562.894]: increasing tx_timestamp_timeout may correct this issue,
but it is likely caused by a driver bug
ptp4l[1562.894]: port 1: send sync failed
ptp4l[1562.894]: port 1: MASTER to FAULTY on FAULT_DETECTED (FT_UNSPECIFIED)
ptp4l[1562.896]: waiting 2^{4} seconds to clear fault on port 1
ptp4l[1571.922]: port 1: received link status notification
<snip>
ptp4l[1619.391]: port 1: received link status notification
ptp4l[1619.396]: waiting 2^{4} seconds to clear fault on port 1
ptp4l[1635.397]: clearing fault on port 1
..
ptp4l[1635.488]: port 1: FAULTY to LISTENING on INIT_COMPLETE
ptp4l[1635.491]: port 1: received link status notification
ptp4l[1635.494]: interface index 151 is up

The missing tx-timestamp is expected, but waiting 72.5 seconds instead of
16 before clearing the fault is not: link-status notifications about
unrelated links (confirmed via strace) reset the timer each time. Even
then, it only cleared after 72 seconds because I stopped flapping eth1
manually; I expect this would have continued indefinitely had I persisted.

(It might be worth noting that in my case eth0 and eth1 are different NICs
with different drivers, etc., so I don't believe this to be a
cross-pollination issue at the hw/driver layer.)


On Wed, 23 Nov 2022 at 18:55, Miroslav Lichvar <mlich...@redhat.com> wrote:

> On Tue, Nov 22, 2022 at 08:59:48PM -0800, davidjm via Linuxptp-devel wrote:
> > Set the timer only when an event causes the port to transition to the
> > FAULTY state. Previously this was done based only on the port's current
> > state, and events on that port (such as RTNL messages) cause it to
> > repeatedly re-arm itself, potentially never clearing.
>
> Do you please provide an example where this makes a difference?
>
> --
> Miroslav Lichvar
>
>
_______________________________________________
Linuxptp-devel mailing list
Linuxptp-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/linuxptp-devel
