I think I can explain this away as expected behaviour / quirks between different clients and masters. The key piece of information I was missing was what happens to the PHC as soon as ptp4l starts.
What I hadn't seen is that ptp4l immediately stepped the clock on startup based on the VS1's faster announce messages. So the clock is stepped based on the VS1's poorly advertised time, and then when the slower announce messages from the VS2 take priority the PHC is already 37 seconds out. Even though no names are named, I think the VS1 sending "creative" PTP Announce messages has been discussed here previously (https://sourceforge.net/p/linuxptp/mailman/message/36006501/). We've only seen this problem on Metamako switches because we usually run PTPd as the client on all our Linux servers. There are some non-PTP spec compliant configuration options in PTPd that allow it to be a little more flexible with how it interprets the advertised UTC offset - it can and has been told to absolutely trust the UTC Offset field, no matter what other flags are set. I've got the vague feeling I knew this 2 years ago and have forgotten... Either way, I think ptp4l can't handle the craziness because it's doing the "right thing". Since you're never going to make ptp4l less compliant, I think the only solution here is to never mix these two types of Grand Masters. ________________________________ From: Luke Bigum <luke.bi...@lmax.com> Sent: 30 April 2019 17:16 To: linuxptp-users Subject: [Linuxptp-users] ptp4l confusing UTC offset of different masters Hello, I've got some odd PTP behaviour from a Metamako switch (which runs it's own version of linuxptp-1.9), and have reproduced similar behaviour with linuxptp-2.0. We've got a new Grand Master appliance on trial, an Orelia VelaSync 2. It's plugged into a LAN in our lab along side it's predecessor, the VelaSync 1 (which is really FSM Labs' TimeKeeper Server). These masters deal with the UTC offset and leap seconds differently. The VS1 / TimeKeeper takes the inventive approach of publishing PTP messages containing UTC time and a UTC offset of zero (it does this so it can be in control of leap second slewing). The VS2 simply publishes TAI time with a UTC offset of 37 seconds. As I understand it, net result should be the same: UTC - 0s and TAI - 37s. I've set the VS2 priority1 field to 127, higher than the VS1's priority1 of 128 (the point of the exercise is to test the VS2). What appears to happen is ptp4l first tries to calibrate to the VS1 (the VS1 sends a boat load more Announce messages than the VS2), and sets the UTC offset to zero: ptp4l[419887.425]: selected best master clock 248a07.fffe.fdb701 ptp4l[419887.425]: updating UTC offset to 0 ptp4l[419887.425]: port 1: LISTENING to UNCALIBRATED on RS_SLAVE ptp4l[419888.699]: rms 172 max 221 freq -24777 +/- 102 delay 2141 +/- 7 ptp4l[419889.727]: rms 96 max 141 freq -24730 +/- 28 delay 2142 +/- 4 ptp4l[419890.739]: rms 54 max 79 freq -24747 +/- 41 delay 2145 +/- 3 Not long after the VS2 is discovered, and selected as the best clock. The new UTC offset flag is set to match the different time in the Follow_Up messages, but the clock jumps 37 seconds anyway: ptp4l[419888.434]: port 1: new foreign master ec0d9a.fffe.2a5680-1 ptp4l[419892.434]: selected best master clock ffffff.ffff.ffffff ptp4l[419892.434]: updating UTC offset to 37 ptp4l[419892.434]: port 1: SLAVE to UNCALIBRATED on RS_SLAVE ptp4l[419892.811]: port 1: UNCALIBRATED to SLAVE on MASTER_CLOCK_SELECTED ptp4l[419893.306]: selected best master clock ffffff.ffff.ffffff ptp4l[419893.306]: updating UTC offset to 37 ptp4l[419893.434]: rms 32034559090 max 37000000060 freq -23256182 +/- 13412686 delay 2139 +/- 20 ptp4l[419894.434]: rms 36963221067 max 36976771182 freq -31000000 +/- 0 ptp4l[419894.841]: selected best master clock ffffff.ffff.ffffff ptp4l[419894.841]: updating UTC offset to 37 ptp4l[419895.352]: selected best master clock ffffff.ffff.ffffff ptp4l[419895.352]: updating UTC offset to 37 ptp4l[419895.434]: rms 36930870374 max 36945061895 freq -31000000 +/- 0 delay -1099123 +/- 367074 Note the clock ID of 0xffff here is a bogus value from the VS2 announce messages, it's populating it's grandmasterClockIdentity field with a reserved address. I've already emailed Orelia about it (side question; why is ptp4l logging the grandmasterClockIdentity field in the log file as the "master" and not the clockIdentity field?). When we were just running the VS1, everything was fine, this worked. If you shut off the VS1 (or change it's PTP domain number) so just the VS2 messages are "seen", everything is also fine, it works. The craziness only occurs when there is a mix of messages from both masters on the LAN. Also, this doesn't affect 'ptpd' (which is what most of my clients run), it only affects ptp4l. The other odd thing from the log (and the Metamako switch) is it constantly spits out these messages: selected best master clock ffffff.ffff.ffffff Which comes from handle_state_decision_event() in clock.c (remember, ignore the clock ID), and: updating UTC offset to 37 Which is in clock_update_slave(). That seems odd to me... I would expect to see them during/after a BMC election, not over and over again. The PTP priority hasn't changed, VS2 is still best clock. Part of the condition of the second message is: (c->tds.currentUtcOffset > c->utc_offset) I wonder if the VS1 messages are somehow being processed (even though it's not master), and the UTC offset flag is being polluted and this is what's jamming the clock away from real time? Full ptp4l log file is attached. -- Luke Bigum Head of Systems, LMAX Exchange
_______________________________________________ Linuxptp-users mailing list Linuxptp-users@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/linuxptp-users