Hi Grygorii, On Thu, Dec 24, 2020 at 06:36:17PM +0200, Grygorii Strashko wrote: > Hi > > I've found that it's pretty easy to put ptp4l/phc2sys BC in sync fault state > while port is in transitional UNCALIBRATED state: > MASTER -> UNCALIBRATED > UNCALIBRATED -> SLAVE > in the later case state change sequence is > UNCALIBRATED -> SLAVE -> SYNCHRONIZATION_FAULT -> UNCALIBRATED. > Once issue happens the BC or GM has to be restarted to recover (not always > helps). > > To trigger an issue the GM clock has to be restarted few times. > > The cause of the issue is that both ptp4l and phc2sys could be syncing the > same > PHC clock while Port is in UNCALIBRATED state. This happens because > in reconfigure() the check [1] > > [1] if (dst_cnt > 1 && !src) { > ... > if (src_cnt > 1) { > ... > [2] if (src_cnt > 0 && !src) { > ... > if (!src_cnt && !dst_cnt) { > ... > > is defined first and selects last clock in the clocks list (usually first > defined > in ptp4l cfg file) as the master clock even if there is some other > clock(Port) is > in transitional UNCALIBRATED state. As result, phc2sys will start syncing the > same clock as ptp4l, and ptp4l has no chances to stabilize Port in > UNCALIBRATED > state. Moreover, check [1] bypasses most of follow up checks unless > CLOCK_REALTIME is allowed to be master clock. > > The original commit 46a0b281b915 ("phc2sys: autoconfiguration") gave ptp4l a > chance to proceed by introducing check [2]. But follow up commit 2ab2fbbdda86 > ("phc2sys: default to the first clock in automatic mode.") introduced check > [1] > and so this issue. > > To fix an issue the check [2] is moved down which makes phc2sys to wait for > ptp4l to settle. > > Patch 1 - add debug prints when clock state is changed > Patch 2 - fixes an issue > > BC configuration (jbod): > - eth1, phc index 0 > - pru10, phc index 1 > - pru11, phc index 1 > - pru20, phc index 2 <-- GM > - eth21, phc index 2 > > BC started as: > sleep > ./linuxptp/ptp4l -mf abb_bc.cfg -l5 & > sleep 2 > ./linuxptp/phc2sys -a -S 0.00002 -L 0 -m -l7 -u$1 & > > Example log: > ptp4l[688.126]: port 4: UNCALIBRATED to SLAVE on MASTER_CLOCK_SELECTED > phc2sys[688.294]: port 5cf821.fffe.3b0305-4 changed state > phc2sys[688.295]: reconfiguring after port state change > phc2sys[688.295]: selecting pru21 for synchronization > phc2sys[688.295]: pru20: state change UNCALIBRATED -> SLAVE > phc2sys[688.295]: selecting pru11 for synchronization > phc2sys[688.295]: skipping pru10: pru11 has the same clock and is already > selected > phc2sys[688.296]: eth1: state change SLAVE -> MASTER > phc2sys[688.296]: selecting eth1 for synchronization > phc2sys[688.296]: selecting pru20 as the master clock > ^^^^^^ GM detected properly and BC sync switched to pru20 as master > > phc2sys[688.296]: eth1 rms 21 max 21 freq +22755 +/- 0 delay 26781 +/- 0 > phc2sys[688.296]: pru11 rms 134 max 134 freq +22664 +/- 0 delay 5200 +/- 0 > phc2sys[689.297]: eth1 rms 554 max 554 freq +23376 +/- 0 delay 26135 +/- 0 > ... > phc2sys[693.299]: pru11 rms 205 max 205 freq +22725 +/- 0 delay 5135 +/- 0 > ptp4l[694.126]: port 4: SLAVE to MASTER on ANNOUNCE_RECEIPT_TIMEOUT_EXPIRES > ptp4l[694.126]: selected local clock 5cf821.fffe.3b0305 as best master > ptp4l[694.126]: port 1: assuming the grand master role > ptp4l[694.126]: port 2: assuming the grand master role > ptp4l[694.126]: port 3: assuming the grand master role > ptp4l[694.126]: port 4: assuming the grand master role > ptp4l[694.126]: port 5: assuming the grand master role > phc2sys[694.299]: port 5cf821.fffe.3b0305-4 changed state > phc2sys[694.300]: reconfiguring after port state change > phc2sys[694.300]: selecting pru21 for synchronization > phc2sys[694.300]: pru20: state change SLAVE -> MASTER > phc2sys[694.300]: skipping pru20: pru21 has the same clock and is already > selected > phc2sys[694.300]: selecting pru11 for synchronization > phc2sys[694.300]: skipping pru10: pru11 has the same clock and is already > selected > phc2sys[694.300]: selecting eth1 for synchronization > phc2sys[694.300]: no source, selecting eth1 as the default clock > ^^^^^^ GM stopped and BC sync switched to eth1 as master > > phc2sys[694.301]: pru11 rms 582 max 582 freq +23450 +/- 0 delay 26930 +/- 0 > phc2sys[694.303]: pru21 rms 686 max 686 freq +23427 +/- 0 delay 26929 +/- 0 > ... > phc2sys[700.310]: pru11 rms 214 max 214 freq +22603 +/- 0 delay 26844 +/- 0 > phc2sys[700.311]: pru21 rms 111 max 111 freq +22310 +/- 0 delay 26485 +/- 0 > ptp4l[701.130]: selected best master clock 5051a9.fffe.fc7697 > ptp4l[701.130]: port 4: MASTER to UNCALIBRATED on RS_SLAVE > phc2sys[701.311]: port 5cf821.fffe.3b0305-4 changed state > phc2sys[701.313]: reconfiguring after port state change > phc2sys[701.313]: selecting pru21 for synchronization > phc2sys[701.313]: pru20: state change MASTER -> UNCALIBRATED > phc2sys[701.313]: selecting pru11 for synchronization > phc2sys[701.313]: skipping pru10: pru11 has the same clock and is already > selected > phc2sys[701.313]: eth1: state change SLAVE -> MASTER > phc2sys[701.313]: selecting eth1 for synchronization > phc2sys[701.313]: no source, selecting eth1 as the default clock > ^^^^^^ GM detected pru20=UNCALIBRATED, but BC sync still uses eth1 as master > at this moment both ptp4l and phc2sys syncing the same pru20 clock
Why would phc2sys attempt to add pru20 here to its list of destination (synchronizable) clocks if it's aware that it's in the UNCALIBRATED state (aka a candidate for a source clock, but definitely not a destination clock)? I would expect that even if eth1 is still temporarily used as source clock by phc2sys, phc2sys should never attempt to discipline pru20's clock. That means that the problem is still potentially left partially unsolved. In this case, the reason why pru20's clock is chosen for synchronization is not due to pru20's state, but due to pru21's state, and due to the fact that pru20 and pru21 share the same PHC. I would expect that after computing all the candidate source clocks, phc2sys removes those source clocks from the list of destination clocks, to proactively avoid any issues like this. > > phc2sys[701.314]: pru11 rms 41 max 41 freq +22412 +/- 0 delay 26805 +/- 0 > phc2sys[701.314]: pru21 rms 266 max 266 freq +22654 +/- 0 delay 27059 +/- 0 > phc2sys[702.314]: pru11 rms 73 max 73 freq +22368 +/- 0 delay 26644 +/- 0 > phc2sys[702.314]: pru21 rms 134 max 134 freq +22334 +/- 0 delay 26774 +/- 0 > phc2sys[703.315]: pru11 rms 106 max 106 freq +22313 +/- 0 delay 26509 +/- 0 > phc2sys[703.315]: pru21 rms 4554 max 4554 freq +17874 +/- 0 delay 26539 +/- 0 > phc2sys[704.315]: pru11 rms 34 max 34 freq +22421 +/- 0 delay 26540 +/- 0 > phc2sys[704.317]: pru21 rms 0 max -0 freq +21061 +/- 0 delay 26505 +/- 0 > phc2sys[705.317]: pru11 rms 4 max 4 freq +22401 +/- 0 delay 26520 +/- 0 > phc2sys[705.317]: pru21 rms 1548 max 1548 freq +22609 +/- 0 delay 26885 +/- 0 > phc2sys[706.318]: pru11 rms 140 max 140 freq +22539 +/- 0 delay 26729 +/- 0 > phc2sys[706.318]: pru21 rms 5790 max 5790 freq +15736 +/- 0 delay 26399 +/- 0 > phc2sys[707.318]: pru11 rms 108 max 108 freq +22549 +/- 0 delay 26900 +/- 0 > phc2sys[707.319]: pru21 rms 1204 max 1204 freq +20993 +/- 0 delay 27145 +/- 0 > ... > phc2sys[752.388]: pru11 rms 114 max 114 freq +22360 +/- 0 delay 26744 +/- > 0 > phc2sys[752.389]: pru21 rms 3866487 max 3866487 freq -972697 +/- 0 delay > 26663 +/- 0 > phc2sys[753.389]: pru11 rms 81 max 81 freq +22359 +/- 0 delay 26644 +/- > 0 > phc2sys[753.390]: pru21 rms 6925537 max 6925537 freq -4039400 +/- 0 delay > 27141 +/- 0 > ^^^^^^ result: pru20 stuck in UNCALIBRATED state and it's clock going out of > sync > > Log after fix: > ptp4l[265.606]: port 4: SLAVE to MASTER on ANNOUNCE_RECEIPT_TIMEOUT_EXPIRES > ptp4l[265.606]: selected local clock 5cf821.fffe.3b0305 as best master > ptp4l[265.606]: port 1: assuming the grand master role > ptp4l[265.606]: port 2: assuming the grand master role > ptp4l[265.606]: port 3: assuming the grand master role > ptp4l[265.606]: port 4: assuming the grand master role > ptp4l[265.606]: port 5: assuming the grand master role > phc2sys[265.618]: port 5cf821.fffe.3b0305-4 changed state > phc2sys[265.618]: reconfiguring after port state change > phc2sys[265.618]: selecting pru21 for synchronization > phc2sys[265.618]: pru20: state change SLAVE -> MASTER > phc2sys[265.618]: skipping pru20: pru21 has the same clock and is already > selected > phc2sys[265.618]: selecting pru11 for synchronization > phc2sys[265.618]: skipping pru10: pru11 has the same clock and is already > selected > phc2sys[265.618]: selecting eth1 for synchronization > phc2sys[265.618]: no source, selecting eth1 as the default clock > phc2sys[265.619]: pru11 rms 463 max 463 freq +23348 +/- 0 delay 26844 +/- 0 > phc2sys[265.619]: pru21 rms 298 max 298 freq +23268 +/- 0 delay 26420 +/- 0 > ... > phc2sys[269.622]: pru11 rms 141 max 141 freq +22977 +/- 0 delay 26489 +/- 0 > phc2sys[269.623]: pru21 rms 196 max 196 freq +22941 +/- 0 delay 26454 +/- 0 > ptp4l[270.610]: selected best master clock 5051a9.fffe.fc7697 > ptp4l[270.610]: port 4: MASTER to UNCALIBRATED on RS_SLAVE > phc2sys[270.623]: port 5cf821.fffe.3b0305-4 changed state > phc2sys[270.624]: reconfiguring after port state change > phc2sys[270.624]: selecting pru21 for synchronization > phc2sys[270.625]: pru20: state change MASTER -> UNCALIBRATED > phc2sys[270.625]: selecting pru11 for synchronization > phc2sys[270.625]: skipping pru10: pru11 has the same clock and is already > selected > phc2sys[270.625]: eth1: state change SLAVE -> MASTER > phc2sys[270.625]: selecting eth1 for synchronization > phc2sys[270.625]: master clock not ready, waiting... > ptp4l[272.609]: port 4: UNCALIBRATED to SLAVE on MASTER_CLOCK_SELECTED > phc2sys[272.625]: port 5cf821.fffe.3b0305-4 changed state > phc2sys[272.626]: reconfiguring after port state change > phc2sys[272.626]: selecting pru21 for synchronization > phc2sys[272.627]: pru20: state change UNCALIBRATED -> SLAVE > phc2sys[272.627]: selecting pru11 for synchronization > phc2sys[272.627]: skipping pru10: pru11 has the same clock and is already > selected > phc2sys[272.627]: selecting eth1 for synchronization > phc2sys[272.627]: selecting pru20 as the master clock > phc2sys[272.627]: eth1 rms 175 max 175 freq +23076 +/- 0 delay 26405 +/- 0 > phc2sys[272.627]: pru11 rms 64 max 64 freq +23012 +/- 0 delay 5144 +/- 0 > phc2sys[273.628]: eth1 rms 428 max 428 freq +22675 +/- 0 delay 26488 +/- 0 > phc2sys[273.628]: pru11 rms 232 max 232 freq +22825 +/- 0 delay 5160 +/- 0 _______________________________________________ Linuxptp-devel mailing list Linuxptp-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/linuxptp-devel