Hello,

I have several servers doing measurements. Each server logs its timestamped measurements independently. Every few seconds, all the measurements are grouped together in a database. Within this cluster of servers, it is very important to accurately reconcile the timestamped measurements to obtain a consistent time series. Until now, an error of a few milliseconds was acceptable, so I was using NTP. Now the sampling frequency has increased, so I need to be much more accurate. I am trying to do that with PTP.
This seems to be a use case very similar to https://sourceforge.net/p/linuxptp/mailman/message/35665802/ : "Relative (device to device) time accuracy is important but absolute world-time only needs to match roughly (couple of milliseconds)". To be honest, absolute world-time accuracy is a very distant concern for now: I would rather improve the relative accuracy by several microseconds, even if the trade-off degrades the absolute time accuracy by several milliseconds (!!). I haven't found many people with the same concerns.

The servers are in a datacenter where I can't add anything like a GPS grandmaster, but I otherwise have full control of the bare-metal servers. To help with other issues related to database latency jitter, each server is now directly connected to every other server, pair by pair, using simple crossover RJ45 cables between extra NICs. All the NICs are Intel e1000e. For example, for a cluster of 3 servers:

    Serv1->switch->internet : eth0
    Serv1->Serv2 : eth2
    Serv1->Serv3 : eth3

    Serv2->switch->internet : eth0
    Serv2->Serv1 : eth1
    Serv2->Serv3 : eth3

    Serv3->switch->internet : eth0
    Serv3->Serv1 : eth1
    Serv3->Serv2 : eth2

Each link is configured with IPv4 and IPv6, and everything works (ping, ping6, arping...). To give an idea of the jitter on the LAN, here are the arping results when sending 10 packets:

- from server1 to server2: rtt min/avg/max/std-dev = 0.111/0.135/0.181/0.018 ms
- from server2 to server3 (link with the lowest load): rtt min/avg/max/std-dev = 0.081/0.102/0.126/0.018 ms
- from server1 to server3 (link with the highest load): rtt min/avg/max/std-dev = 0.122/0.311/0.846/0.235 ms

I would now like to take advantage of these extra NICs to improve the relative time accuracy as much as I can, but I am an absolute beginner with PTP.
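As a rough sanity check on the arping numbers above, here is a minimal Python sketch that converts them to microseconds and compares them with the 150 us tolerance I use in my configuration. It assumes a symmetric path when halving the average RTT, which arping does not verify, and the function and variable names are made up for illustration.

```python
# RTT figures copied from the arping runs above: (min, avg, max, std-dev) in ms.
RTT_MS = {
    "server1-server2": (0.111, 0.135, 0.181, 0.018),
    "server2-server3": (0.081, 0.102, 0.126, 0.018),
    "server1-server3": (0.122, 0.311, 0.846, 0.235),
}

# The 150 us tolerance I picked for my configuration (not a recommended value).
TOLERANCE_US = 150.0


def summarize(rtt_ms, tolerance_us=TOLERANCE_US):
    """Return per-link figures in microseconds.

    one_way_avg_us is half of the average RTT, which is only valid
    under the assumption that the path is symmetric.
    """
    summary = {}
    for link, (rtt_min, rtt_avg, rtt_max, _rtt_std) in rtt_ms.items():
        summary[link] = {
            "one_way_avg_us": rtt_avg * 1000.0 / 2.0,
            "rtt_spread_us": (rtt_max - rtt_min) * 1000.0,
            "max_rtt_over_tolerance": rtt_max * 1000.0 > tolerance_us,
        }
    return summary


if __name__ == "__main__":
    for link, s in summarize(RTT_MS).items():
        print(
            f"{link}: one-way ~{s['one_way_avg_us']:.1f} us, "
            f"RTT spread {s['rtt_spread_us']:.0f} us, "
            f"max RTT over {TOLERANCE_US:.0f} us: {s['max_rtt_over_tolerance']}"
        )
```

With only 10 packets per link these are indicative figures at best, but they do suggest that the loaded server1-server3 link behaves quite differently from the other two.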
I have read everything I could, then I started using linuxptp timemaster with the NTP servers from the datacenter, and with my NICs simply configured for ptp4l with a 150 us tolerance given the rtt results, i.e. on server1:

    [ptp_domain 1]
    interfaces eth3
    ptp4l_option clock_servo linreg
    # 150 microseconds
    delay 150e-6

(likewise for eth2, and for eth0 even though there is a switch which may add delays)

Then 'systemctl status timemaster' shows everything running normally:

    CGroup: /system.slice/timemaster.service
            ├─53151 /usr/sbin/timemaster -f /etc/linuxptp/timemaster.conf
            ├─53153 /usr/sbin/chronyd -n -f /var/run/timemaster/chrony.conf
            ├─53156 /usr/sbin/ptp4l -l 5 -f /var/run/timemaster/ptp4l.0.conf -H -i eth0
            ├─53157 /usr/sbin/phc2sys -E linreg -a -r -R 1.00 -z /var/run/timemaster/ptp4l.0.socket -t [0:eth0] -n 0 -E ntpshm -M 0
            ├─53158 /usr/sbin/ptp4l -l 5 -f /var/run/timemaster/ptp4l.1.conf -H -i eth2
            ├─53160 /usr/sbin/phc2sys -E linreg -a -r -R 1.00 -z /var/run/timemaster/ptp4l.1.socket -t [1:eth2] -n 1 -E ntpshm -M 1
            ├─53161 /usr/sbin/ptp4l -l 5 -f /var/run/timemaster/ptp4l.2.conf -H -i eth3
            └─53163 /usr/sbin/phc2sys -E linreg -a -r -R 1.00 -z /var/run/timemaster/ptp4l.2.socket -t [1:eth3] -n 1 -E ntpshm -M 2

However, PTP doesn't work unless I manually start ptp4l on the other servers, using specifically created systemd units to manage the separate interfaces and master/slave options. 'chronyc sources' shows:

    210 Number of sources = 6
    MS Name/IP address         Stratum Poll Reach LastRx Last sample
    ===============================================================================
    #? PTP0                          0   2     0     -     +0ns[   +0ns] +/-    0ns
    #? PTP1                          0   2     0     -     +0ns[   +0ns] +/-    0ns
    #? PTP2                          0   2     0     -     +0ns[   +0ns] +/-    0ns
    (...)
Based on my understanding of timemaster, this is because it hardcodes "slaveOnly 1" in the configuration files it generates in /var/run/timemaster/ptp4l.?.conf, and adding SlaveOnly 0 to the [ptp4l.conf] section of /etc/linuxptp/timemaster.conf doesn't help (both SlaveOnly 0 and SlaveOnly 1 are then present in the generated configuration file). With the default configuration, all the servers are slave-only and nobody ever becomes a master.

To fix that, I created custom systemd units instead of using timemaster. For example, on server3 I use:

    ExecStart=/usr/sbin/phc2sys -w -z /var/run/ptp4l.%i.socket -s %i
    (...)
    ExecStart=/usr/sbin/ptp4l -f /etc/linuxptp/ptp4l.conf --uds_address=/var/run/ptp4l.%i.socket -i %i

ptp4l.conf is the vanilla Debian configuration file with just "clock_servo linreg" added.

Then on server1, I can get chronyc sources to at least see the PTP source from eth3 using a very vanilla timemaster.conf:

    [ptp_domain 0]
    interfaces eth0
    delay 150e-6

    [ptp_domain 1]
    interfaces eth2
    ptp4l_option clock_servo linreg
    delay 150e-6

    [ptp_domain 1]
    interfaces eth3
    ptp4l_option clock_servo linreg
    delay 150e-6

    [timemaster]
    ntp_program chronyd

    [chrony.conf]
    include /etc/chrony.conf

    [ntp.conf]
    includefile /etc/ntp.conf

    [ptp4l.conf]

    [chronyd]
    path /usr/sbin/chronyd

    [ntpd]
    path /usr/sbin/ntpd
    options -u ntp:ntp -g

    [phc2sys]
    path /usr/sbin/phc2sys
    options -E linreg

    [ptp4l]
    path /usr/sbin/ptp4l

I just need to start the services manually on server3:

    systemctl start ptp4l@eth1
    systemctl start phc2sys@eth1

Yet if I do that, the "#x" marker shows that chrony considers the PTP source a falseticker:

    #x PTP2                          0   2    77     3   +97.9s[ +97.9s] +/-   44ms

I can't fault it for thinking that, since there is a 98 second offset (!!). If I wait a bit, the estimated error shrinks, but the offset does not:

    #x PTP2                          0   2   377     6   +97.9s[ +97.9s] +/- 5442us
    (...)
    #x PTP2                          0   2   377     4   +97.9s[ +97.9s] +/-  379us
    (...)
    #x PTP2                          0   2   177     5   +98.0s[ +98.0s] +/-   82us

I believe I am doing several things wrong. For example, I should set the PHC from the system time instead of doing the opposite, but I have not found how to do that with timemaster.

I did some more reading, but I could not find anything close to what I need, except maybe https://github.com/not1337/pps-stuff which uses a GPS to serve the time by PTP. Based on that, on server3 I run a "sys2phc" script instead of the systemd phc2sys@eth1 unit:

    phc2sys -s CLOCK_REALTIME -c eth1 -O 0 -R 10 -N 2 -E linreg -L 50000000 -n 0 -q -m

Then server1 sees the time on the PTP source jump back and forth, for example:

    phc2sys[4656203.109]: [1:eth3] eth3 sys offset 224 s2 freq -5704 delay 12840
    phc2sys[4656203.209]: [1:eth3] eth3 sys offset -38178389 s2 freq -24180511 delay 5408
    phc2sys[4656203.309]: [1:eth3] clockcheck: clock jumped backward or running slower than expected!
    phc2sys[4656203.309]: [1:eth3] eth3 sys offset -35789106 s0 freq -24180511 delay 12946
    phc2sys[4656203.410]: [1:eth3] eth3 sys offset -33367774 s0 freq -24180511 delay 13232
    phc2sys[4656203.510]: [1:eth3] eth3 sys offset -30946905 s0 freq -24180511 delay 13396
    phc2sys[4656203.610]: [1:eth3] eth3 sys offset -28525849 s2 freq -285263527 delay 13355
    phc2sys[4656203.710]: [1:eth3] eth3 sys offset 36575 s2 freq +360178 delay 16503
    phc2sys[4656203.810]: [1:eth3] eth3 sys offset 4576 s2 freq +22734 delay 12835
    phc2sys[4656203.910]: [1:eth3] eth3 sys offset 1830 s2 freq +18307 delay 12880
    phc2sys[4656204.010]: [1:eth3] eth3 sys offset -523 s2 freq +12767 delay 12760

I tried killing the phc2sys and ptp4l processes started by timemaster on server1 and replacing them with similar scripts, but it doesn't help.

At this point, I am stuck: I can't get PTP to work. Could someone please help me with the configuration to at least have PTP working over UDP?
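In case it helps anyone reading the phc2sys log above, here is a minimal Python sketch that pulls the offset, servo state, frequency and delay out of such lines and flags the samples whose offset exceeds a threshold. The 1 ms threshold and the function names are arbitrary choices of mine for illustration, and the regular expression only targets the line format shown in this post.

```python
import re

# A subset of the phc2sys lines from the log above.
LOG = """\
phc2sys[4656203.109]: [1:eth3] eth3 sys offset 224 s2 freq -5704 delay 12840
phc2sys[4656203.209]: [1:eth3] eth3 sys offset -38178389 s2 freq -24180511 delay 5408
phc2sys[4656203.309]: [1:eth3] clockcheck: clock jumped backward or running slower than expected!
phc2sys[4656203.309]: [1:eth3] eth3 sys offset -35789106 s0 freq -24180511 delay 12946
phc2sys[4656203.710]: [1:eth3] eth3 sys offset 36575 s2 freq +360178 delay 16503
"""

# Matches only the "offset ... sN freq ... delay ..." sample lines.
LINE_RE = re.compile(
    r"phc2sys\[(?P<ts>[\d.]+)\]:.*\boffset\s+(?P<offset>[+-]?\d+)\s+"
    r"(?P<state>s\d)\s+freq\s+(?P<freq>[+-]?\d+)\s+delay\s+(?P<delay>\d+)"
)


def parse(log):
    """Return one dict per sample line; other lines (e.g. clockcheck) are skipped."""
    samples = []
    for line in log.splitlines():
        m = LINE_RE.search(line)
        if m:
            samples.append({
                "ts": float(m.group("ts")),
                "offset_ns": int(m.group("offset")),
                "state": m.group("state"),
                "freq_ppb": int(m.group("freq")),
                "delay_ns": int(m.group("delay")),
            })
    return samples


def find_jumps(samples, threshold_ns=1_000_000):
    """Return the samples whose absolute offset exceeds threshold_ns."""
    return [s for s in samples if abs(s["offset_ns"]) > threshold_ns]


if __name__ == "__main__":
    for s in find_jumps(parse(LOG)):
        print(f"{s['ts']:.3f}: offset {s['offset_ns'] / 1e6:.1f} ms, state {s['state']}")
```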
I would also be interested in replacing UDPv4 with L2, since it may help with the accuracy, and in any other tuning that could improve the relative time accuracy inside the cluster (and reduce the time it takes to synchronize to the cluster after a reboot).

Thanks
_______________________________________________
Linuxptp-users mailing list
Linuxptp-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/linuxptp-users