Hello

I have several servers doing measurements. Each server is logging its 
timestamped measurements independently. Every few seconds, all the 
measurements are grouped together in a database. Within this cluster of 
servers, it is very important to accurately reconcile the timestamped 
measurements to have a consistent time series. Until now, a few milliseconds 
were acceptable so I was using NTP. Now, the sampling frequency has increased, 
so I need to be much more accurate. I am trying to do that with PTP.

This seems to be a use case very similar to 
https://sourceforge.net/p/linuxptp/mailman/message/35665802/ :  "Relative 
(device to device) time accuracy is important but absolute world-time only 
needs to match roughly (couple of milliseconds)".
To be honest, absolute world time accuracy is a very distant concern for now: 
it is preferable to increase the relative accuracy by several microseconds, 
even if the trade-off for doing that would degrade the absolute time accuracy 
by several milliseconds (!!). I haven't found many people with the same 
concerns.

The servers are in a datacenter where I can't add anything like a GPS 
grandmaster, but I have otherwise full control of the baremetal servers: to 
help with other issues related to database latency jitter, I now have each 
server directly connected to every other in pairs, using simple crossover RJ45 
cables between extra NICs. All the NICs are Intel e1000e.
For example, for a cluster of 3 servers:

Serv1->switch->internet : eth0
Serv1->Serv2 : eth2
Serv1->Serv3 : eth3

Serv2->switch->internet : eth0
Serv2->Serv1 : eth1
Serv2->Serv3 : eth3

Serv3->switch->internet : eth0
Serv3->Serv1 : eth1
Serv3->Serv2 : eth2
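
To make the wiring rule explicit, here is a small Python sketch of the full mesh. The interface naming rule (the NIC facing ServN is called ethN) is only inferred from the 3-server listing above, and the helper name iface_towards is made up for illustration.

```python
from itertools import combinations

servers = [1, 2, 3]

def iface_towards(peer):
    # Assumption inferred from the listing above: the NIC facing ServN is ethN.
    return f"eth{peer}"

# One crossover cable per unordered pair of servers: N*(N-1)/2 links in total.
links = list(combinations(servers, 2))

for a, b in links:
    print(f"Serv{a} {iface_towards(b)} <-> Serv{b} {iface_towards(a)}")
```

The number of direct links grows quadratically with the number of servers, so this layout is only practical for a small cluster.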

Each link is configured with IPv4 and IPv6, and everything works (ping, ping6, 
arping...).
Using arping, to give an idea of the jitter on the lan, sending 10 packets:
- from server1 to server2:
rtt min/avg/max/std-dev = 0.111/0.135/0.181/0.018 ms
- from server2 to server3 (link with the lowest load):
rtt min/avg/max/std-dev = 0.081/0.102/0.126/0.018 ms
- from server1 to server3 (link with the highest load):
rtt min/avg/max/std-dev = 0.122/0.311/0.846/0.235 ms

I would now like to take advantage of these extra NICs to increase the relative 
time accuracy as much as I can, but I am an absolute beginner with PTP.
I have read everything I could find, then I started to use linuxptp timemaster 
with the NTP servers from the datacenter and my NICs simply configured for ptp4l 
with a 150 us tolerance given the rtt results, i.e. on server1:

[ptp_domain 1]

interfaces eth3
ptp4l_option clock_servo linreg
# 150 microseconds
delay 150e-6

(likewise for eth2, and for eth0 even though the switch there may add delays)
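
As a rough cross-check of the 150 us figure against the arping results above, here is a small Python sketch. The numbers are copied from the arping output; the names rtt_ms, DELAY_US and links_over are made up for illustration, and comparing arping round trips with this setting is my own assumption about what is meaningful.

```python
# arping RTT statistics in milliseconds (min, avg, max), copied from above.
rtt_ms = {
    "server1->server2": (0.111, 0.135, 0.181),
    "server2->server3": (0.081, 0.102, 0.126),
    "server1->server3": (0.122, 0.311, 0.846),
}

DELAY_US = 150.0  # the "delay 150e-6" value from the configuration above

def links_over(threshold_us, column):
    """Links whose RTT in the given column (0=min, 1=avg, 2=max) exceeds the threshold."""
    return sorted(
        link for link, stats in rtt_ms.items()
        if stats[column] * 1000.0 > threshold_us
    )

avg_over = links_over(DELAY_US, 1)
max_over = links_over(DELAY_US, 2)
print("average RTT above 150 us:", avg_over)
print("maximum RTT above 150 us:", max_over)
```

The arping round trips include software latency on both hosts, so these numbers are probably pessimistic compared with what hardware timestamping would see.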

Then 'systemctl status timemaster' shows everything running normally:

CGroup: /system.slice/timemaster.service
├─53151 /usr/sbin/timemaster -f /etc/linuxptp/timemaster.conf
├─53153 /usr/sbin/chronyd -n -f /var/run/timemaster/chrony.conf
├─53156 /usr/sbin/ptp4l -l 5 -f /var/run/timemaster/ptp4l.0.conf -H -i eth0
├─53157 /usr/sbin/phc2sys -E linreg -a -r -R 1.00 -z /var/run/timemaster/ptp4l.0.socket -t [0:eth0] -n 0 -E ntpshm -M 0
├─53158 /usr/sbin/ptp4l -l 5 -f /var/run/timemaster/ptp4l.1.conf -H -i eth2
├─53160 /usr/sbin/phc2sys -E linreg -a -r -R 1.00 -z /var/run/timemaster/ptp4l.1.socket -t [1:eth2] -n 1 -E ntpshm -M 1
├─53161 /usr/sbin/ptp4l -l 5 -f /var/run/timemaster/ptp4l.2.conf -H -i eth3
└─53163 /usr/sbin/phc2sys -E linreg -a -r -R 1.00 -z /var/run/timemaster/ptp4l.2.socket -t [1:eth3] -n 1 -E ntpshm -M 2


However, PTP doesn't work unless I manually start ptp4l on the other servers, 
using specifically created systemd units to manage the separate interfaces 
and master/slave options. 'chronyc sources' shows:

210 Number of sources = 6
MS Name/IP address         Stratum Poll Reach LastRx Last sample
===============================================================================
#? PTP0                          0   2     0     -     +0ns[   +0ns] +/-    0ns
#? PTP1                          0   2     0     -     +0ns[   +0ns] +/-    0ns
#? PTP2                          0   2     0     -     +0ns[   +0ns] +/-    0ns
(...)


Based on my understanding of timemaster, this is because it hardcodes 
"slaveOnly 1" in the configuration files it generates in 
/var/run/timemaster/ptp4l.?.conf, and adding SlaveOnly 0 to the [ptp4l.conf] 
section of /etc/linuxptp/timemaster.conf doesn't help (both SlaveOnly 0 and 
SlaveOnly 1 are then present in the generated file). With the default 
configuration, all the servers are slave-only and nobody ever becomes a master.

To fix that, I created custom systemd units instead of using timemaster. For 
example, on server3 I use:
ExecStart=/usr/sbin/phc2sys -w -z /var/run/ptp4l.%i.socket -s %i
(...)
ExecStart=/usr/sbin/ptp4l -f /etc/linuxptp/ptp4l.conf --uds_address=/var/run/ptp4l.%i.socket -i %i
Here ptp4l.conf is the vanilla Debian configuration file with only clock_servo 
linreg added.

Then on server1, I can get 'chronyc sources' to at least see the PTP source on 
eth3 using a very vanilla timemaster.conf:

[ptp_domain 0]
interfaces eth0
delay 150e-6

[ptp_domain 1]
interfaces eth2
ptp4l_option clock_servo linreg
delay 150e-6

[ptp_domain 1]
interfaces eth3
ptp4l_option clock_servo linreg
delay 150e-6

[timemaster]
ntp_program chronyd

[chrony.conf]
include /etc/chrony.conf

[ntp.conf]
includefile /etc/ntp.conf

[ptp4l.conf]

[chronyd]
path /usr/sbin/chronyd

[ntpd]
path /usr/sbin/ntpd
options -u ntp:ntp -g

[phc2sys]
path /usr/sbin/phc2sys
options -E linreg

[ptp4l]
path /usr/sbin/ptp4l

I just need to start the services manually on server3:

systemctl start ptp4l@eth1
systemctl start phc2sys@eth1


Yet if I do that, the "#x" indicates that chronyc considers the PTP source a 
falseticker:
#x PTP2                          0   2    77     3   +97.9s[ +97.9s] +/-   44ms

I can't fault it for thinking that, since there is a 98 second offset (!!)
If I wait a bit, the accuracy improves, but not the offset:
#x PTP2                          0   2   377     6   +97.9s[ +97.9s] +/- 5442us
(...)
#x PTP2                          0   2   377     4   +97.9s[ +97.9s] +/-  379us

(...)

#x PTP2                          0   2   177     5   +98.0s[ +98.0s] +/-   82us

I believe I am doing several things wrong. For example, I should set the PHC 
from the system time instead of doing the opposite, but I have not found how 
to do that with timemaster. I did some more reading, but I could not find 
anything close to what I need except maybe https://github.com/not1337/pps-stuff 
which uses a GPS to serve the time by PTP. Based on that, instead of the systemd 
phc2sys@eth1 unit, I run a "sys2phc" script on server3:
phc2sys -s CLOCK_REALTIME -c eth1 -O 0 -R 10 -N 2 -E linreg -L 50000000 -n 0 -q -m

Then server1 sees the time jump back and forth on the PTP source, for example:
phc2sys[4656203.109]: [1:eth3] eth3 sys offset 224 s2 freq -5704 delay 12840
phc2sys[4656203.209]: [1:eth3] eth3 sys offset -38178389 s2 freq -24180511 delay 5408
phc2sys[4656203.309]: [1:eth3] clockcheck: clock jumped backward or running slower than expected!
phc2sys[4656203.309]: [1:eth3] eth3 sys offset -35789106 s0 freq -24180511 delay 12946
phc2sys[4656203.410]: [1:eth3] eth3 sys offset -33367774 s0 freq -24180511 delay 13232
phc2sys[4656203.510]: [1:eth3] eth3 sys offset -30946905 s0 freq -24180511 delay 13396
phc2sys[4656203.610]: [1:eth3] eth3 sys offset -28525849 s2 freq -285263527 delay 13355
phc2sys[4656203.710]: [1:eth3] eth3 sys offset 36575 s2 freq +360178 delay 16503
phc2sys[4656203.810]: [1:eth3] eth3 sys offset 4576 s2 freq +22734 delay 12835
phc2sys[4656203.910]: [1:eth3] eth3 sys offset 1830 s2 freq +18307 delay 12880
phc2sys[4656204.010]: [1:eth3] eth3 sys offset -523 s2 freq +12767 delay 12760

I tried killing the phc2sys and ptp4l processes started by timemaster on 
server1 and replacing them with similar scripts, but it doesn't help.

At this point, I am stuck. I can't get PTP to work.

Could someone please help me with the configuration to at least have PTP 
working in UDP?

I would also be interested in replacing UDPv4 with L2 since it may help with the 
accuracy, and in any other tuning that could help the relative time accuracy 
inside the cluster (and the time it takes to synchronize to the cluster after 
a reboot).

Thanks
