Hi,

The device is a:
  00:14.0 Ethernet controller: Intel Corporation Ethernet Connection I354 (rev 03)

Using
  bash-4.3# ethtool -i ma2
  driver: igb
  version: 5.3.0-k
  firmware-version: 0.0.0
  expansion-rom-version:
  bus-info: 0000:00:14.1
  supports-statistics: yes
  supports-test: yes
  supports-eeprom-access: yes
  supports-register-dump: yes
  supports-priv-flags: no

And:
 # ethtool -T ma2
 Time stamping parameters for ma2:
 Capabilities:
   hardware-transmit     (SOF_TIMESTAMPING_TX_HARDWARE)
   software-transmit     (SOF_TIMESTAMPING_TX_SOFTWARE)
   hardware-receive      (SOF_TIMESTAMPING_RX_HARDWARE)
   software-receive      (SOF_TIMESTAMPING_RX_SOFTWARE)
   software-system-clock (SOF_TIMESTAMPING_SOFTWARE)
   hardware-raw-clock    (SOF_TIMESTAMPING_RAW_HARDWARE)
 PTP Hardware Clock: 1
 Hardware Transmit Timestamp Modes:
   off                   (HWTSTAMP_TX_OFF)
   on                    (HWTSTAMP_TX_ON)
 Hardware Receive Filter Modes:
   none                  (HWTSTAMP_FILTER_NONE)
   all                   (HWTSTAMP_FILTER_ALL)


The config is:

[global]
slaveOnly 1
summary_interval 6
priority1 255

[ma2]

And running as:
   /usr/sbin/ptp4l -f /etc/ptp4l.conf
   /usr/sbin/phc2sys -a -r -u 64 -n 5

We are running version 1.8, downloaded from the SourceForge mirror. It's
built with OpenEmbedded/BitBake, and their recipe defines some extra
CFLAGS. I can look into why these were deemed necessary and whether they
could affect anything:
EXTRA_OEMAKE = "'CFLAGS=-D_GNU_SOURCE -DHAVE_CLOCK_ADJTIME -DHAVE_POSIX_SPAWN -DHAVE_ONESTEP_SYNC'"

I will look into obtaining more verbose logs.

For what it's worth, this exact same setup works elsewhere; it is just this
one physical setup that exhibits the problem, although it is unclear whether
the cause is a physical fault or something about the network/master outside.

Additionally, since Ian brought it up:
   a) We do sometimes see tx timestamp timeouts too
   b) We also occasionally see UNEXPECTED_SYSWRAP messages from igb
My understanding is that b) is an Intel bug on this HW (bad per-device
assumptions made in the code regarding the default state of the PPS IRQ) and
seems to be generally treated as benign. I do have a slight suspicion that
a) and b) may be somehow related (backing out of the unexpected wrap IRQ
'forgets' to notice that a tx timestamp is ready?) but I have some digging
to do on that front.

I currently expect (although happy to be proven wrong) that both a) and b)
are unrelated to the clockcheck jumps, since a) and b) happen readily and
don't affect sync *too* badly, whereas the constant clockcheck aborts happen
only in one place and are apparently disastrous to sync quality.
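
In case it helps anyone check the same thing against their own logs, here is
a minimal sketch (not something we run today) that pulls the tx timestamp
timeouts and clockcheck warnings out of ptp4l log lines and reports which
clockcheck jumps were preceded by a timeout. The message substrings are the
ones in the log Ian quoted below; the function names and the 30 second
window are my own assumptions for illustration.

```python
import re

# ptp4l prints a monotonic timestamp in square brackets, e.g. "[561.387]".
STAMP = re.compile(r"\[(\d+\.\d+)\]:?\s*(.*)")

# Substrings taken from the ptp4l messages quoted in this thread.
EVENTS = {
    "tx_timeout": "timed out while polling for tx timestamp",
    "clockcheck": "clockcheck: clock jumped",
}


def classify(lines):
    """Return (timestamp_seconds, event_name) for each recognised log line."""
    found = []
    for line in lines:
        match = STAMP.search(line)
        if not match:
            continue
        stamp, text = float(match.group(1)), match.group(2)
        for name, needle in EVENTS.items():
            if needle in text:
                found.append((stamp, name))
    return found


def jumps_after_timeout(events, window=30.0):
    """Split clockcheck timestamps into (preceded, isolated).

    A jump counts as "preceded" if a tx timeout was logged at most
    `window` seconds before it. The window is an arbitrary assumption.
    """
    timeouts = [t for t, name in events if name == "tx_timeout"]
    preceded, isolated = [], []
    for t, name in events:
        if name != "clockcheck":
            continue
        if any(0.0 <= t - s <= window for s in timeouts):
            preceded.append(t)
        else:
            isolated.append(t)
    return preceded, isolated
```

Feeding it the lines of a saved log (one message per line, not wrapped as in
the mail below) and comparing the two lists should show whether the jumps
on the bad setup only ever follow a timeout or also occur on their own.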

Cheers, and thanks for your replies,
Dave

On Wed, Apr 5, 2017 at 1:45 AM, Ian Thompson <ian.thomp...@pgs.com> wrote:

> Possibly following on from David’s post.
>
>
>
> We have a system with 18 boards in a rack, each board has a Altera SoC
> with the STM Ethernet MAC connected via gigabit Ethernet to an Arista
> ptp-aware switch and then a Spectracom GrandMaster.
>
> The boards are running Linux kernel 3.15.0.
>
>
>
> They lock quickly after boot and can remain locked for several hours but
> usually any one of the boards may do the following …
>
>
>
> Apr  4 13:42:04 localhost user.info ptp4l: [537.164] rms  123 max  599
> freq   +255 +/-  39 delay  7362 +/-  48
>
> Apr  4 13:42:29 localhost user.err ptp4l: [561.387] timed out while
> polling for tx timestamp
>
> Apr  4 13:42:29 localhost user.err ptp4l: [561.387] increasing
> tx_timestamp_timeout may correct this issue, but it is likely caused by a
> driver bug
>
> Apr  4 13:42:29 localhost user.err ptp4l: [561.387] port 1: send delay
> request failed
>
> Apr  4 13:42:29 localhost user.notice ptp4l: [561.387] port 1: SLAVE to
> FAULTY on FAULT_DETECTED (FT_UNSPECIFIED)
>
> Apr  4 13:42:45 localhost user.notice ptp4l: [577.388] port 1: FAULTY to
> LISTENING on FAULT_CLEARED
>
> Apr  4 13:42:45 localhost user.warn ptp4l: [577.414] clockcheck: clock
> jumped backward or running slower than expected!
>
> Apr  4 13:42:45 localhost user.notice ptp4l: [577.414] port 1: new foreign
> master 000cec.fffe.0a085d-1
>
> Apr  4 13:42:47 localhost user.notice ptp4l: [579.414] selected best
> master clock 000cec.fffe.0a085d
>
> Apr  4 13:42:47 localhost user.notice ptp4l: [579.414] port 1: LISTENING
> to UNCALIBRATED on RS_SLAVE
>
> Apr  4 13:42:54 localhost user.notice ptp4l: [587.164] port 1:
> UNCALIBRATED to SLAVE on MASTER_CLOCK_SELECTED
>
> Apr  4 13:46:46 localhost user.info ptp4l: [818.414] rms 2312500092 max
> 37000001557 freq   +246 +/- 250 delay  7358 +/-  46
>
> Apr  4 13:51:02 localhost user.info ptp4l: [1074.413] rms  116 max  681
> freq   +256 +/-  48 delay  7373 +/-  88
>
>
>
> Does this imply that one lost delay request can do this, or is there a
> retry mechanism?
>
> Notice that the system recovers but we can’t afford the large timing
> glitch that gets introduced.
>
> We have a lot of traffic leaving the boards but only PTP traffic coming
> in. As we increase the off board transfer rates the problem seems to occur
> more often.
>
>
>
> Thanks for any help,
>
> Ian T.
>
>
>
>
>