Hi Jacob!

Sorry, I was mistaken: I meant Red Hat, not Fedora. The kernel version I 
am running is 3.10.0.

After the first timeout, more timeouts do indeed occur. The offset always 
jumps to 36 seconds, which makes me suspect a mistake in the handling of the 
TAI/UTC conversion. After a while of returning timeout messages, the offset 
just keeps drifting further and further away.
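
A rough sketch of how that suspicion could be checked: if a reported offset 
sits within a small tolerance of a whole number of seconds, a timescale 
(TAI/UTC) mix-up looks more plausible than ordinary drift. The function name, 
the nanosecond input and the 1 ms tolerance are my own assumptions for 
illustration; none of this is part of linuxptp.

```python
NS_PER_SEC = 1_000_000_000


def whole_second_step(offset_ns, tolerance_ns=1_000_000):
    """Return the nearest whole number of seconds if offset_ns lies within
    tolerance_ns of a nonzero integer number of seconds, else None.

    Hypothetical helper: the 1 ms default tolerance is an arbitrary choice.
    """
    seconds = round(offset_ns / NS_PER_SEC)
    if seconds != 0 and abs(offset_ns - seconds * NS_PER_SEC) <= tolerance_ns:
        return seconds
    return None
```

An offset that lands on a whole second returns that number of seconds, while 
an offset that is merely drifting returns None.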

Jord

On 7 Aug 2018, at 20:41, Keller, Jacob E <jacob.e.kel...@intel.com> wrote:

-----Original Message-----
From: Jord Pool [mailto:jord.p...@outlook.com]
Sent: Tuesday, August 07, 2018 2:21 AM
To: Richard Cochran <richardcoch...@gmail.com>
Cc: Keller, Jacob E <jacob.e.kel...@intel.com>; Cliff Spradlin
<csprad...@waymo.com>; Chris Caudle <ch...@chriscaudle.org>; Cliff Spradlin via
Linuxptp-users <linuxptp-users@lists.sourceforge.net>
Subject: Re: PXE Boot PTP Issues

Hi Richard,

It is not PXE per se, but network load in general. When other servers are PXE
booting, the PXE boot server, which runs as a PTP slave, sends a high load of
network traffic out to the servers that are about to boot through PXE.

This high network load causes the PTP slave instance to return the message
that suggests either increasing the tx_timestamp_timeout value or a driver bug.

To be sure it has nothing to do with PXE specifically: when an .iso file of
~5 GB is copied over the Ethernet connection at the maximum gigabit speed of
about 120 MB/s, the PTP slave instance stops and returns the same
tx_timestamp_timeout message. This clearly indicates that high network load
causes PTP to stop working, at least with the e1000e driver.

The weird part is that PTP no longer recovers after being put on hold for a
minute when the tx_timestamp_timeout message appears. This completely defeats
the purpose of synchronising time: when network load increases, the
synchronisation process stops and the clock only drifts further away instead
of re-synchronising.

The driver is e1000e version 3.2.6, which is the default in Fedora 22. I have
also tried versions 3.4.0.2 and 3.4.1.1 of the e1000e driver, but they don't
seem to work either.

Jord



Oh! Hmm. This sounds suspiciously familiar... What version of the kernel 
is your Fedora running? I think I recall a fix upstream that might be 
related, and it's quite possible the team that owns the SourceForge driver 
never released the fix into that driver.

The fix wasn't released until 4.13; it's commit 5012863b7347 ("e1000e: fix race 
condition around skb_tstamp_tx()", 2017-06-06).

I don't know for sure if this fix would resolve your issue or not, but it seems 
related. The way timestamps were handled, there was a race such that we would 
ignore some timestamp requests from the application.

What's the exact behavior you see after the first timeout? Do you keep seeing 
more timeouts? I'm curious what other behavior you see. You might also check 
the ethtool stats to see if any of the timestamp statistics are incrementing, 
as this might help indicate the problem.
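
A minimal sketch of that check, assuming the timestamp-related counters in the 
`ethtool -S` output contain "tstamp" in their names (the exact names depend on 
the driver version, so treat the keyword as a guess). The helper names and the 
interface name in the usage note are placeholders of mine.

```python
import subprocess


def parse_ethtool_stats(text):
    """Parse `ethtool -S` style output into a {counter_name: int} dict."""
    stats = {}
    for line in text.splitlines():
        name, sep, value = line.rpartition(":")
        if not sep:
            continue
        try:
            stats[name.strip()] = int(value.strip())
        except ValueError:
            continue  # e.g. the "NIC statistics:" header has no value
    return stats


def timestamp_deltas(before, after, keyword="tstamp"):
    """Return counters whose name contains `keyword` and whose value grew."""
    return {
        name: after[name] - before.get(name, 0)
        for name in after
        if keyword in name and after[name] > before.get(name, 0)
    }


def read_stats(interface):
    """Run `ethtool -S <interface>` and parse it (requires ethtool)."""
    result = subprocess.run(
        ["ethtool", "-S", interface],
        check=True, capture_output=True, text=True,
    )
    return parse_ethtool_stats(result.stdout)
```

Calling `read_stats("eth0")` once before the load test and once after the 
first timeout, then passing both results to `timestamp_deltas`, would list 
only the timestamp counters that moved.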

It's possible there is still some race condition in that driver that is causing 
failures to cascade.

Thanks,
Jake

