Daniel Braniss wrote:
>
> > On 24 Aug 2015, at 10:22, Hans Petter Selasky <[email protected]> wrote:
> >
> > On 08/24/15 01:02, Rick Macklem wrote:
> >> The other thing is the degradation seems to cut the rate by about half
> >> each time.
> >> 300-->150-->70 I have no idea if this helps to explain it.
> >
> > Might be a NUMA binding issue for the processes involved.
> >
> > man cpuset
> >
> > --HPS
>
> I can’t see how this is relevant, given that the same host, using the
> mellanox/mlxen
> behave much better.
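(For completeness, Hans's cpuset suggestion would be tried roughly along
these lines; the pid, CPU list, and nfsd arguments below are purely
illustrative, and the right CPUs are the ones on the NUMA domain nearest
the NIC -- see "man cpuset":)

```shell
# Hypothetical example of NUMA/CPU binding with cpuset(1) on FreeBSD.
# The pid and CPU list are placeholders.

# Show the current CPU binding of a process (pid is illustrative):
cpuset -g -p 1234

# Re-bind that process to CPUs 0-3:
cpuset -l 0-3 -p 1234

# Or start a program already pinned to those CPUs:
cpuset -l 0-3 /usr/sbin/nfsd -u -t -n 8
```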
Well, the "ix" driver has a bunch of tunables for things like "number of
queues" and, although I'll admit I don't understand how these queues are
used, I think they are related to CPUs and their caches. There is also
something called IXGBE_FDIR, which others have recommended be disabled.
(The code is #ifdef IXGBE_FDIR, but I don't know if it is defined for
your kernel?) There are also tunables for the interrupt rate and
something called hw.ixgbe_tx_process_limit, which appears to limit the
number of packets processed per transmit pass, or something like that?
(I suspect Hans would understand this stuff much better than I do, since
I don't understand it at all. ;-)
At a glance, the mellanox driver looks very different.
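(As a hedged illustration only: tunables like these are normally set at
boot from /boot/loader.conf. The names below match some FreeBSD releases
but not all, so verify them on your system before relying on any of
them:)

```shell
# /boot/loader.conf -- illustrative ix(4) tunables only; the exact names
# are version-dependent, so check "sysctl -a | grep ix" first.
hw.ix.num_queues="4"            # cap the number of RX/TX queue pairs
hw.ix.max_interrupt_rate="8000" # limit interrupts per second per queue
hw.ix.tx_process_limit="256"    # packets handled per TX cleanup pass
```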
> I’m getting different results with the intel/ix depending who is the nfs
> server
>
Who knows until you figure out what is actually going on. It could just be the
timing of
handling the write RPCs or when the different servers send acks for the TCP
segments or ...
that causes this for one server and not another.
One of the principles used when investigating airplane accidents is to
"never assume anything" and just try to collect the facts until the
pieces of the puzzle fall into place. I think the same principle works
for this kind of stuff.
I once had a case where a specific read of one NFS file would fail on certain
machines.
I won't bore you with the details, but after weeks we got to the point where we
had a lab
of identical machines (exactly the same hardware and exactly the same software
loaded on them)
and we could reproduce this problem on about half the machines and not the
other half. We
(myself and the guy I worked with) finally noticed the failing machines were on
network ports
for a given switch. We moved the net cables to another switch and the problem
went away.
--> This particular network switch was broken in such a way that it would
garble one specific
packet consistently, but worked fine for everything else.
My point here is that, if someone had suggested the "network switch might be
broken" at the
beginning of investigating this, I would have probably dismissed it, based on
"the network is
working just fine", but in the end, that was the problem.
--> I am not suggesting you have a broken network switch, just "don't take
anything off the
table until you know what is actually going on".
And to be honest, you may never know, but it is fun to try and solve these
puzzles.
Beyond what I already suggested, I'd look at the "ix" driver's stats and
tunables and
see if any of the tunables has an effect. (And, yes, it will take time to work
through these.)
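(A rough sketch of how to dump those stats and tunables with sysctl(8);
the OID names vary between releases, so grep rather than assume a
particular name:)

```shell
# Everything the driver exports for the first ix interface, including
# the per-queue packet/error counters:
sysctl dev.ix.0

# Loader tunables and global knobs (names are version-dependent):
sysctl -a | grep -i ixgbe
```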
Good luck with it, rick
>
> danny
>
> _______________________________________________
> [email protected] mailing list
> https://lists.freebsd.org/mailman/listinfo/freebsd-stable
> To unsubscribe, send any mail to "[email protected]"