Bug#864642: vmxnet3: Reports suspect GRO implementation on vSphere hosts / one VM crashes

2017-08-11 Thread Sven Hartge
On 10.08.2017 15:09, Andrew Moore wrote:

> Both of those reports were me. I suspect the issue may be isolated to
> the HPE custom implementation of the ESXi 6.5u1 build. I haven't seen
> any similar reports of people using the vanilla 6.5u1 build.

Not surprising. It wouldn't be the first time HPE horribly botched their
ESX custom ISOs. (Which is the prime reason I don't *ever* use custom
vendor ISOs from any vendor in the first place.)

> Interestingly none of the fixes that have been discussed work with this
> build either. This includes disabling the rx-mini buffer (# ethtool -G
>  rx-mini 0) and adding vmxnet3.rev.30 = FALSE to the VMs vmx
> file.

Very strange, indeed.

> The only way I've managed to restore stability is by removing vmxnet3
> out of the equation completely and changing to the e1000 NIC type.

Using a HW version lower than 13 should also help.


Unfortunately, the sample size of people reporting failure or success is
very small at this time, so no conclusive result can be drawn, I am afraid.

Regards,
Sven.





Bug#864642: vmxnet3: Reports suspect GRO implementation on vSphere hosts / one VM crashes

2017-08-10 Thread Andrew Moore
On Tue, 8 Aug 2017 11:38:06 +0200 (CEST) Sven Hartge wrote:
> At 16:22 on 03.08.17, Sven Hartge wrote:
> > On 03.08.2017 15:34, Patrick Matthäi wrote:
> >> On 16.07.2017 at 23:42, Ben Hutchings wrote:
> >>> On Thu, 2017-07-06 at 21:50 +0200, Sven Hartge wrote:
>
> >>>>> Could this be https://bugzilla.kernel.org/show_bug.cgi?id=191201 ?
> >>> Note that this has been root-caused as a bug in the virtual device, not
> >>> the driver.  (Though it would be nice if the driver could work around
> >>> it.)
> >
> >> I can confirm, that the VMs do not crash anymore with vSphere 6.5 build
> >> 5969303 from 27.07.2017, that is why I lowered the severity.
> >
> > This is the version from 6.5u1, right?
> >
> > Still: Stretch is basically unusable with HW13 on ESX6.5 below Update1.
>
> Hmm. There are discussions on Reddit right now indicating the bug still
> occurs even with the latest ESXi6.5u1 (Build 5969303).
>
> https://www.reddit.com/r/homelab/comments/6s5dh6/debian_9_on_esxi_65u1_complete_lockup/
>
> One of the latest comments on the Kernel Bugzilla shows the same:
>
> https://bugzilla.kernel.org/show_bug.cgi?id=191201#c54
>
> (For me, this is really frustrating right now, since I waited until
> ESX6.5u1 before updating my infrastructure and now it seems I have to push
> this update even farther into the future because of this critical blocker
> bug.)
>
> I really wonder what could be done on the kernel side to avoid the
> problem, since only newer kernels are affected while older ones don't show
> the problem.
>
> Regards,
> Sven.
>
>
Hi Sven,

Both of those reports were me. I suspect the issue may be isolated to the
HPE custom implementation of the ESXi 6.5u1 build. I haven't seen any
similar reports of people using the vanilla 6.5u1 build.

Interestingly, none of the fixes that have been discussed work with this
build either. This includes disabling the rx-mini buffer (# ethtool -G
 rx-mini 0) and adding vmxnet3.rev.30 = FALSE to the VM's vmx
file.

The only way I've managed to restore stability is by removing vmxnet3 out
of the equation completely and changing to the e1000 NIC type.

Thanks,
Andrew


Bug#864642: vmxnet3: Reports suspect GRO implementation on vSphere hosts / one VM crashes

2017-08-08 Thread Sven Hartge
At 16:22 on 03.08.17, Sven Hartge wrote:
> On 03.08.2017 15:34, Patrick Matthäi wrote:
>> On 16.07.2017 at 23:42, Ben Hutchings wrote:
>>> On Thu, 2017-07-06 at 21:50 +0200, Sven Hartge wrote:
 
>>>>> Could this be https://bugzilla.kernel.org/show_bug.cgi?id=191201 ?
>>> Note that this has been root-caused as a bug in the virtual device, not
>>> the driver.  (Though it would be nice if the driver could work around
>>> it.)
> 
>> I can confirm, that the VMs do not crash anymore with vSphere 6.5 build
>> 5969303 from 27.07.2017, that is why I lowered the severity.
> 
> This is the version from 6.5u1, right?
> 
> Still: Stretch is basically unusable with HW13 on ESX6.5 below Update1.

Hmm. There are discussions on Reddit right now indicating the bug still 
occurs even with the latest ESXi6.5u1 (Build 5969303).

https://www.reddit.com/r/homelab/comments/6s5dh6/debian_9_on_esxi_65u1_complete_lockup/

One of the latest comments on the Kernel Bugzilla shows the same:

https://bugzilla.kernel.org/show_bug.cgi?id=191201#c54

(For me, this is really frustrating right now, since I waited until 
ESX6.5u1 before updating my infrastructure and now it seems I have to push 
this update even farther into the future because of this critical blocker 
bug.)

I really wonder what could be done on the kernel side to avoid the 
problem, since only newer kernels are affected while older ones don't show 
the problem.

Regards,
Sven.



Bug#864642: vmxnet3: Reports suspect GRO implementation on vSphere hosts / one VM crashes

2017-08-03 Thread Sven Hartge
On 03.08.2017 15:34, Patrick Matthäi wrote:
> On 16.07.2017 at 23:42, Ben Hutchings wrote:
>> On Thu, 2017-07-06 at 21:50 +0200, Sven Hartge wrote:

>>>> Could this be https://bugzilla.kernel.org/show_bug.cgi?id=191201 ?
>> Note that this has been root-caused as a bug in the virtual device, not
>> the driver.  (Though it would be nice if the driver could work around
>> it.)

> I can confirm, that the VMs do not crash anymore with vSphere 6.5 build
> 5969303 from 27.07.2017, that is why I lowered the severity.

This is the version from 6.5u1, right?

Still: Stretch is basically unusable with HW13 on ESX6.5 below Update1.

Regards,
Sven.







Processed: Re: Bug#864642: vmxnet3: Reports suspect GRO implementation on vSphere hosts / one VM crashes

2017-08-03 Thread Debian Bug Tracking System
Processing commands for cont...@bugs.debian.org:

> severity #864642 normal
Bug #864642 [src:linux] vmxnet3: Reports suspect GRO implementation on vSphere 
hosts / one VM crashes
Severity set to 'normal' from 'important'
> thanks
Stopping processing here.

Please contact me if you need assistance.
-- 
864642: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=864642
Debian Bug Tracking System
Contact ow...@bugs.debian.org with problems



Bug#864642: vmxnet3: Reports suspect GRO implementation on vSphere hosts / one VM crashes

2017-08-03 Thread Patrick Matthäi
severity #864642 normal
thanks


On 16.07.2017 at 23:42, Ben Hutchings wrote:
> Control: tag -1 moreinfo
>
> Sven asked this, but forgot to add you to the recipients:
>
> On Thu, 2017-07-06 at 21:50 +0200, Sven Hartge wrote:
>> Hi!
>>
>>> Could this be https://bugzilla.kernel.org/show_bug.cgi?id=191201 ?
> Note that this has been root-caused as a bug in the virtual device, not
> the driver.  (Though it would be nice if the driver could work around
> it.)
>
> Ben.

I can confirm that the VMs no longer crash with vSphere 6.5 build
5969303 from 27.07.2017, which is why I lowered the severity.

But we still have the issue with "Driver has suspect GRO
implementation, TCP performance may be compromised", and the fact that
4.9.18-1 did not crash and does not show this message, while 4.9.30-1
crashed and showed the message.

-- 
/*
With kind regards,
 Patrick Matthäi
 GNU/Linux Debian Developer

  Blog: http://www.linux-dev.org/
E-Mail: pmatth...@debian.org
patr...@linux-dev.org
*/






Bug#864642: vmxnet3: Reports suspect GRO implementation on vSphere hosts / one VM crashes

2017-07-16 Thread Ben Hutchings
Control: tag -1 moreinfo

Sven asked this, but forgot to add you to the recipients:

On Thu, 2017-07-06 at 21:50 +0200, Sven Hartge wrote:
> Hi!
> 
> > Could this be https://bugzilla.kernel.org/show_bug.cgi?id=191201 ?

Note that this has been root-caused as a bug in the virtual device, not
the driver.  (Though it would be nice if the driver could work around
it.)

Ben.

> Try the following, from comment 37 
> https://bugzilla.kernel.org/show_bug.cgi?id=191201#c37
> 
> > In the meantime, suggested workaround:
> >  - disable rx data ring: ethtool -G eth? rx-mini 0
> 
> Also adding "vmxnet3.rev.30 = FALSE" to the vmx file of the VM seems to 
> be needed. https://bugzilla.kernel.org/show_bug.cgi?id=191201#c40
> 
> Also: Which hardware version are you running? It is v10 for me (highest 
> for ESX5.5)

-- 
Ben Hutchings
If the facts do not conform to your theory, they must be disposed of.





Processed: Re: Bug#864642: vmxnet3: Reports suspect GRO implementation on vSphere hosts / one VM crashes

2017-07-16 Thread Debian Bug Tracking System
Processing control commands:

> tag -1 moreinfo
Bug #864642 [src:linux] vmxnet3: Reports suspect GRO implementation on vSphere 
hosts / one VM crashes
Added tag(s) moreinfo.

-- 
864642: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=864642
Debian Bug Tracking System
Contact ow...@bugs.debian.org with problems



Bug#864642: vmxnet3: Reports suspect GRO implementation on vSphere hosts / one VM crashes

2017-07-13 Thread Patrick Matthäi
forwarded #864642 https://bugzilla.kernel.org/show_bug.cgi?id=191201
thanks


On 11.07.2017 at 10:24, Patrick Matthäi wrote:
> found #864642 4.9.30-2+deb9u2
> thanks
>
> And it still crashes..
>
>
> On 22.06.2017 at 14:58, Patrick Matthäi wrote:
>> found #864642 4.9.30-2+deb9u1
>> thanks
>>
>>
>> On 12.06.2017 at 10:02, Patrick Matthäi wrote:
>>> Package: src:linux
>>> Version: 4.9.30-1
>>> Severity: important
>>> File: linux
>>>
>>> Dear Maintainer,
>>>
>>> *** Reporter, please consider answering these questions, where
>>> appropriate ***
>>>
>>> Since updating the kernel from linux-image-4.9.0-2-amd64 (4.9.18-1) to
>>> linux-image-4.9.0-3-amd64 (4.9.30-1) all VMs report - just for the
>>> "primary" interface this:
>>>
>>> TCP: ens192: Driver has suspect GRO implementation, TCP performance may
>>> be compromised.
>>>
>>> I can't see any performance impact. This happens on all our vSphere 6.0
>>> and 6.5 hosts (running on HPE ProLiant DL 360 G8 - G9 HW / ProLiant ML
>>> 350 G9 and so on).
>>>
>>> Why is this bug important? Because on one VM this also produces a kernel
>>> panic after some time (minutes or hours). I just could get the panic
>>> attached as screenshot. The only "big" difference between the crashing
>>> host and the others may be, that it is also running PM2, NPM, NodeJS and
>>> a NFS kernel server.
>>>
>>> If I boot the VM with 4.9.30-1 and deactivate gro and lro with:
>>> ethtool -K ens192 gro off
>>> ethtool -K ens192 lro off
>>> .. it does not crash.
>>>
>>> Booting 4.9.18-1 and everything is completely fine ;)
>>>
>> The VM keeps on crashing after a few hours
>>

-- 
/*
With kind regards,
 Patrick Matthäi
 GNU/Linux Debian Developer

  Blog: http://www.linux-dev.org/
E-Mail: pmatth...@debian.org
patr...@linux-dev.org
*/



Bug#864642: vmxnet3: Reports suspect GRO implementation on vSphere hosts / one VM crashes

2017-07-11 Thread Patrick Matthäi
found #864642 4.9.30-2+deb9u2
thanks

And it still crashes.


On 22.06.2017 at 14:58, Patrick Matthäi wrote:
> found #864642 4.9.30-2+deb9u1
> thanks
>
>
> On 12.06.2017 at 10:02, Patrick Matthäi wrote:
>> Package: src:linux
>> Version: 4.9.30-1
>> Severity: important
>> File: linux
>>
>> Dear Maintainer,
>>
>> *** Reporter, please consider answering these questions, where
>> appropriate ***
>>
>> Since updating the kernel from linux-image-4.9.0-2-amd64 (4.9.18-1) to
>> linux-image-4.9.0-3-amd64 (4.9.30-1) all VMs report - just for the
>> "primary" interface this:
>>
>> TCP: ens192: Driver has suspect GRO implementation, TCP performance may
>> be compromised.
>>
>> I can't see any performance impact. This happens on all our vSphere 6.0
>> and 6.5 hosts (running on HPE ProLiant DL 360 G8 - G9 HW / ProLiant ML
>> 350 G9 and so on).
>>
>> Why is this bug important? Because on one VM this also produces a kernel
>> panic after some time (minutes or hours). I just could get the panic
>> attached as screenshot. The only "big" difference between the crashing
>> host and the others may be, that it is also running PM2, NPM, NodeJS and
>> a NFS kernel server.
>>
>> If I boot the VM with 4.9.30-1 and deactivate gro and lro with:
>> ethtool -K ens192 gro off
>> ethtool -K ens192 lro off
>> .. it does not crash.
>>
>> Booting 4.9.18-1 and everything is completely fine ;)
>>
> The VM keeps on crashing after a few hours
>

-- 
/*
With kind regards,
 Patrick Matthäi
 GNU/Linux Debian Developer

  Blog: http://www.linux-dev.org/
E-Mail: pmatth...@debian.org
patr...@linux-dev.org
*/



Bug#864642: vmxnet3: Reports suspect GRO implementation on vSphere hosts / one VM crashes

2017-07-06 Thread Sven Hartge
Hi!

Could this be https://bugzilla.kernel.org/show_bug.cgi?id=191201 ?

Try the following, from comment 37 
https://bugzilla.kernel.org/show_bug.cgi?id=191201#c37

| In the meantime, suggested workaround:
|  - disable rx data ring: ethtool -G eth? rx-mini 0

Also adding "vmxnet3.rev.30 = FALSE" to the vmx file of the VM seems to 
be needed. https://bugzilla.kernel.org/show_bug.cgi?id=191201#c40

Also: Which hardware version are you running? It is v10 for me (highest 
for ESX5.5)

Regards,
Sven.
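
The rx-mini workaround quoted above can be wrapped in a small helper. This is a minimal sketch and was not posted in the thread: the function name rx_mini_workaround and the DRY_RUN variable are made up for illustration, and only the "ethtool -G ... rx-mini 0" invocation comes from the Bugzilla comment.

```shell
# Hypothetical helper (not from the thread): apply the rx-mini workaround
# to one interface. With DRY_RUN=1 the command is only printed.
rx_mini_workaround() {
    iface="$1"
    if [ -z "$iface" ]; then
        echo "usage: rx_mini_workaround IFACE" >&2
        return 2
    fi
    if [ "${DRY_RUN:-0}" = "1" ]; then
        # Show what would be run, without touching the NIC.
        echo "ethtool -G $iface rx-mini 0"
    else
        # Set the rx-mini ring size to 0, as suggested in comment 37.
        ethtool -G "$iface" rx-mini 0
    fi
}
```

For example, "DRY_RUN=1 rx_mini_workaround ens192" prints the command instead of running it. Ring settings made with ethtool are not kept across reboots, so the call would have to be repeated at boot.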



Bug#864642: vmxnet3: Reports suspect GRO implementation on vSphere hosts / one VM crashes

2017-07-06 Thread Sven Hartge
On Mon, 12 Jun 2017 10:02:56 +0200 Patrick Matthäi wrote:

> Since updating the kernel from linux-image-4.9.0-2-amd64 (4.9.18-1) to
> linux-image-4.9.0-3-amd64 (4.9.30-1) all VMs report - just for the
> "primary" interface this:
> 
> TCP: ens192: Driver has suspect GRO implementation, TCP performance may
> be compromised.
> 
> I can't see any performance impact. This happens on all our vSphere 6.0
> and 6.5 hosts (running on HPE ProLiant DL 360 G8 - G9 HW / ProLiant ML
> 350 G9 and so on).

I see the same for my Stretch Test VMs, running on ESXi 5.5 on Dell R720.

I have yet to experience a kernel panic, but those VMs are mostly idle
and don't transfer many bytes via network, so the crash-intensity might
be related to the amount of data transmitted or the peak throughput at
some time.

Regards,
Sven.
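
To check which interfaces on a VM have logged the warning, the kernel log can be filtered for the message text. The following is a rough sketch under the assumption that the log line has the form quoted in the report ("TCP: IFACE: Driver has suspect GRO implementation, ..."); the function name gro_warning_ifaces is hypothetical.

```shell
# Hypothetical helper: read kernel log lines on stdin and print each
# interface name that reported the "suspect GRO implementation" warning.
gro_warning_ifaces() {
    # Capture the interface name between "TCP: " and ": Driver has ...",
    # then collapse repeated reports of the same interface.
    sed -n 's/.*TCP: \([^:]*\): Driver has suspect GRO implementation.*/\1/p' \
        | sort -u
}
```

A typical call would be "dmesg | gro_warning_ifaces"; empty output means the warning was not found in the current kernel log buffer.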





Bug#864642: vmxnet3: Reports suspect GRO implementation on vSphere hosts / one VM crashes

2017-06-22 Thread Patrick Matthäi
found #864642 4.9.30-2+deb9u1
thanks


On 12.06.2017 at 10:02, Patrick Matthäi wrote:
> Package: src:linux
> Version: 4.9.30-1
> Severity: important
> File: linux
>
> Dear Maintainer,
>
> *** Reporter, please consider answering these questions, where
> appropriate ***
>
> Since updating the kernel from linux-image-4.9.0-2-amd64 (4.9.18-1) to
> linux-image-4.9.0-3-amd64 (4.9.30-1) all VMs report - just for the
> "primary" interface this:
>
> TCP: ens192: Driver has suspect GRO implementation, TCP performance may
> be compromised.
>
> I can't see any performance impact. This happens on all our vSphere 6.0
> and 6.5 hosts (running on HPE ProLiant DL 360 G8 - G9 HW / ProLiant ML
> 350 G9 and so on).
>
> Why is this bug important? Because on one VM this also produces a kernel
> panic after some time (minutes or hours). I just could get the panic
> attached as screenshot. The only "big" difference between the crashing
> host and the others may be, that it is also running PM2, NPM, NodeJS and
> a NFS kernel server.
>
> If I boot the VM with 4.9.30-1 and deactivate gro and lro with:
> ethtool -K ens192 gro off
> ethtool -K ens192 lro off
> .. it does not crash.
>
> Booting 4.9.18-1 and everything is completely fine ;)
>

The VM keeps on crashing after a few hours

-- 
/*
With kind regards,
 Patrick Matthäi
 GNU/Linux Debian Developer

  Blog: http://www.linux-dev.org/
E-Mail: pmatth...@debian.org
patr...@linux-dev.org
*/
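
The GRO/LRO test from the quoted report boils down to two ethtool calls per interface. As a hedged sketch (the function name disable_gro_lro and the DRY_RUN switch are illustrative assumptions; the two "ethtool -K" commands are the ones from the report):

```shell
# Hypothetical helper: turn off GRO and LRO on one interface, as in the
# report. With DRY_RUN=1 the commands are printed instead of executed.
disable_gro_lro() {
    iface="$1"
    if [ -z "$iface" ]; then
        echo "usage: disable_gro_lro IFACE" >&2
        return 2
    fi
    for feature in gro lro; do
        if [ "${DRY_RUN:-0}" = "1" ]; then
            echo "ethtool -K $iface $feature off"
        else
            ethtool -K "$iface" "$feature" off
        fi
    done
}
```

Afterwards, "ethtool -k ens192" (lower-case k) lists the current offload settings, which makes it possible to confirm that both features are reported as off.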