[Bug 225791] ena driver causing kernel panics on AWS EC2

2020-05-12 Thread bugzilla-noreply
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=225791

--- Comment #37 from commit-h...@freebsd.org ---
A commit references this bug:

Author: mw
Date: Tue May 12 18:44:41 UTC 2020
New revision: 360985
URL: https://svnweb.freebsd.org/changeset/base/360985

Log:
  MFC r360777: Optimize ENA Rx refill for low memory conditions

  Sometimes, especially when little memory is left in the system,
  allocating mbuf jumbo clusters (like 9KB or 16KB) can take a lot of time
  and is not guaranteed to succeed. In that situation the fallback will
  work, but if the refill has to take place for a lot of descriptors at
  once, the time spent in m_getjcl looking for memory can cause system
  unresponsiveness due to the high priority of the Rx task. This can also
  lead to a driver reset, because the Tx cleanup routine is blocked and
  the timer service could detect that Tx packets aren't being cleaned up.
  The reset routine can cause further unresponsiveness: Rx rings are
  refilled there, so m_getjcl will again burn the CPU. This was causing
  NVMe driver timeouts and resets, because the network driver has higher
  priority.

  Instead of 16KB jumbo clusters for the Rx buffers, 9KB clusters are
  enough - ENA MTU is being set to 9K anyway, so it's very unlikely that
  more space than 9KB will be needed.

  However, 9KB jumbo clusters can still cause issues, so by default
  page-size mbuf clusters will be used for the Rx descriptors. This can
  have a small (~2%) impact on the throughput of the device, so to restore
  the original behavior, one must change sysctl "hw.ena.enable_9k_mbufs"
  to "1" in the "/boot/loader.conf" file.

  As a part of this patch (important fix), the version of the driver
  was updated to v2.1.2.

  Submitted by:   cperciva
  PR:             225791, 234838, 235856, 236989, 243531

Changes:
_U  stable/12/
  stable/12/sys/dev/ena/ena.c
  stable/12/sys/dev/ena/ena.h
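
For reference, the opt-in described in the log amounts to a single line in
/boot/loader.conf. The following is only a minimal Python sketch of setting
that line idempotently; the helper name set_loader_tunable is hypothetical
(it is not part of the driver or of FreeBSD), and it operates on the text of
the file rather than on the file itself.

```python
def set_loader_tunable(conf_text, name, value):
    """Return conf_text with name="value" set exactly once.

    Hypothetical helper for illustration: an existing assignment of the
    same tunable is replaced in place, otherwise the line is appended.
    """
    new_line = f'{name}="{value}"'
    out = []
    replaced = False
    for line in conf_text.splitlines():
        # Compare only the key part before the first "=".
        if line.split("=", 1)[0].strip() == name:
            if not replaced:
                out.append(new_line)
                replaced = True
            # Later duplicates of the same tunable are dropped.
            continue
        out.append(line)
    if not replaced:
        out.append(new_line)
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    current = 'autoboot_delay="3"\n'
    print(set_loader_tunable(current, "hw.ena.enable_9k_mbufs", 1), end="")
```

Loader settings are read at boot, so a change like this would only take
effect after the next reboot.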

-- 
You are receiving this mail because:
You are the assignee for the bug.
___
freebsd-virtualization@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-virtualization
To unsubscribe, send any mail to 
"freebsd-virtualization-unsubscr...@freebsd.org"


[Bug 225791] ena driver causing kernel panics on AWS EC2

2020-05-07 Thread bugzilla-noreply
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=225791

--- Comment #36 from commit-h...@freebsd.org ---
A commit references this bug:

Author: mw
Date: Thu May  7 11:28:40 UTC 2020
New revision: 360777
URL: https://svnweb.freebsd.org/changeset/base/360777

Log:
  Optimize ENA Rx refill for low memory conditions

  Sometimes, especially when little memory is left in the system,
  allocating mbuf jumbo clusters (like 9KB or 16KB) can take a lot of time
  and is not guaranteed to succeed. In that situation the fallback will
  work, but if the refill has to take place for a lot of descriptors at
  once, the time spent in m_getjcl looking for memory can cause system
  unresponsiveness due to the high priority of the Rx task. This can also
  lead to a driver reset, because the Tx cleanup routine is blocked and
  the timer service could detect that Tx packets aren't being cleaned up.
  The reset routine can cause further unresponsiveness: Rx rings are
  refilled there, so m_getjcl will again burn the CPU. This was causing
  NVMe driver timeouts and resets, because the network driver has higher
  priority.

  Instead of 16KB jumbo clusters for the Rx buffers, 9KB clusters are
  enough - ENA MTU is being set to 9K anyway, so it's very unlikely that
  more space than 9KB will be needed.

  However, 9KB jumbo clusters can still cause issues, so by default
  page-size mbuf clusters will be used for the Rx descriptors. This can
  have a small (~2%) impact on the throughput of the device, so to restore
  the original behavior, one must change sysctl "hw.ena.enable_9k_mbufs"
  to "1" in the "/boot/loader.conf" file.

  As a part of this patch (important fix), the version of the driver
  was updated to v2.1.2.

  Submitted by:   cperciva
  Reviewed by:    Michal Krawczyk
  Reviewed by:    Ido Segev
  Reviewed by:    Guy Tzalik
  MFC after:      3 days
  PR:             225791, 234838, 235856, 236989, 243531
  Differential Revision: https://reviews.freebsd.org/D24546

Changes:
  head/sys/dev/ena/ena.c
  head/sys/dev/ena/ena.h
  head/sys/dev/ena/ena_sysctl.c
  head/sys/dev/ena/ena_sysctl.h
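
The buffer-size trade-off in the log can be put into rough numbers. This is
only a sketch under stated assumptions, not a description of what the driver
does internally: it assumes a 4 KB page (amd64), a 9001-byte jumbo frame, a
ring of 1024 Rx descriptors, and that a frame larger than one buffer is
spread over several buffers. The function names are hypothetical.

```python
PAGE_SIZE = 4096            # assumed amd64 page size (page-size cluster)
CLUSTER_9K = 9 * 1024       # 9KB jumbo cluster
CLUSTER_16K = 16 * 1024     # 16KB jumbo cluster

ASSUMED_FRAME_LEN = 9001    # assumption: jumbo frame length
ASSUMED_RING_SIZE = 1024    # assumption: Rx descriptors per ring


def buffers_per_frame(frame_len, buffer_size):
    """Number of buffers needed to hold one frame (ceiling division)."""
    return -(-frame_len // buffer_size)


def ring_bytes(buffer_size, descriptors):
    """Memory needed to give every descriptor in a ring one buffer."""
    return buffer_size * descriptors


if __name__ == "__main__":
    for label, size in (("page", PAGE_SIZE), ("9K", CLUSTER_9K), ("16K", CLUSTER_16K)):
        print(label,
              buffers_per_frame(ASSUMED_FRAME_LEN, size),
              ring_bytes(size, ASSUMED_RING_SIZE))
```

Under these assumptions a full ring of 16KB clusters needs four times the
memory of a ring of page-size clusters, while a standard 1500-byte frame
still fits in a single page-size buffer.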



[Bug 225791] ena driver causing kernel panics on AWS EC2

2020-04-22 Thread bugzilla-noreply
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=225791

--- Comment #35 from Colin Percival  ---
I believe that this patch should fix the underlying problem, which is in the
ENA driver: https://reviews.freebsd.org/D24546

If you're able to build a custom kernel, please test that patch and report
results in that review or via email (cperciva@).



[Bug 225791] ena driver causing kernel panics on AWS EC2

2019-01-10 Thread bugzilla-noreply
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=225791

--- Comment #34 from Leif Pedersen  ---
(In reply to Colin Percival from comment #32)
You bet. Sorry, I was away for a few days. I opened
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=234838



[Bug 225791] ena driver causing kernel panics on AWS EC2

2019-01-08 Thread bugzilla-noreply
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=225791

--- Comment #33 from Mike Walker  ---
I'm experiencing packet loss w/ENA & FreeBSD 12.0, relevant bug report here:
bug #234754



[Bug 225791] ena driver causing kernel panics on AWS EC2

2019-01-04 Thread bugzilla-noreply
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=225791

--- Comment #32 from Colin Percival  ---
Leif, could you open a new PR for that and CC me?  I'll get some people to look
at it but I think it's an unrelated issue so I don't want to force them to wade
through this entire thread.



[Bug 225791] ena driver causing kernel panics on AWS EC2

2019-01-02 Thread bugzilla-noreply
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=225791

--- Comment #30 from Richard Paul  ---
Berend, I think the response will be that you need to try this out on 12.0.

We have now completed our migration from AWS to GCP, so we can't make any
more progress on this.



[Bug 225791] ena driver causing kernel panics on AWS EC2

2018-12-21 Thread bugzilla-noreply
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=225791

--- Comment #29 from ber...@pobox.com ---
This is on 11.2-RELEASE-p7.



[Bug 225791] ena driver causing kernel panics on AWS EC2

2018-12-21 Thread bugzilla-noreply
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=225791

ber...@pobox.com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |ber...@pobox.com

--- Comment #28 from ber...@pobox.com ---
Seeing exactly the same thing on m5.large. 100% repeatable (full zfs send/recv
from another server).

Dec 21 21:07:10 nfs1 kernel: Fatal trap 12: page fault while in kernel mode
Dec 21 21:07:10 nfs1 kernel: cpuid = 0; apic id = 00
Dec 21 21:07:10 nfs1 kernel: fault virtual address   = 0x1c
Dec 21 21:07:10 nfs1 kernel: fault code              = supervisor write data, page not present
Dec 21 21:07:10 nfs1 kernel: instruction pointer     = 0x20:0x82269f5c
Dec 21 21:07:10 nfs1 kernel: stack pointer           = 0x0:0xfe02259ac180
Dec 21 21:07:10 nfs1 kernel: frame pointer           = 0x0:0xfe02259ac260
Dec 21 21:07:10 nfs1 kernel: code segment            = base 0x0, limit 0xf, type 0x1b
Dec 21 21:07:10 nfs1 kernel:                         = DPL 0, pres 1, long 1, def32 0, gran 1
Dec 21 21:07:10 nfs1 kernel: processor eflags        = interrupt enabled, resume, IOPL = 0
Dec 21 21:07:10 nfs1 kernel: current process         = 12 (irq260: ena0)
Dec 21 21:07:10 nfs1 kernel: trap number             = 12
Dec 21 21:07:10 nfs1 kernel: panic: page fault
Dec 21 21:07:10 nfs1 kernel: cpuid = 0
Dec 21 21:07:10 nfs1 kernel: KDB: stack backtrace:
Dec 21 21:07:10 nfs1 kernel: #0 0x80b3d577 at kdb_backtrace+0x67
Dec 21 21:07:10 nfs1 kernel: #1 0x80af6b17 at vpanic+0x177
Dec 21 21:07:10 nfs1 kernel: #2 0x80af6993 at panic+0x43
Dec 21 21:07:10 nfs1 kernel: #3 0x80f77fdf at trap_fatal+0x35f
Dec 21 21:07:10 nfs1 kernel: #4 0x80f78039 at trap_pfault+0x49
Dec 21 21:07:10 nfs1 kernel: #5 0x80f77807 at trap+0x2c7
Dec 21 21:07:10 nfs1 kernel: #6 0x80f5808c at calltrap+0x8
Dec 21 21:07:10 nfs1 kernel: #7 0x80abcd69 at intr_event_execute_handlers+0xe9
Dec 21 21:07:10 nfs1 kernel: #8 0x80abd047 at ithread_loop+0xe7
Dec 21 21:07:10 nfs1 kernel: #9 0x80aba093 at fork_exit+0x83
Dec 21 21:07:10 nfs1 kernel: #10 0x80f58fae at fork_trampoline+0xe
Dec 21 21:07:10 nfs1 kernel: Uptime: 11h49m3s
Dec 21 21:07:10 nfs1 kernel: Rebooting...



[Bug 225791] ena driver causing kernel panics on AWS EC2

2018-10-31 Thread bugzilla-noreply
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=225791

--- Comment #27 from Richard Paul  ---
@jaehak

That is not this issue, and it has already been rectified in v12.  v12 should
be out in December and this problem will go away for you (and for us: we're
actually seeing disruption on our production systems because of this bug, as
the application unexpectedly can't reach the cache and database layers when
the network interface is down. We're hit pretty hard by this because we
receive 1M-odd requests per day, so there's a lot of opportunity for this to
happen).

I did some testing yesterday but couldn't reproduce the issue on either 11.2
or 12.0 Beta-1. However, the problem does still exist on the current 11.2
release: I had a test instance with jails on it, in which I was building our
application stack (it's a convoluted stack with a lot of files being uploaded
to S3 as part of the build), and I'd been having issues with it rebooting.
Yesterday it failed on startup, as it wanted to drop into single user mode due
to a UFS checksum issue.  Obviously this isn't possible on AWS as you don't
get console access, so this instance had to be written off.



[Bug 225791] ena driver causing kernel panics on AWS EC2

2018-10-31 Thread bugzilla-noreply
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=225791

jaehak  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |cran...@gmail.com

--- Comment #26 from jaehak  ---
I have the same problem (ena interface going down and up repeatedly).
# uname -a
FreeBSD db-20 11.2-RELEASE-p4 FreeBSD 11.2-RELEASE-p4 #0: Thu Sep 27 08:16:24 UTC 2018 r...@amd64-builder.daemonology.net:/usr/obj/usr/src/sys/GENERIC amd64

# ifconfig ena0
ena0: flags=8843 metric 0 mtu 1500
options=422
ether 06:4d:4b:64:e1:86
hwaddr 06:4d:4b:64:e1:86
inet6 fe80::44d:4bff:fe64:e186%ena0 prefixlen 64 scopeid 0x1
inet 10.1.20.20 netmask 0xff00 broadcast 10.1.20.255
nd6 options=23
media: Ethernet autoselect (10Gbase-T )
status: active


This is an AWS r5.large instance.
It was on 11.1-RELEASE; I upgraded it to 11.2 with freebsd-update.

But another instance of mine, an r4.large, is very stable:
# uname -a
FreeBSD web-10 11.1-RELEASE-p1 FreeBSD 11.1-RELEASE-p1 #0: Wed Aug  9 11:55:48 UTC 2017 r...@amd64-builder.daemonology.net:/usr/obj/usr/src/sys/GENERIC amd64

# ifconfig ena0
ena0: flags=8843 metric 0 mtu 1500
options=422
ether 06:01:57:54:03:a2
hwaddr 06:01:57:54:03:a2
inet6 fe80::401:57ff:fe54:3a2%ena0 prefixlen 64 scopeid 0x1
inet 10.1.20.10 netmask 0xff00 broadcast 10.1.20.255
nd6 options=23
media: Ethernet autoselect (10Gbase-T )
status: active



[Bug 225791] ena driver causing kernel panics on AWS EC2

2018-10-29 Thread bugzilla-noreply
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=225791

--- Comment #25 from Richard Paul  ---
This has been sat on my to do list for a while.  I'm hoping that if I can get
my next job out of the way this week I'll revisit this.

Further to my previous posts, the reboots seem to happen more regularly on
instances that face moderate memory pressure while also doing a reasonable
amount of writing to disks backed by ZFS datasets.

We don't see this where there is only memory pressure: e.g., on Varnish
servers that run purely within memory we haven't seen this, even though
memory usage is very close to 100%.
If we double the memory on a crashing instance, the issue goes away.


As such I'm going to attempt to force memory pressure on a test server with an
additional disk with a zpool and zfs dataset, to attempt to reproduce this on
a recent 12.0 instance.



[Bug 225791] ena driver causing kernel panics on AWS EC2

2018-09-11 Thread bugzilla-noreply
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=225791

--- Comment #24 from Alex Dupre  ---
A bit OT, but is there a particular reason for the FreeBSD 11.2 AMI to support
C5 instances and not M5?



[Bug 225791] ena driver causing kernel panics on AWS EC2

2018-09-10 Thread bugzilla-noreply
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=225791

--- Comment #23 from p...@nomadlogic.org ---
I'm running a c5.large EC2 instance with 12.0-ALPHA5 to test this.  In the
same VPC I have a system using an xn Ethernet interface.  I am running iperf3
between these two systems and getting just shy of 1Gb/s network throughput,
and ~13Kp/s.

The c5.large system with ena interfaces has not had any problems so far.  I've
run several iperf3 TCP tests for 10mins each with no errors.  As mentioned
earlier, the interface flapping errors have gone away as well.

If there are other artificial benchmarks that I should run to help validate
this configuration has stabilized let me know and I can run them today.



[Bug 225791] ena driver causing kernel panics on AWS EC2

2018-09-10 Thread bugzilla-noreply
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=225791

--- Comment #22 from Colin Percival  ---
(In reply to Ling from comment #21)

I think the up/down state flapping is unrelated to the panics other people were
seeing, so I'd like to know if other people can reproduce the issues they saw.



[Bug 225791] ena driver causing kernel panics on AWS EC2

2018-09-09 Thread bugzilla-noreply
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=225791

--- Comment #21 from Ling  ---
(In reply to Colin Percival from comment #15)
I tested 12.0-alpha2 and 12.0-alpha5 on c5.large and t3.micro and did not
see any ena up and down messages again,
so I think this issue has been fixed.



[Bug 225791] ena driver causing kernel panics on AWS EC2

2018-09-09 Thread bugzilla-noreply
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=225791

--- Comment #20 from Leif Pedersen  ---
(In reply to Colin Percival from comment #19)

Cool. I may be able to clone that machine to 12 later this week and try to
reproduce it...I need to finish some urgent work first.



[Bug 225791] ena driver causing kernel panics on AWS EC2

2018-09-09 Thread bugzilla-noreply
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=225791

--- Comment #19 from Colin Percival  ---
The reason I was asking about HEAD is that we're currently at 12.0-ALPHA5 --
we're going to have 12.0-RELEASE before the release engineering team goes back
and does the next release from stable/11 (aka. 11.3-RELEASE).

In other words, I'd like to make sure this is fixed in the next release, but
the first step towards that is knowing if it's still broken.  There have been
some driver updates since 11.2 and one of them might have fixed this
accidentally.



[Bug 225791] ena driver causing kernel panics on AWS EC2

2018-09-08 Thread bugzilla-noreply
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=225791

Leif Pedersen  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |l...@ofwilsoncreek.com

--- Comment #18 from Leif Pedersen  ---
(In reply to pete from comment #16)

I've been able to reproduce this repeatedly (but not predictably) on 11.2 on an
r4.large. Not to state the blindingly obvious, but smaller instances such as
t2.* aren't affected since they use xn instead of ena. It seems to be most
likely at times of high network IO, which again risks stating the
forehead-slappingly obvious. :)

Multiple times, the crash included the same back-trace shown in this bug.
However, at least once it panicked on a double-fault, which, if related,
suggests that the bug in ena could be incurring memory corruption. Now granted,
I only know of one incidence of a double-fault, so it could've been running on
a host with faulty RAM or something at the time. However, after each panic, I'd
stop/start the instance rather than reboot, to provoke it to move to new
hardware, so I'm not suggesting that the whole bug is merely from faulty host
hardware.

I might beg that the fix could be patched in 11.2, or at least included in 11.3
so it won't have to wait for 12. Otherwise, AWS users will find themselves
stuck on 11.1, and the approaching EOL of 11.1 will leave them without security
updates, which in turn makes this an indirect security issue. However, I
understand there are other considerations at play, and very much appreciate the
relentless work of the security team (not to mention the work on AWS support
and FreeBSD in general).

Probably too much detail: The particular case was our standby MySQL database on
an r4.large. It was stable on 11.1, and problematic after I upgraded it to 11.2
(with `freebsd-update upgrade`); after five or so crashes in a month, I
downgraded it back to 11.1 (again with `freebsd-update upgrade`), after which
it has been perfectly stable for a couple of weeks now. It's in master-master
replication with our production replica, and normally gets a fairly low but
steady stream of activity from the replication. However, we have several
nightly jobs that crank away on updating a model and cause a large volume of
traffic in the replication stream. I don't have proper metrics on bytes/sec, so
I don't have any idea whether it saturates the interface. It's enough that
replication falls behind for up to a few hours, but I wouldn't call our system
"huge" in terms of network traffic by any means.

The reason I included all that detail is to point out: (1) it seems to be a
regression between 11.1 and 11.2, (2) r4.* are for sure affected, and (3) it
may be that the problem is more likely to be triggered on moderate or bursty
network traffic with much task-switching between MySQL threads, compared to a
simple stream of a high speed file transfer, for example.

-Leif



[Bug 225791] ena driver causing kernel panics on AWS EC2

2018-09-07 Thread bugzilla-noreply
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=225791

p...@nomadlogic.org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |p...@nomadlogic.org

--- Comment #17 from p...@nomadlogic.org ---
(In reply to Colin Percival from comment #16)
I will try to reproduce later today, or this weekend.  I was able to reproduce
about a month ago IIRC, but will test with latest 12-CURRENT checkout.



[Bug 225791] ena driver causing kernel panics on AWS EC2

2018-09-07 Thread bugzilla-noreply
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=225791

--- Comment #16 from Colin Percival  ---
Can anyone reproduce this on HEAD?  If this is still broken I'd like to make
sure it's fixed before 12.0-RELEASE, but so far this seems quite elusive.



[Bug 225791] ena driver causing kernel panics on AWS EC2

2018-08-02 Thread bugzilla-noreply
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=225791

--- Comment #15 from Colin Percival  ---
ENA flapping every 30 minutes is almost certainly due to the MTU being set
thanks to DHCP announcing support for jumbograms.  That particular bug is fixed
in HEAD (r333454).

AFAIK this should not cause any of the other reported issues, but it would be
good if someone who is experiencing problems can confirm that they don't happen
at 30 minute intervals.



[Bug 225791] ena driver causing kernel panics on AWS EC2

2018-08-02 Thread bugzilla-noreply
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=225791

Ling  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |i...@gamesofa.com

--- Comment #14 from Ling  ---
Hi, I'm running 11.2-RELEASE on EC2 in Singapore. While I have not encountered
a panic yet, ENA keeps going down and up EXACTLY every 30 minutes, which
causes about 0.5% packet loss.

on server 1 c5.2xlarge, traffic is 50Mbps,
Aug  2 00:24:40 ip-10-251-18-192 kernel: ena0: device is going DOWN
Aug  2 00:24:40 ip-10-251-18-192 kernel: ena0: device is going UP
Aug  2 00:54:40 ip-10-251-18-192 kernel: ena0: device is going DOWN
Aug  2 00:54:40 ip-10-251-18-192 kernel: ena0: device is going UP
Aug  2 01:24:41 ip-10-251-18-192 kernel: ena0: device is going DOWN
Aug  2 01:24:41 ip-10-251-18-192 kernel: ena0: device is going UP
Aug  2 01:54:40 ip-10-251-18-192 kernel: ena0: device is going DOWN
Aug  2 01:54:41 ip-10-251-18-192 kernel: ena0: device is going UP

on server 2 c5.large, traffic is <1Mbps
Aug  2 00:18:00 proxy621 kernel: ena0: device is going DOWN
Aug  2 00:18:00 proxy621 kernel: ena0: device is going UP
Aug  2 00:48:00 proxy621 kernel: ena0: device is going DOWN
Aug  2 00:48:00 proxy621 kernel: ena0: device is going UP
Aug  2 01:18:00 proxy621 kernel: ena0: device is going DOWN
Aug  2 01:18:00 proxy621 kernel: ena0: device is going UP
Aug  2 01:48:00 proxy621 kernel: ena0: device is going DOWN
Aug  2 01:48:00 proxy621 kernel: ena0: device is going UP


grep ena /var/log/messages is here
server 1:
https://nopaste.xyz/?ac03ff403e167965#pg6GYMdb+yReKI4OFiR7vmXqVy7fCsYI5e9TX2hdqTA=
server 2:
https://nopaste.xyz/?4b43d08c79c5bc32#gIcXQRyZTFZ0e7M9aW8NQQLatv78UBD3p6Gu7ZQ0QPs=
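
The 30-minute spacing in the log excerpts above can be checked mechanically.
This is a minimal sketch, assuming syslog-style lines exactly as shown and
assuming the year 2018 (the timestamps themselves carry none); the function
name down_intervals is hypothetical.

```python
from datetime import datetime

DOWN_MARKER = "ena0: device is going DOWN"


def down_intervals(lines, year=2018):
    """Return the seconds between consecutive 'going DOWN' events."""
    stamps = []
    for line in lines:
        if DOWN_MARKER not in line:
            continue
        # The syslog timestamp is the first 15 characters: "Aug  2 00:24:40".
        stamp = line[:15]
        # The year is an assumption, added so the date parses unambiguously.
        stamps.append(datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S"))
    return [int((b - a).total_seconds()) for a, b in zip(stamps, stamps[1:])]


if __name__ == "__main__":
    sample = [
        "Aug  2 00:24:40 ip-10-251-18-192 kernel: ena0: device is going DOWN",
        "Aug  2 00:24:40 ip-10-251-18-192 kernel: ena0: device is going UP",
        "Aug  2 00:54:40 ip-10-251-18-192 kernel: ena0: device is going DOWN",
        "Aug  2 01:24:41 ip-10-251-18-192 kernel: ena0: device is going DOWN",
    ]
    print(down_intervals(sample))
```

Intervals that cluster tightly around 1800 seconds would point at a periodic
trigger rather than at load-dependent behaviour.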



[Bug 225791] ena driver causing kernel panics on AWS EC2

2018-07-03 Thread bugzilla-noreply
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=225791

--- Comment #12 from Colin Percival  ---
Sadly nvme hotplug/unplug is still broken in 11.2 -- unfortunately it turned
out that some of the people who would have been fixing that were also the
people who needed to work on fixing the Spectre/Meltdown/etc. issues so this
got pushed off.  Right now I'm hoping that we'll have the bugs worked out in
time for 12.0.

The extent of the testing I've done is a few buildworlds on a single disk --
I've been busy chasing other issues (e.g., the IPI issue which was causing
userland data corruption) so I haven't been able to do much testing here.  Any
testing you can do will be much appreciated...



[Bug 225791] ena driver causing kernel panics on AWS EC2

2018-07-03 Thread bugzilla-noreply
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=225791

--- Comment #11 from Richard Paul  ---
Hi Colin,

I read the article about these drives earlier this year, thanks (P.S.: is this
more usable now in 11.2, or are we still waiting on the ability to hot-remove
drives?). Specifically on this test instance: no, I haven't. We did do this on
the original C5 server when we wanted to replace an IO2 drive which was
costing us a fortune, and we had to schedule some downtime to detach it.

But in terms of the instance that I'm testing: no, I didn't mess about with
the volumes.

However, thinking about this, have you tried testing with a larger number of
EBS volumes attached? And tested with load going to most of them at the same
time?
We have: a UFS root vol, 2x mirrored ZFS for DB, 1x ZFS for logs, and 1x ZFS
local backups vol.



[Bug 225791] ena driver causing kernel panics on AWS EC2

2018-07-02 Thread bugzilla-noreply
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=225791

--- Comment #10 from Colin Percival  ---
Have you been attaching/detaching EBS volumes while your [mc]5 instances are
running?  AFAIK the nvme driver is completely stable aside from that.



[Bug 225791] ena driver causing kernel panics on AWS EC2

2018-07-02 Thread bugzilla-noreply
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=225791

--- Comment #9 from Richard Paul  ---
So, I managed to produce sufficient load to force this to happen on an m5.large
instance.  I have tried to replicate this on a r4.large instance and have
failed to do so so far but I will keep trying.

As such it may not be the ENA adapter at all that is causing this issue; read
more at
http://www.daemonology.net/blog/2017-11-17-FreeBSD-EC2-C5-instances.html



[Bug 225791] ena driver causing kernel panics on AWS EC2

2018-06-28 Thread bugzilla-noreply
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=225791

--- Comment #8 from Richard Paul  ---
Just to help you out with pts
-

To run the benchmark in pts you need to install it using the
phoronix-test-suite package:

#~ pkg install phoronix-test-suite
#~ phoronix-test-suite install pts/blogbench



[Bug 225791] ena driver causing kernel panics on AWS EC2

2018-06-28 Thread bugzilla-noreply
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=225791

--- Comment #7 from Richard Paul  ---
I missed a couple of configuration options we'd set:

--
sysctl:

kern.ipc.soacceptqueue: 8192
--

--
postgresql.conf

max_connections = 120   
shared_buffers = 2GB
effective_cache_size = 6GB  
checkpoint_completion_target = 0.9  
checkpoint_timeout = 1h 
work_mem = 2MB  
maintenance_work_mem = 256MB
max_locks_per_transaction = 128 
random_page_cost = 1.1  
max_worker_processes = 2 
--

I've changed the instance type to an r4.large.  I'll tweak some of the
postgresql settings for the additional memory on the r4 instance and then set
it off again to try to get it to fall over.



[Bug 225791] ena driver causing kernel panics on AWS EC2

2018-06-28 Thread bugzilla-noreply
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=225791

--- Comment #6 from Richard Paul  ---
Okay, with a bit of effort (testing against the v0.7.0 ENA driver on FreeBSD
11.1p11) I got a reboot, but it was not easy to reproduce.

Here's what I did to get this to die eventually (it took about 3 hours):

 * Downsize the instance to an m5.large

 * I installed nginx on the DB server clone and started it (I'll detail config
below)

 * I installed the phoronix-test-suite and pts/blogbench

 * I kicked off a backup of our 115GB database to a local (800GB EBS vol.) ZFS
partition (the DB is held on a mirrored ZFS set on another pair of 250GB EBS
vols)

 * I kicked off a stress run of the phoronix blogbench

 * With two t2.medium instances in the same VPC, I ran wrk -d 12h -c 2k -t2
http://10.0.0.10/




Additional configuration etc.:

--
root@os-upgrade-test-db:~ # setenv PTS_CONCURRENT_TEST_RUNS 8   
root@os-upgrade-test-db:~ # setenv TOTAL_LOOP_TIME 30
root@os-upgrade-test-db:~ # phoronix-test-suite stress-run pts/blogbench

Choose Option 3 for Test All Options
--

--
pkg info nginx-full 
nginx-full-1.12.2_11,2  
Name   : nginx-full 
Version: 1.12.2_11,2
Installed on   : Thu Jun 28 09:37:29 2018 UTC


___nginx.conf___

worker_processes  auto;

events {
    worker_connections  2048;
}


http {
    include       mime.types;
    default_type  application/octet-stream;

    sendfile        on;
    tcp_nopush      on;

    keepalive_timeout  65;

    gzip  on;

    server {
        listen       80;
        server_name  localhost;

        location / {
            root   /usr/local/www/nginx;
            index  index.html index.htm;
        }

        location = /50x.html {
            root   /usr/local/www/nginx-dist;
        }
    }
}
--
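
As a rough sanity check on the load described above, the connection count
offered by the two wrk clients can be compared with the ceiling implied by the
nginx.conf.  This is only a sketch under stated assumptions: it assumes that
"worker_processes auto" resolves to 2 workers (an m5.large exposes 2 vCPUs) and
that wrk expands the "2k" suffix to 2000.  Neither figure is stated in the
report, and the function names are purely illustrative.

```python
# Rough capacity check for the reproduction setup (assumptions labeled below).

def parse_si(value: str) -> int:
    """Expand a wrk-style SI suffix such as '2k' into an integer."""
    multipliers = {"k": 1_000, "M": 1_000_000, "G": 1_000_000_000}
    if value and value[-1] in multipliers:
        return int(float(value[:-1]) * multipliers[value[-1]])
    return int(value)


def nginx_connection_ceiling(worker_processes: int, worker_connections: int) -> int:
    """Upper bound on simultaneous connections nginx will hold open."""
    return worker_processes * worker_connections


# From the report: two load generators, each running `wrk -c 2k`.
clients = 2
offered = clients * parse_si("2k")

# ASSUMPTION: 'worker_processes auto' yields 2 workers on a 2-vCPU m5.large.
ceiling = nginx_connection_ceiling(worker_processes=2, worker_connections=2048)

print(f"offered connections: {offered}")
print(f"nginx ceiling:       {ceiling}")
print(f"headroom:            {ceiling - offered}")
```

If those assumptions hold, the clients sit just under the nginx limit, so the
web server itself should not have been the component refusing connections
during the run.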

--
This is just to give you an idea of what we're doing; the backup is actually a
whole backup script that does this for each database in the RDBMS and then
rsyncs the result to the offsite server.

__Postgres dump__

sudo -u pgsql pg_dump -j 16 -Fd dbname -f /var/backups/outfile
--



[Bug 225791] ena driver causing kernel panics on AWS EC2

2018-06-28 Thread bugzilla-noreply
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=225791

--- Comment #5 from Richard Paul  ---
Hi Colin,

Thanks for responding to this issue.  You're right to point out that this may
be down to the difference in hypervisors.  M5 seem to be based on HVM too so
for our purposes maybe moving over to an r4.2xlarge would be our nearest
alternative in the r4 range of instances.  We're waiting for the inevitable
failure of our current instance as it seems to be falling over after 6-7 days.

What we need is some kind of reproducible test case for this to better be able
to diagnose the issue.  As the other reporters in this ticket say, this can
take hours to a day to reproduce and in our case, multiple days, which makes
finding such a test case so time consuming and difficult.

This is what I'm currently attempting to do with a cloned M5 instance type of
our DB server running the database dump and then trying to load the server
heavily but it's not producing much in the way of results at the moment.



[Bug 225791] ena driver causing kernel panics on AWS EC2

2018-06-27 Thread bugzilla-noreply
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=225791

--- Comment #4 from Colin Percival  ---
Sorry, I'm coming to this late -- somehow I never saw this PR earlier.

It's possible that this is an ENA driver bug, but C5 also has the added
complication of using an entirely different virtualization platform, and I'm a
bit suspicious of the backtrace here.  Can one of you try to reproduce this on
a different instance type -- m4.16xlarge or r4.* would probably be best -- so
we can see if it's specifically an *ENA* problem or a *C5* problem?



[Bug 225791] ena driver causing kernel panics on AWS EC2

2018-06-26 Thread bugzilla-noreply
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=225791

--- Comment #3 from Richard Paul  ---
(In reply to Richard Paul from comment #2)

Just a quick grab of our DB server's current throughput with `systat -ifstat
-pps`:

ena0  in       4.498 Kp/s      4.498 Kp/s    279.347 Mp
      out     12.956 Kp/s     12.956 Kp/s    674.161 Mp

We have a couple of Varnish servers in front of this platform (r4.2xlarge)
which are rock solid and which don't see anything like this kind of throughput
which is possibly why it's only this server we're seeing this issue with.



[Bug 225791] ena driver causing kernel panics on AWS EC2

2018-06-25 Thread bugzilla-noreply
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=225791

Richard Paul  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |rich...@primarysite.net

--- Comment #2 from Richard Paul  ---
We're seeing this now since we migrated our instance between instance types.
Before we were using a c5.9xlarge instance and we recently scaled that back
down (we'd rebuilt and scaled up earlier in the year to deal with seasonal
load).

This instance is our primary db instance and the kernel panics seem to happen
either during the db dump process (which occurs when we have the least amount of
DB traffic) or, as has happened once so far, during the peak of the daily load.

Our log output is almost the same as that already submitted, except that our
current process line reads:

 ```current process  = 12 (irq269: ena0)```

The two previous reports had the same IRQ number and are on the same class of
instance types whereas we're on an M5 class instance type and get a slightly
different IRQ number.  Also, this was happening on 11.1p4, but we have since
upgraded to p10 and the issue is still occurring.  I'm planning on upgrading a
clone to 11.2p0 later this week to check whether there's a new version of the
ENA driver (from here
https://github.com/amzn/amzn-drivers/tree/master/kernel/fbsd/ena) in this
build.



[Bug 225791] ena driver causing kernel panics on AWS EC2

2018-05-22 Thread bugzilla-noreply
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=225791

Terje Elde  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |te...@elde.net

--- Comment #1 from Terje Elde  ---
We're also affected by this, running c5.large, handling about 13 000
connections through haproxy, then varnish and on to other systems.  Activity
was about 4000 requests per minute leading up to the crash, which doesn't seem
all that high.  It's possible that it could have spiked shortly before the
crash though, without getting that in the logs.

This is:
FreeBSD [host snipped] 11.1-RELEASE-p8 FreeBSD 11.1-RELEASE-p8 #0: Tue Mar 13
17:07:05 UTC 2018
r...@amd64-builder.daemonology.net:/usr/obj/usr/src/sys/GENERIC  amd64

It's a lightly modified/configured version of one of the usual FreeBSD AMIs, I
don't recall the AMI ID exactly, sorry.  Kernel etc is stock, we've just made
additions in terms of software etc for our own AMI.

We have two virtually identical machines exposed under the same hostname,
receiving a near identical load, and have so far only been noticing this with
one of the machines.  Could be coincidental, but figured it worthwhile to
mention.

It strikes me as noteworthy that the data rate was only about 700kBps at the
last data point I have before the crash.  Unfortunately I don't know anything
about packet rate, and again it's possible that there could have been a peak
leading up to the crash, without getting the logs of it.

If anyone is interested in any other data from this, please do let me know. 
Also, this is part of a redundant setup, allowing some extra room for moving
things around if anyone wants anything tested or tried on the setup.

>> Crash itself:

Limiting open port RST response from 457 to 200 packets/sec
Limiting open port RST response from 487 to 200 packets/sec
Limiting open port RST response from 541 to 200 packets/sec
Limiting open port RST response from 517 to 200 packets/sec
Limiting open port RST response from 586 to 200 packets/sec
Limiting open port RST response from 237 to 200 packets/sec
ena0: Found a Tx that wasn't completed on time, qid 1, index 324.
pid 3639 (varnishd), uid 429: exited on signal 6
Limiting open port RST response from 259 to 200 packets/sec
Limiting open port RST response from 380 to 200 packets/sec
ena0: Found a Tx that wasn't completed on time, qid 1, index 181.


Fatal trap 12: page fault while in kernel mode
cpuid = 0; apic id = 00
fault virtual address   = 0x1c
fault code              = supervisor write data, page not present
instruction pointer     = 0x20:0x82173f8c
stack pointer           = 0x28:0xfe0110f43180
frame pointer           = 0x28:0xfe0110f43260
code segment            = base 0x0, limit 0xf, type 0x1b
                        = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags        = interrupt enabled, resume, IOPL = 0
current process         = 12 (irq261: ena0)
trap number             = 12
panic: page fault
cpuid = 0
KDB: stack backtrace:
#0 0x80aadac7 at kdb_backtrace+0x67
#1 0x80a6bba6 at vpanic+0x186
#2 0x80a6ba13 at panic+0x43
#3 0x80ee3092 at trap_fatal+0x322
#4 0x80ee30eb at trap_pfault+0x4b
#5 0x80ee290a at trap+0x2ca
#6 0x80ec3d40 at calltrap+0x8
#7 0x80a321ec at intr_event_execute_handlers+0xec
#8 0x80a324d6 at ithread_loop+0xd6
#9 0x80a2f845 at fork_exit+0x85
#10 0x80ec4a0e at fork_trampoline+0xe
Uptime: 8d22h59m55s
Rebooting...
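
Since the reports in this PR are being compared on a few specific fields (the
faulting address, the interrupt thread named in "current process", and the
preceding Tx timeout messages), a small script can pull those out of a console
log for side-by-side comparison.  The following is a minimal sketch, not part
of any FreeBSD tooling; the function name and the shape of the returned
dictionary are made up for illustration, and the regular expressions only
cover the message formats quoted in this thread.

```python
import re

# Hypothetical helper: extract the fields that differ between reports from a
# FreeBSD panic message so that several reports can be compared.
FIELD_RE = re.compile(
    r"^(fault virtual address|fault code|current process|trap number)\s*=\s*(.+)$"
)
TX_RE = re.compile(
    r"^(\w+): Found a Tx that wasn't completed on time, qid (\d+), index (\d+)\."
)
IRQ_RE = re.compile(r"\(irq(\d+): (\w+)\)")


def summarize_panic(log_text: str) -> dict:
    """Return the trap fields, Tx timeout events and IRQ found in log_text."""
    summary = {"tx_timeouts": []}
    for line in log_text.splitlines():
        line = line.strip()
        tx = TX_RE.match(line)
        if tx:
            summary["tx_timeouts"].append(
                {"dev": tx.group(1), "qid": int(tx.group(2)), "index": int(tx.group(3))}
            )
            continue
        field = FIELD_RE.match(line)
        if field:
            summary[field.group(1)] = field.group(2).strip()
    # The interrupt thread is embedded in the "current process" value.
    irq = IRQ_RE.search(summary.get("current process", ""))
    if irq:
        summary["irq"] = int(irq.group(1))
        summary["irq_device"] = irq.group(2)
    return summary


# Lines copied from the crash output above.
sample = """\
ena0: Found a Tx that wasn't completed on time, qid 1, index 324.
ena0: Found a Tx that wasn't completed on time, qid 1, index 181.
Fatal trap 12: page fault while in kernel mode
fault virtual address   = 0x1c
fault code  = supervisor write data, page not present
current process = 12 (irq261: ena0)
trap number = 12
"""

result = summarize_panic(sample)
print(result["irq"], result["irq_device"], len(result["tx_timeouts"]))
```

Running the same function over the logs from the other reporters would make it
easy to see at a glance whether the IRQ number and faulting address match.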


>> boot log:

Copyright (c) 1992-2017 The FreeBSD Project.
Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994
The Regents of the University of California. All rights reserved.
FreeBSD is a registered trademark of The FreeBSD Foundation.
FreeBSD 11.1-RELEASE-p8 #0: Tue Mar 13 17:07:05 UTC 2018
r...@amd64-builder.daemonology.net:/usr/obj/usr/src/sys/GENERIC amd64
FreeBSD clang version 4.0.0 (tags/RELEASE_400/final 297347) (based on LLVM
4.0.0)
VT(vga): text 80x25
CPU: HammerEM64T (3000.05-MHz K8-class CPU)
  Origin="GenuineIntel"  Id=0x50653  Family=0x6  Model=0x55  Stepping=3
  Features=0x1f83fbff
  Features2=0xfffa3203
  AMD Features=0x2c100800
  AMD Features2=0x121
  Structured Extended Features=0xd11f4fbb
  Structured Extended Features2=0x8
  XSAVE Features=0xf
  TSC: P-state invariant, performance statistics
Hypervisor: Origin = "KVMKVMKVM"
real memory  = 5114953728 (4878 MB)
avail memory = 

[Bug 225791] ena driver causing kernel panics on AWS EC2

2018-02-12 Thread bugzilla-noreply
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=225791

Mark Linimon  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Assignee|freebsd-b...@freebsd.org    |freebsd-virtualization@FreeBSD.org
