Unfortunately, the system is unusable this morning. Still trying to
recover it.  May have to flatline it again.

It seems I have gotten myself stuck in a loop:
1. try to reboot, which causes a kernel panic
2. after that happens a few times, the NVMe drive needs to be fsck'd because
of corrupt group descriptors
3. `fsck -CVvfy` the drive (twice for the ext partition and once for the EFI)
4. after doing 1-3 a few times, packages and symlinks start getting broken.  I 
try to manually repair them until eventually I can't get into the system 
anymore.

I tried to run memtest.  If it is set to 1 cpu at a time, it goes
without error until it eventually hangs on a random (inconsistent) test.
If I run with all cpus, it shows tons of errors pretty quickly.  Always
on the same bit of every bank (e.g. 80808080 -> 8080A080) and always off
by two.  But again, it doesn't do that unless multiple cpus are running
at the same time. I thought it could be the other security features
(interleaving, memory encryption, etc) that the BIOS has set to auto.
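
For reference, XORing the expected and actual values from the memtest
example shows exactly which bits differ.  This is only a minimal sketch:
the helper name `flipped_bits` is made up for illustration, and the two
values are just the example pair quoted above, not a full memtest log.

```shell
#!/bin/sh
# flipped_bits EXPECTED ACTUAL
# Prints the XOR of two 32-bit hex values (given without a 0x prefix),
# i.e. a mask with a 1 in every bit position where the two values differ.
# Hypothetical helper for illustration; it is not part of memtest.
flipped_bits() {
    printf '%08X\n' "$(( 0x$1 ^ 0x$2 ))"
}

# Example pair from the memtest output described above.
flipped_bits 80808080 8080A080
```

For that pair the mask is 00002000, a single set bit (bit 13), so the
"off by two" in that hex digit is one flipped bit.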

Launching the live usb and just sitting at a terminal with `journalctl
--follow`, the last thing that happens before it hangs is usually
cleaning temp files, but I haven't run that enough to know if it is a
pattern.

From the BIOS, I can set it to auto overclock or manual -- there is no
option to disable overclocking; so I cleared the CMOS and tried again
immediately after that without any change.

I have attempted 44 bionic installs this month.  4 of those went through
to completion: two normal and two minimal.  The rest failed during
ubiquity.

grub-install almost always succeeds when acpi=off and almost always
hangs when it isn't.

I also have to have pcie_aspm=off or the system is spammed with errors
and crashes quickly.  Others have reported the same thing for
threadripper.
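
To confirm which of these parameters a given boot actually used, the
kernel command line can be checked.  A minimal sketch, assuming
/proc/cmdline is readable; the helper name `has_kparam` is made up for
illustration:

```shell
#!/bin/sh
# has_kparam PARAM [CMDLINE]
# Succeeds if PARAM appears as a whole space-separated word in CMDLINE.
# CMDLINE defaults to the running kernel's /proc/cmdline.
# Hypothetical helper name, for illustration only.
has_kparam() {
    param=$1
    cmdline=${2-$(cat /proc/cmdline 2>/dev/null)}
    case " $cmdline " in
        *" $param "*) return 0 ;;
        *) return 1 ;;
    esac
}

# Report the two parameters discussed above for the current boot.
for p in acpi=off pcie_aspm=off; do
    if has_kparam "$p"; then
        echo "$p: set"
    else
        echo "$p: not set"
    fi
done
```

The optional second argument lets the same check run against a command
line copied out of a bug report instead of the running system.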

I have tried with and without livepatch enabled.

The system is stable when mining or gaming, and seems unstable when
underutilized -- so I tried disabling the C-states in the BIOS.  I have
tried disabling every form of power management I could find in the OS
and in the BIOS.  I am sure I have missed quite a few.

I have tried manually updating the kernel (per your requests) as well as
using ukuu.  Since it is my primary machine, I tend to have things
installed that have to then be uninstalled for that to work well (like
nvidia drivers, virtualbox, etc).

I am seeing a ton of segfaults, even from the live usb.  It more often
happens when the machine is sitting idle for a few minutes (which is
what had me thinking about power management).  I thought it could be the
memory, but since they don't fail memtest (if I run them 1 cpu at a
time)....

I know that "Erase disk and reinstall" will not solve the problem.  It
would be nice to figure out how to solve the problem before I do that
again.

So... I'm not sure how I can try a new kernel for you.  If there is some
way for me to update a live usb with an alternate kernel from a live
usb, that might work, since I see errors on the daily bionic iso as well.

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1765838

Title:
  BUG: Bad rss-counter state mm:000000002ddfedce idx:2 val:-1

Status in linux package in Ubuntu:
  Incomplete

Bug description:
  Booted. Started firefox. A couple seconds later it was back on the
  lock screen. Logged in again and it hung.

  This is a fresh "Erase disk and reinstall" of minimal from yesterday's
  bionic iso

  Ubuntu 4.15.0-15.16-generic 4.15.15

  ProblemType: KernelOops
  DistroRelease: Ubuntu 18.04
  Package: linux-image-4.15.0-15-generic 4.15.0-15.16
  ProcVersionSignature: Ubuntu 4.15.0-15.16-generic 4.15.15
  Uname: Linux 4.15.0-15-generic x86_64
  NonfreeKernelModules: nvidia_modeset nvidia
  Annotation: Your system might become unstable now and might need to be restarted.
  ApportVersion: 2.20.9-0ubuntu5
  Architecture: amd64
  Date: Fri Apr 20 12:13:12 2018
  Failure: oops
  InstallationDate: Installed on 2018-04-19 (0 days ago)
  InstallationMedia:
   
  MachineType: System manufacturer System Product Name
  OopsText:
   BUG: Bad rss-counter state mm:000000002ddfedce idx:2 val:-1
   TaskSchedulerFo[35882]: segfault at 5cf3c85816d8 ip 0000557fa5ed3ed0 sp 00007ff500037420 error 4 in chrome[557fa4a32000+5cd4000]
   traps: wget[35886] general protection ip:7fbe54e7d2ff sp:7ffc7aa235a0 error:0 in ld-2.27.so[7fbe54e70000+27000]
  ProcFB: 0 EFI VGA
  ProcKernelCmdLine: BOOT_IMAGE=/boot/vmlinuz-4.15.0-15-generic.efi.signed root=UUID=d9b05c55-71bb-4f4f-bdfd-43dd79de4c1d ro reboot=pci pcie_aspm=off
  PulseList: Error: command ['pacmd', 'list'] failed with exit code 1: No PulseAudio daemon running, or not running as session daemon.
  RelatedPackageVersions: kerneloops-daemon N/A
  SourcePackage: linux
  Title: BUG: Bad rss-counter state mm:000000002ddfedce idx:2 val:-1
  UpgradeStatus: No upgrade log present (probably fresh install)
  dmi.bios.date: 12/21/2017
  dmi.bios.vendor: American Megatrends Inc.
  dmi.bios.version: 0902
  dmi.board.asset.tag: Default string
  dmi.board.name: ROG ZENITH EXTREME
  dmi.board.vendor: ASUSTeK COMPUTER INC.
  dmi.board.version: Rev 1.xx
  dmi.chassis.asset.tag: Default string
  dmi.chassis.type: 3
  dmi.chassis.vendor: Default string
  dmi.chassis.version: Default string
  dmi.modalias: dmi:bvnAmericanMegatrendsInc.:bvr0902:bd12/21/2017:svnSystemmanufacturer:pnSystemProductName:pvrSystemVersion:rvnASUSTeKCOMPUTERINC.:rnROGZENITHEXTREME:rvrRev1.xx:cvnDefaultstring:ct3:cvrDefaultstring:
  dmi.product.family: To be filled by O.E.M.
  dmi.product.name: System Product Name
  dmi.product.version: System Version
  dmi.sys.vendor: System manufacturer

