Bug#920547: Crashes every few hours

2019-01-28 Toni Mueller



Hi Ben,

On Mon, Jan 28, 2019 at 12:42:37AM +, Ben Hutchings wrote:
> On Sat, 26 Jan 2019 20:03:49 + Toni  wrote:
> > Package: src:linux
> > Version: 4.19.16-1
> > Severity: critical
> > File: linux-image-4.19.0-2-amd64
> 
> Is this a new problem with version 4.19.16-1?  Or did it happen with
> earlier versions as well?

It happened with the 4.18.* kernel as well. The machine came with Ubuntu
and 4.13 preinstalled, but I wiped it as soon as I could and installed
Debian. So I don't know if it would have worked with Ubuntu - the entire
setup was not suitable for my purposes, but I thought that 4.9 might be
too old for this hardware.

However, the machine came with a 1.3 BIOS, which I updated to 1.6 and
then to 1.7. I think I had 4.18 running together with BIOS 1.6, but I
closed the corresponding bug report when I noticed that both a newer
kernel and a newer BIOS were available. Well, the situation has improved
a little compared to that, but it is still very bad.

> When you say "data loss", are you talking about data in memory or
> corruption of files that were saved and sync'd to disk?

I mean, files on disk were destroyed. I noticed some because I use
etckeeper with git, and suddenly, I could no longer see my update
history because files in /etc/.git were corrupt to the point that no
"git fsck" or "git gc" could resurrect the tree.

> On x86 laptops thermal management is (by default) done by the system
> firmware (BIOS and management engine code).  If you didn't override
> that, and yet the CPU overheats, this is the manufacturer's fault.

Ok... In the BIOS, I set the corresponding parameter from "performance"
to "normal", which I hoped would be a more conservative setting, to
prevent exactly this problem.


Cheers,
Toni



Bug#920547: Crashes every few hours

2019-01-28 Julien Aubin
On Sun, 27 Jan 2019 15:10:39 -0500 Chris Manougian  wrote:
> Hi Toni. I have an XPS 15 9570, which, I think, is basically the same
> machine, except yours uses an NVIDIA Quadro vs my GeForce GTX 1050Ti as a
> 2nd graphics card.
>
> There are a lot of problems with that secondary graphics card and Linux.
> Are you attempting to use it via Bumblebee?
>
> See this thread (and links within the thread) - BIOS related:
> https://bugzilla.redhat.com/show_bug.cgi?id=1610727
>
> I did my best to disable the NVIDIA card:
> https://wiki.archlinux.org/index.php/Dell_XPS_15_9570
>
> One of my more recent "important" gnome-logs file is:
>
> 03:16:35 kernel: ath10k_pci :3b:00.0: firmware: failed to load
> ath10k/cal-pci-:3b:00.0.bin (-2)
> 03:16:35 kernel: firmware_class: See https://wiki.debian.org/Firmware for
> information about missing firmware
> 03:16:35 kernel: ath10k_pci :3b:00.0: firmware: failed to load
> ath10k/pre-cal-pci-:3b:00.0.bin (-2)
> 03:16:34 kernel: iTCO_wdt iTCO_wdt: can't request region for resource [mem
> 0x00c5fffc-0x00c5]
> 03:16:34 kernel: ACPI Error: Skip parsing opcode OpcodeName unavailable
> (20180531/psloop-542)
> 03:16:34 kernel: ACPI Error: Skip parsing opcode OpcodeName unavailable
> (20180531/psloop-542)
> 03:16:34 kernel: ACPI Error: AE_ALREADY_EXISTS, During name lookup/catalog
> (20180531/psobject-221)
> 03:16:34 kernel: ACPI BIOS Error (bug): Failure creating
> [\_SB.PCI0.XHC.RHUB.SS10._PLD], AE_ALREADY_EXISTS (20180531/dswload2-316)
> 03:16:34 kernel: ACPI Error: Skip parsing opcode OpcodeName unavailable
> (20180531/psloop-542)
> 03:16:34 kernel: ACPI Error: AE_ALREADY_EXISTS, During name lookup/catalog
> (20180531/psobject-221)
> 03:16:34 kernel: ACPI BIOS Error (bug): Failure creating
> [\_SB.PCI0.XHC.RHUB.SS10._UPC], AE_ALREADY_EXISTS (20180531/dswload2-316)
> 03:16:34 kernel: ACPI Error: Skip parsing opcode OpcodeName unavailable
> (20180531/psloop-542)
> 03:16:34 kernel: ACPI Error: AE_ALREADY_EXISTS, During name lookup/catalog
> (20180531/psobject-221)
> 03:16:34 kernel: ACPI BIOS Error (bug): Failure creating
> [\_SB.PCI0.XHC.RHUB.SS09._PLD], AE_ALREADY_EXISTS (20180531/dswload2-316)
> 03:16:34 kernel: ACPI Error: Skip parsing opcode OpcodeName unavailable
> (20180531/psloop-542)
> 03:16:34 kernel: ACPI Error: AE_ALREADY_EXISTS, During name lookup/catalog
> (20180531/psobject-221)
> 03:16:34 kernel: ACPI BIOS Error (bug): Failure creating
> [\_SB.PCI0.XHC.RHUB.SS09._UPC], AE_ALREADY_EXISTS (20180531/dswload2-316)
> 03:16:34 kernel: ACPI Error: Skip parsing opcode OpcodeName unavailable
> (20180531/psloop-542)
> 03:16:34 kernel: ACPI Error: AE_ALREADY_EXISTS, During name lookup/catalog
> (20180531/psobject-221)
> 03:16:34 kernel: ACPI BIOS Error (bug): Failure creating
> [\_SB.PCI0.XHC.RHUB.SS08._PLD], AE_ALREADY_EXISTS (20180531/dswload2-316)
> 03:16:34 kernel: ACPI Error: Skip parsing opcode OpcodeName unavailable
> (20180531/psloop-542)
> 03:16:34 kernel: ACPI Error: AE_ALREADY_EXISTS, During name lookup/catalog
> (20180531/psobject-221)
> 03:16:34 kernel: ACPI BIOS Error (bug): Failure creating

Hi,

In case it helps (I'm on testing w/ 4.9.12): this bug seems to be
NVIDIA-specific. I have never encountered such things on my laptop, which
has an AMD GPU with the FOSS driver. The model is a Dell Latitude E6540.

Rgds,



Bug#920547: Crashes every few hours

2019-01-27 Ben Hutchings
Control: tag -1 moreinfo

On Sat, 26 Jan 2019 20:03:49 + Toni  wrote:
> Package: src:linux
> Version: 4.19.16-1
> Severity: critical
> File: linux-image-4.19.0-2-amd64

Is this a new problem with version 4.19.16-1?  Or did it happen with
earlier versions as well?

> my laptop lasts a few hours at most before becoming unresponsive and hot
> and refusing to do normal things. E.g., trying to create this bug report
> and using sudo to read the kernel logs after about one hour of total
> uptime, with two suspend/resume cycles in between, made the system
> crash. "Crash" means that, in such a situation, I can only press the
> power button until the system is completely off, but after that, I am
> forced to immediately turn the system back on, so that the fans can do
> their work, because otherwise, the CPU overheats. Pressing
> Ctrl-Alt-Delete has no effect.
> 
> Justification for "grave": I've experienced data loss in such
> situations, and of course, having the entire system going down, with
> potential hardware damage (sans human intervention) is probably as bad
> as it can be.

When you say "data loss", are you talking about data in memory or
corruption of files that were saved and sync'd to disk?

On x86 laptops thermal management is (by default) done by the system
firmware (BIOS and management engine code).  If you didn't override
that, and yet the CPU overheats, this is the manufacturer's fault.
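
To see from the running system whether the firmware is actually holding
temperatures down, the kernel's thermal zones can be read from sysfs. A
minimal sketch in Python, assuming the usual
/sys/class/thermal/thermal_zone*/ layout with "type" and "temp" files
(temp in millidegrees Celsius). The function name read_thermal_zones is
only illustrative, and which zones exist varies by machine:

```python
from pathlib import Path


def read_thermal_zones(base="/sys/class/thermal"):
    """Return {zone_name: (zone_type, temp_celsius)} for each readable zone.

    Assumes the sysfs layout thermal_zone*/type and thermal_zone*/temp,
    with temp reported in millidegrees Celsius.
    """
    readings = {}
    for zone in sorted(Path(base).glob("thermal_zone*")):
        try:
            kind = (zone / "type").read_text().strip()
            millideg = int((zone / "temp").read_text().strip())
        except (OSError, ValueError):
            # Some zones refuse reads or return junk; skip them.
            continue
        readings[zone.name] = (kind, millideg / 1000.0)
    return readings


if __name__ == "__main__":
    for name, (kind, temp) in read_thermal_zones().items():
        print(f"{name} ({kind}): {temp:.1f} C")
```

Running this periodically (for example from a shell loop) before a hang
would show whether temperatures climb steadily or the crash comes first.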

Ben.

> I've attached the dmesg from boot and some kernel logs for your perusal,
> cleansed of private data.

-- 
Ben Hutchings
We get into the habit of living before acquiring the habit of thinking.
 - Albert Camus



