[Bug 15946] critical shutdown because of bogus temperature got from EC address space - Asus V1S

bugzilla-daemon Sun, 13 May 2012 14:00:33 -0700

https://bugzilla.kernel.org/show_bug.cgi?id=15946



Xavier Hourcade <public....@xapaho.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |public....@xapaho.com




--- Comment #81 from Xavier Hourcade <public....@xapaho.com>  2012-05-13 21:00:17 ---
Hello all!

I cannot believe I had not found this ticket earlier, thanks! :)

> Finally more people appear who suffer from this bug too.

Exactly :D From my own searches, I suspect there are many more like us out
there. I am not a developer myself, but *extremely* motivated to contribute on
this one. This is a deeply frustrating bug; I can reproduce it often and
provide data.

Hardware is also Asus V1S, BIOS 301, with the "infamous" NVidia 8600M GT
(replaced once under warranty, as Windoz driver "had let it burn" they said).

The issue still occurs under the latest stable Fedora 16 KDE, no matter which
recent kernel/mod/DE I am running. Currently:
  kernel-3.3.4-3.fc16.x86_64
  kmod-nvidia-295.49-1.fc16.1.x86_64
  kde-workspace-4.8.3-3.fc16.x86_64

Each time, without exception:
- Fan speed suddenly goes to maximum for a few seconds.
- Then the system powers off immediately.
- No matter when I next restart the laptop, the fan is at maximum speed again
  for a few seconds, typically the time it takes to reach the GRUB screen.
So there is for sure a flag set somewhere on the hardware/BIOS side.

The system was simply unusable in production with Fedora 14 kernels and
nouveau. After I switched to kmod-nvidia, the issue almost disappeared (maybe
once a month).

With Fedora 16 kernels I first tried nouveau again, which was as unusable as
before. Then I switched to kmod-nvidia and the system became more usable, but
only "randomly" so (from unusable to several shutdowns per week, with
"nonsensical" exceptions).

10 days ago I cleaned the cooler and replaced the thermal paste. This:
- dropped overall temperatures by 10+ degrees Celsius
- greatly reduced occurrences, from several per day to several per week
- again permits the use of dual head, i.e. nVidia TwinView
- keeps the system silent 90% of the time, as when new; the fan speed does not
  even go high unless I put the system under long and heavy load

Shortly afterwards, the update to KDE 4.8.3 (where KWin seems to be less GPU
aggressive?) reduced the average frequency of the issue even more
(twice a week, plus "unfortunate nonsensical exceptions"!).
Hence my increased suspicion of the nVidia board.

Critical shutdowns now occur more randomly than ever before:
- None over the past 3 days with dual display, a VBox guest, "heavy load and
  all".
- Today I had three in a row (!) with the internal display only and almost no
  load. Worth mentioning (?): the outside air temperature was clearly hotter
  than usual. The last of this awful series of 3 occurred at boot, while the
  KDE session was opening, which was the first time ever, so I wouldn't say
  things are getting better. But then, no further shutdown until now, 4 hours
  later.
- They can typically occur while closing and re-opening, several times in a
  row, e.g. the Firefox web browser with multiple tabs of "heavy" or rather
  long web pages (such as redhat bugzilla * query results or kernel.org
  changelog :)

Whenever the random devil is on my side, I can run just about anything at once:
- a VM guest (host partitions sit in a LUKS VG, hence it's rather intensive)
- yum updates, prelink -a, rpm -Va, rsync some 20-gigabyte tarballs
- play some flash video full screen on the internal display with Firefox
- play some other video full screen on an external display with VLC
- also run browser, email/chat/IRC clients, even desktop effects
and still fail to cause the issue:
- CPU and GPU both come close to a maximum of 78 ºC
- the single fan spins up to maximum speed and manages to cool the system down
- KDE only gets a little slow, nothing else.
GNU/Linux rocks, then :)

Critical shutdowns, as ever, "may or may not" appear in /var/log/messages.
The system "may or may not" have time to step down runlevels (or just sync the
disk?). Whenever one is logged, the "bogus" 127 ºC value is always reported.

So, I wrote a poor-man's watchdog/logger script to track this. So far:
- I run it within a root screen session started from tty2, once the KDM login
  is ready
- It uses `nvidia-smi` and `acpi` to monitor temperatures and set priorities.
  I shall now modify it to *also* read from /sys/class/thermal,
  as well as from `nvidia-settings -t -q` for the GPU
- It reduces the delay between readings as the priority increases:

      CPU_NOTICE  CPU_WARNING  CPU_CRITICAL ->   72 74 76
      GPU_NOTICE  GPU_WARNING  GPU_CRITICAL ->   70 72 74
WAIT WAIT_NOTICE WAIT_WARNING WAIT_CRITICAL -> 3  2  1  0
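
For illustration, the thresholds above could map to a priority and a polling
delay roughly as follows. This is a minimal Python sketch, not the actual
heat.sh (which is not shown here). The names `level`, `priority`,
`wait_seconds` and the "ok" label are hypothetical, the `>=` comparison is an
assumption, and so is reading the WAIT values as seconds.

```python
# Thresholds in ºC, copied from the table above: (notice, warning, critical).
CPU_LIMITS = (72, 74, 76)
GPU_LIMITS = (70, 72, 74)

# Delay between readings for levels 0..3. Seconds is an assumed unit.
WAITS = (3, 2, 1, 0)

# Level 0 has no name in the table; "ok" is a hypothetical label.
LEVEL_NAMES = ("ok", "notice", "warning", "critical")


def level(temp, limits):
    """Return 0..3: the number of thresholds the temperature has reached.

    Assumes a threshold counts as reached when temp >= limit.
    """
    return sum(temp >= limit for limit in limits)


def priority(cpu_temp, gpu_temp):
    """Overall priority is the worse of the CPU and GPU levels."""
    return max(level(cpu_temp, CPU_LIMITS), level(gpu_temp, GPU_LIMITS))


def wait_seconds(cpu_temp, gpu_temp):
    """Delay before the next reading: shorter as the priority rises."""
    return WAITS[priority(cpu_temp, gpu_temp)]
```

Taking the worse of the two levels means a hot GPU alone is enough to speed up
polling, which matches "reduces the delay as the priority increases" without
claiming this is how the original script combines them.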

More tests are run at >= Warning, e.g. CPU load average (which increases the
delay by 0.5s). Even more at >= Critical: top 5 processes for CPU and RAM usage.
The logger is called at each first or returning "notice", and at any higher
priority. Sync is run after each logger call, but it doesn't seem to help catch
any more events. A star, as in "ºC*", denotes a new highest temperature within
the current execution.
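
The planned reading from /sys/class/thermal could look roughly like the sketch
below. It assumes the usual sysfs layout, where each `thermal_zone*` directory
has a `temp` file in millidegrees Celsius. The function name
`read_thermal_zones` is hypothetical, and which zone corresponds to the EC
value on the V1S is not known from this report.

```python
from pathlib import Path


def read_thermal_zones(base="/sys/class/thermal"):
    """Return {zone_name: temperature in ºC} for every readable thermal zone."""
    temps = {}
    for zone in sorted(Path(base).glob("thermal_zone*")):
        try:
            raw = (zone / "temp").read_text().strip()
            # sysfs reports millidegrees Celsius, e.g. "70000" for 70 ºC.
            temps[zone.name] = int(raw) / 1000.0
        except (OSError, ValueError):
            # Zone disappeared or returned something unparsable: skip it.
            continue
    return temps


if __name__ == "__main__":
    for name, temp in read_thermal_zones().items():
        print(f"{name}: {temp:.1f} ºC")
```

Logging these values next to the `nvidia-smi` and `acpi` readings would show
whether the kernel's own view of the zone ever differs from what the userspace
tools report just before a shutdown.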

# grep -E "ºC|temperature|\(proc\) stopped|kmsg started" /var/log/messages
May 13 16:46:36 venus /root/heat.sh:   log | notice   | GPU:  69 ºC   82 MB | CPU:  70 ºC
May 13 16:56:02 venus /root/heat.sh:   log | notice   | GPU:  69 ºC* 105 MB | CPU:  71 ºC*
May 13 16:56:16 venus /root/heat.sh:   log | notice   | GPU:  69 ºC   82 MB | CPU:  72 ºC*
May 13 17:01:12 venus kernel: imklog 5.8.10, log source = /proc/kmsg started.
May 13 17:07:34 venus kernel: [  460.807678] Critical temperature reached (127 C), shutting down.
May 13 17:07:34 venus kernel: Kernel logging (proc) stopped.
May 13 17:10:59 venus kernel: imklog 5.8.10, log source = /proc/kmsg started.
May 13 17:25:08 venus /root/heat.sh:   log | notice   | GPU:  68 ºC*  58 MB | CPU:  69 ºC*
May 13 18:45:04 venus /root/heat.sh:   log | notice   | GPU:  67 ºC*  86 MB | CPU:  73 ºC*
May 13 18:45:15 venus /root/heat.sh:   log | notice   | GPU:  68 ºC*  91 MB | CPU:  72 ºC
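
To compare each logged shutdown with the last reading taken before it, the
excerpt above could be parsed along these lines. This is only a sketch in
Python: it assumes each log entry sits on a single line (mail wrapping undone)
and that the format is exactly as in the excerpt. The regexes and the function
name `readings_before_shutdown` are hypothetical.

```python
import re

# heat.sh entry: timestamp, GPU temperature, CPU temperature.
HEAT_RE = re.compile(
    r"^(\w{3} +\d+ [\d:]{8}) .*heat\.sh: .*"
    r"GPU: +(\d+) ºC\*? +\d+ MB \| CPU: +(\d+) ºC"
)
# Kernel entry: timestamp and the temperature reported at shutdown.
CRIT_RE = re.compile(
    r"^(\w{3} +\d+ [\d:]{8}) .*Critical temperature reached \((\d+) C\)"
)


def readings_before_shutdown(lines):
    """Pair each critical shutdown with the last heat.sh reading before it.

    Returns a list of (shutdown_time, reported_temp, last_reading), where
    last_reading is (time, gpu_temp, cpu_temp) or None if nothing was logged.
    """
    last = None
    results = []
    for line in lines:
        m = HEAT_RE.search(line)
        if m:
            last = (m.group(1), int(m.group(2)), int(m.group(3)))
            continue
        m = CRIT_RE.search(line)
        if m:
            results.append((m.group(1), int(m.group(2)), last))
    return results
```

Fed the lines of /var/log/messages, this gives one tuple per logged shutdown,
which makes the gap between the last real reading and the reported 127 easy to
see. Shutdowns that never reach the log are, of course, invisible to it.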

My poor-man's conclusion so far:
- Warning wasn't even reached when the 2nd of today's shutdowns occurred at
  17:00 (which was *not* logged)
- Notice wasn't even reached when the 3rd of today's shutdowns occurred at
  17:07 (which *was* magically logged this time)
- No such real temperature jump could occur within a 1-second interval
- So the "127" value here either is bogus indeed, or means something else.
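
The reasoning above can be written down as a simple plausibility check. Note
that 127 is 0x7F, the largest value a signed 8-bit integer can hold, which
would be consistent with a failed or sentinel read, though that remains a
hypothesis. In this Python sketch the name `looks_bogus` and the 5 ºC per
second limit are assumptions chosen for illustration, not measured values.

```python
INT8_MAX = 127  # 0x7F, largest signed 8-bit value


def looks_bogus(prev_temp, new_temp, elapsed_s, max_rate=5.0):
    """Return True if a new reading is suspicious.

    A reading is flagged when it equals or exceeds INT8_MAX, or when the
    change since the previous reading is faster than max_rate ºC per second.
    max_rate is an assumed bound, not a property of this hardware.
    """
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    if new_temp >= INT8_MAX:
        return True
    return abs(new_temp - prev_temp) / elapsed_s > max_rate
```

For example, `looks_bogus(72, 127, 1.0)` flags the jump seen in the log, while
a step from 72 to 73 ºC within one second passes.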

Any suggestion more than welcome.
Thank you if you read through :)
