Bug#647095: CPU hyperthreading turned on after soft power-cycle

2011-12-02 Thread Clarinet



Ok, this also confirms that the board had issues *before* any changes
were made to the RTC core. I'd push the board vendor to update the BIOS
to avoid this issue.

Even so, I'm curious as to what exactly trips it up. Maybe we can
provide a module option for the rtc-cmos driver to disable the alarm
functionality, so you can at least avoid the issue until the board
vendor fixes the problem (if ever).

Assuming it's the alarm being set, could you try the following on a
current kernel and let me know if it still shows the problem? hwclock
might throw some odd messages with this test patch, but those can be
ignored.


John,

I applied the patch to 2.6.38 and tested the patched kernel. It is
bad, i.e. it exhibits the strange behavior the same way as unpatched
2.6.38.


I understand that the BIOS is bad, but I am also very curious what
exactly in the kernel reveals the problem. Please let's go on with testing.


By the way, why do you think the problem appeared only when halt was 
called after running rtctest, and did not appear when reboot was 
called after running rtctest?


Best regards,

Jiri



thanks
-john

diff --git a/drivers/rtc/rtc-cmos.c b/drivers/rtc/rtc-cmos.c
index 05beb6c..d9814aa 100644
--- a/drivers/rtc/rtc-cmos.c
+++ b/drivers/rtc/rtc-cmos.c
@@ -305,8 +305,8 @@ static void cmos_irq_enable(struct cmos_rtc *cmos, unsigned char mask)
    cmos_checkintr(cmos, rtc_control);

    rtc_control |= mask;
-   CMOS_WRITE(rtc_control, RTC_CONTROL);
-   hpet_set_rtc_irq_bit(mask);
+// CMOS_WRITE(rtc_control, RTC_CONTROL);
+// hpet_set_rtc_irq_bit(mask);

    cmos_checkintr(cmos, rtc_control);
 }





--
To UNSUBSCRIBE, email to debian-kernel-requ...@lists.debian.org
with a subject of unsubscribe. Trouble? Contact listmas...@lists.debian.org
Archive: http://lists.debian.org/4ed8ac1a.3050...@atlas.cz



Bug#647095: CPU hyperthreading turned on after soft power-cycle

2011-11-29 Thread Clarinet



Using an older known-good kernel, could you build and run the test
case at the end of Documentation/rtc.txt a few times and see if it
triggers the same problem?

I'm suspicious that setting the alarm is what's tripping the BIOS
into enabling the HT bit. Because with older kernels, we used PIE mode
irqs which hwclock usually uses at boot, but with newer kernels, we
emulate PIE via AIE alarm mode. So if the BIOS was broken before, you
wouldn't have noticed unless you tried to use AIE irqs.
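
The PIE and AIE modes mentioned above correspond to interrupt-enable
bits in the RTC control register. What follows is only a minimal Python
sketch, assuming the conventional MC146818-style register B bit values
(PIE 0x40, AIE 0x20, UIE 0x10). The helper names are hypothetical and
nothing here touches real hardware; it only models the bit arithmetic.

```python
# Interrupt-enable bits of an MC146818-style RTC control register
# (register B). Values are assumed from the conventional layout.
RTC_PIE = 0x40  # periodic interrupt enable
RTC_AIE = 0x20  # alarm interrupt enable
RTC_UIE = 0x10  # update-ended interrupt enable


def irq_enable(rtc_control, mask):
    """Model of `rtc_control |= mask`: return the value with mask bits set."""
    return rtc_control | mask


def enabled_irqs(rtc_control):
    """Return the names of the interrupt-enable bits set in rtc_control."""
    names = []
    if rtc_control & RTC_PIE:
        names.append("PIE")
    if rtc_control & RTC_AIE:
        names.append("AIE")
    if rtc_control & RTC_UIE:
        names.append("UIE")
    return names
```

In this model, a kernel that emulates periodic interrupts through the
alarm would end up setting the AIE bit where an older kernel would have
set PIE, which is the difference suspected of upsetting the BIOS.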

If this doesn't work, I'll get some patches to both 2.6.27 and 2.6.28
kernels to debug the exact flow of how we're touching the hardware and
then we can further narrow it down.


I ran the tests the following way:

- boot 2.6.37.6 - check /proc/cpuinfo - 12 processors
- halt
- boot 2.6.37.6 - check /proc/cpuinfo - 12 processors
- run rtctest
- reboot
- boot 2.6.37.6 - check /proc/cpuinfo - 12 processors
- halt
- boot 2.6.37.6 - check /proc/cpuinfo - 12 processors
- run rtctest
- halt
- boot 2.6.37.6 - check /proc/cpuinfo - 24 processors

So the conclusion is that only if rtctest is run and the machine is 
halted, it triggers the HT problem. Reboot seems to neutralize 
whatever rtctest did.
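
The processor counts in the test sequence above come from
/proc/cpuinfo. A rough Python sketch of how the check could be
automated, assuming the x86 layout in which each logical CPU is a
blank-line-separated block with `processor`, `physical id` and
`core id` fields. The function names and the sample text are
hypothetical, not taken from the affected machine.

```python
def parse_cpuinfo(text):
    """Split /proc/cpuinfo-style text into one dict per logical CPU."""
    cpus = []
    for block in text.strip().split("\n\n"):
        fields = {}
        for line in block.splitlines():
            key, sep, value = line.partition(":")
            if sep:
                fields[key.strip()] = value.strip()
        if "processor" in fields:
            cpus.append(fields)
    return cpus


def smt_summary(text):
    """Return (logical CPU count, physical core count).

    More logical CPUs than cores suggests hyperthreading is active.
    """
    cpus = parse_cpuinfo(text)
    cores = {(c.get("physical id"), c.get("core id")) for c in cpus}
    return len(cpus), len(cores)


# Hypothetical sample: two logical CPUs that share one physical core.
SAMPLE = "\n".join([
    "processor : 0",
    "physical id : 0",
    "core id : 0",
    "",
    "processor : 1",
    "physical id : 0",
    "core id : 0",
])
```

On a real system the text would be read from /proc/cpuinfo. Comparing
the two numbers after each boot would flag the halt that flips the
machine from 12 to 24 logical processors.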


Jiri



Archive: http://lists.debian.org/4ed4cf7c.8020...@atlas.cz



Bug#647095: CPU hyperthreading turned on after soft power-cycle

2011-11-16 Thread Clarinet


Hi all,


Result of bisecting: v2.6.38-rc1 exhibits the problem. v2.6.37 and
many of the topic branches merged in the 2.6.38 merge window work ok.
Some other topic branches do not boot at all.

Jiri: if you have gitk installed, then git bisect visualize can help
get a sense of what's in the middle of the regression range.
gitk --bisect --first-parent v2.6.37..v2.6.38-rc1 might be a good way
to find mainline commits to test before finding a topic branch to delve
into.


I have been able to narrow the interval manually a little bit from the
top (the bad side) and I will go on from the bottom now. However,
there seems to be a large area where kernels are unbootable for me -
they mostly stop when init is called and I do not know why.


Finally! After another 50+ compilations I have it! It took some time, as
first I had to find the reason why some revisions did not boot (almost 2/3
were unbootable and the first bad commit was among them). Having solved
this, I have been able to bisect without skipping. The result is
surprising (at least for me): believe it or not, the first bad commit
is 6610e089 "RTC: Rework RTC code to use timerqueue for events" from
John Stultz (I am sending him a copy of this message).
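
The bisection with unbootable revisions described above can be
illustrated with a simplified Python sketch. It assumes a linear
history (real git bisect works on the commit graph), a known-good first
revision and a known-bad last one. The function name and the
"good"/"bad"/"skip" verdict strings are hypothetical.

```python
def first_bad_candidates(revisions, test):
    """Narrow down the first bad revision in an oldest-to-newest list.

    revisions[0] must be known good and revisions[-1] known bad.
    test(rev) returns "good", "bad" or "skip" (e.g. kernel does not boot).
    Returns the revisions that may contain the first bad one; the list
    has a single entry unless skipped revisions block further narrowing.
    """
    lo, hi = 0, len(revisions) - 1  # lo is good, hi is bad
    skipped = set()
    while hi - lo > 1:
        mid = (lo + hi) // 2
        # Testable revisions strictly between lo and hi, nearest to mid first.
        candidates = sorted(
            (i for i in range(lo + 1, hi) if i not in skipped),
            key=lambda i: abs(i - mid),
        )
        if not candidates:
            break  # only untestable revisions remain in between
        i = candidates[0]
        verdict = test(revisions[i])
        if verdict == "good":
            lo = i
        elif verdict == "bad":
            hi = i
        else:
            skipped.add(i)
    return revisions[lo + 1 : hi + 1]
```

When unbootable revisions sit right next to the first bad commit, the
result is a range rather than a single commit, which is why fixing the
boot failure first made it possible to finish the bisection.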


I would never expect this would be a problem, but my understanding of 
this commit is very limited, so I am certainly missing the point. 
However, I have tried to compile 2.6.38 (which was bad) with Real 
Time Clock configuration option turned off and it behaves normally 
then (= is good).


Can you please comment on this result? What does it mean? Any idea what is
wrong there?


Best regards,

Jiri Polach




Archive: http://lists.debian.org/4ec43df7.4010...@atlas.cz



Bug#647095: CPU hyperthreading turned on after soft power-cycle

2011-11-11 Thread Clarinet


Hi all,


Hi Jiri,

Jiri Polach wrote:


On Ben's advice I am trying to locate the commit that causes the problem to
appear more precisely using 'git bisect'. However, too many of generated
revisions are unbootable so I have to use 'bisect skip' frequently.


Ok, so I've looked over the log at http://bugs.debian.org/647095, and
this seems totally weird.  Have I described the symptoms correctly below?
(Warning: I am making some guesses, especially at step 5.  In case of
doubt, see the bug log just mentioned.)

1. Disable SMT in the BIOS.

2. Boot a bad kernel.  /proc/cpuinfo (correctly) shows one entry
   per core.

3. shutdown -h now.  Enter BIOS.  SMT is still disabled.
   Don't save.

4. Boot any kernel.  /proc/cpuinfo shows two entries per core.

5. shutdown -h now.  Boot any kernel.  /proc/cpuinfo still shows
   two entries per core.

6. shutdown -h now.  Enter BIOS.  SMT is still disabled.  Save.
   Now /proc/cpuinfo will (correctly) show one entry per core.

Reproducible for Jiri with v3.0.4.


Yes, this is exactly how it works. Something happens when the kernel
shuts down, not when it reboots.



Result of bisecting: v2.6.38-rc1 exhibits the problem.  v2.6.37 and
many of the topic branches merged in the 2.6.38 merge window work ok.
Some other topic branches do not boot at all.

Jiri: if you have gitk installed, then git bisect visualize can help
get a sense of what's in the middle of the regression range.
gitk --bisect --first-parent v2.6.37..v2.6.38-rc1 might be a good way
to find mainline commits to test before finding a topic branch to delve
into.


I have been able to narrow the interval manually a little bit from the 
top (the bad side) and I will go on from the bottom now. However, 
there seems to be a large area where kernels are unbootable for me - 
they mostly stop when init is called and I do not know why.



x86 people: do the symptoms seem familiar?  Any hints for tracking it
down?


Please! I have spent more than a month trying to resolve it. I cannot 
revert back to 2.6.37 kernels and I cannot live with SMT changing on 
every shutdown - I have too many servers to allow such unusual behavior ...


Thank you,

Jiri Polach


Thanks and hope that helps,
Jonathan





Archive: http://lists.debian.org/4ebd2825.6050...@atlas.cz



Bug#647095: CPU hyperthreading turned on after soft power-cycle

2011-10-31 Thread Clarinet

On 10/30/2011 4:25 PM, Ben Hutchings wrote:

On Sun, 2011-10-30 at 07:05 -0400, Jiri Polach wrote:

Package: linux-2.6
Version: 2.6.39-3~bpo60+1
Severity: normal


When the computer is turned off using the shutdown -h or halt command,
the hyperthreading BIOS setting is changed: even if hyperthreading is
disabled in the BIOS, the kernel detects twice as many processors on
the next boot, as if hyperthreading were enabled. Please see details below.

I have observed the problem on several Supermicro platforms with
various Intel Xeon processors. The particular case I report was
observed on Supermicro X8DTT-F mainboard with two Intel Xeon E5645
processors (6-core). The problem can be reproduced the following way:


By my understanding of how hyperthreading is controlled, this has to be
a BIOS bug, as you seem to have suspected.  But if the BIOS behaviour is
kernel version-dependent, then presumably there is something the kernel
can do to work around it.


Yes, there are reasons that support my suspicion that BIOS is not doing 
its work properly. But I cannot prove it until it is clear what has been 
changed in the kernel.



1. Turn on the computer, go to BIOS setup and turn Simultaneous
multithreading to Disabled. Boot Debian.

2. Check with cat /proc/cpuinfo that the system reports 12 CPUs (2 x
six-core processor).

3. (optionally) Reboot the system (shutdown -r) and check that there
are still 12 CPUs detected and reported.

4. Halt the system using shutdown -h or halt, turn it on again,
and boot Debian.


I assume from this that shutdown -h is configured to turn the system
off.


I do not know. I have been using mostly halt to shutdown the system 
and turn the server off and I tried shutdown -h only several times to 
see if there is any difference. Both commands have turned the computer 
off, but I did not do any special shutdown -h configuration.



5. Check the number of CPUs reported - it will show you that there are
24 CPUs as if hyperthreading was enabled.

6. Reboot and go to BIOS setup - it still shows that Simultaneous
multithreading is set to Disabled. Do not change anything, just
select Save and Exit. Boot Debian and check the number of CPUs - it
now shows 12 CPUs again.

I have tested several kernel versions and it seems that this behavior
appeared for the first time somewhere between versions 2.6.35.7 and
2.6.38.6 (ok = does not show the described behavior, not ok = does
show it):

* linux-image-2.6.32-5-amd64 official Debian - ok
* linux-image-2.6.39-bpo.2-amd64 official Debian from backports - not ok

* linux 2.6.35.7 - custom compiled from source - ok
* linux 2.6.38.6 - custom compiled from source - not ok
* linux 2.6.39.4 - custom compiled from source - not ok
* linux 3.0.4 - custom compiled from source - not ok


That might be too large a range for developers to consider.  Can you
test some versions between 2.6.35.7 and 2.6.38.6 (bisection)?


OK, after another day of testing it seems that the problem appears in 
2.6.38.1, because


* linux 2.6.37.6 - custom compiled from source - ok
* linux 2.6.38.1 - custom compiled from source - not ok
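
The ok / not ok results listed in this thread can be reduced to a
regression window mechanically. A small Python sketch, assuming plain
dotted version strings and that every version older than the first bad
one is good. The function names are hypothetical, and the Debian
packages are entered by their base versions 2.6.32 and 2.6.39.

```python
def version_key(version):
    """Turn '2.6.37.6' into (2, 6, 37, 6) so versions sort numerically."""
    return tuple(int(part) for part in version.split("."))


def regression_window(results):
    """Return (newest good version, oldest bad version).

    results maps a version string to True (ok) or False (not ok).
    Raises ValueError if a good version is newer than a bad one.
    """
    ordered = sorted(results, key=version_key)
    good = [v for v in ordered if results[v]]
    bad = [v for v in ordered if not results[v]]
    last_good, first_bad = good[-1], bad[0]
    if version_key(last_good) > version_key(first_bad):
        raise ValueError("good and bad results overlap")
    return last_good, first_bad


# Results reported in this thread (True = ok, False = not ok).
RESULTS = {
    "2.6.32": True,
    "2.6.35.7": True,
    "2.6.37.6": True,
    "2.6.38.1": False,
    "2.6.38.6": False,
    "2.6.39": False,
    "2.6.39.4": False,
    "3.0.4": False,
}
```

Calling regression_window(RESULTS) picks out 2.6.37.6 and 2.6.38.1 as
the boundary, matching the conclusion above.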

Best regards,

Jiri Polach


Ben.


I have exchanged many e-mails with the Supermicro distributor, who
apparently is in direct contact with Supermicro technicians. They more
or less deny any responsibility for this problem and repeatedly point
to the fact that some (older) kernels do not exhibit this behavior so
it must be a kernel problem. Their representative writes:

I discussed this with supermicro and they informed me that the Kernel
itself is causing the issue, that it may be sending the hyperthreading
command code to the BIOS.

Although I do not completely agree with their arguments, my knowledge
is not deep enough to recognize where exactly the core of the problem
is so I report this as a bug in a hope that someone will know what
happens when a kernel turns a computer off and what has changed in
kernel somewhere between the versions I mention above. I have asked
Supermicro distributor for more information on what they think happens
there and what exactly they mean by hyperthreading command code and I
am waiting for their response.

-- Package-specific info:
** Version:
Linux version 2.6.39-bpo.2-amd64 (Debian 2.6.39-3~bpo60+1) (norb...@tretkowski.de) (gcc version 4.4.5 (Debian 4.4.5-8) ) #1 SMP Tue Jul 26 10:35:23 UTC 2011

[...]

** Model information
sys_vendor: Supermicro
product_name: X8DTT
product_version: 1234567890
chassis_vendor: Supermicro
chassis_version: 1234567890
bios_vendor: American Megatrends Inc.
bios_version: 080016
board_vendor: Supermicro
board_name: X8DTT
board_version: 2.0

[...]





Archive: http://lists.debian.org/4eae9d3a.7000...@atlas.cz