Bug#404148: kernel: data corruption with nvidia chipsets and IDE/SATA drives // memory hole mapping

2007-04-04 Thread Steve Langasek
On Sat, Mar 31, 2007 at 11:29:23PM +0200, Christoph Anton Mitterer wrote:
> Steve Langasek wrote:
> > Well, there's no reason that someone can't use iommu=soft when booting the
> > installer, as well.  So perhaps it would be best to clone that bug and
> > include this information in the installation guide or errata?

> Yes that's a good idea.

> I assume it would also be a problem to just set the installer to
> iommu=soft (e.g. via the bootloader)?

Yes, it is a problem; there is no window of opportunity for making such a
change before release (there was even less of one than for the kernel), and
the fix properly belongs in the kernel package, not in the installer or
bootloaders.

> One last thing perhaps. I'd include a link to the kernel.org bug report
> in your release notes text and maybe some information that systems might
> already have some data corruption (as this bug is not new).

Link to kernel.org is included. "Systems might already have some data
corruption" is not relevant to the release notes as far as I can see: the
release notes are about upgrades from sarge, which did not have this problem
(because it didn't support hw iommu at all).

> btw: Is the kernel team now aware of your patch and will it use it in
> following linux-* packages? i.e. in unstable?

The kernel team is aware of it, but no decision has been made yet to include
it in the kernel packages, since the verdict is still out upstream.

-- 
Steve Langasek   Give me a lever long enough and a Free OS
Debian Developer   to set it on, and I can move the world.
[EMAIL PROTECTED]   http://www.debian.org/


-- 
To UNSUBSCRIBE, email to [EMAIL PROTECTED]
with a subject of "unsubscribe". Trouble? Contact [EMAIL PROTECTED]



Bug#404148: kernel: data corruption with nvidia chipsets and IDE/SATA drives // memory hole mapping

2007-03-31 Thread Christoph Anton Mitterer
Steve Langasek wrote:
> Well, there's no reason that someone can't use iommu=soft when booting the
> installer, as well.  So perhaps it would be best to clone that bug and
> include this information in the installation guide or errata?
>   
Yes that's a good idea.

I assume it would also be a problem to just set the installer to
iommu=soft (e.g. via the bootloader)?

One last thing perhaps. I'd include a link to the kernel.org bug report
in your release notes text and maybe some information that systems might
already have some data corruption (as this bug is not new).

btw: Is the kernel team now aware of your patch and will it use it in
following linux-* packages? i.e. in unstable?

Best wishes,
Chris.
begin:vcard
fn:Mitterer, Christoph Anton
n:Mitterer;Christoph Anton
email;internet:[EMAIL PROTECTED]
x-mozilla-html:TRUE
version:2.1
end:vcard



Bug#404148: kernel: data corruption with nvidia chipsets and IDE/SATA drives // memory hole mapping

2007-03-31 Thread Steve Langasek
On Sat, Mar 31, 2007 at 07:59:44PM +0200, Christoph Anton Mitterer wrote:
> Andreas Barth wrote:
> > BTW, we intended to have frequent kernel uploads to proposed-updates,
> > and frankly speaking, I personally don't mind to already have a newer
> > kernel in proposed-updates during the release, but that's something I
> > want to have signed-off by Martin.
> The main problem with the whole release-notes-only-issue is,... that
> data corruption might already occur during installation. So even if the
> user reads the release notes (I assume this happens mostly (if at all)
> after installation) he might already have some corruptions.

Well, there's no reason that someone can't use iommu=soft when booting the
installer, as well.  So perhaps it would be best to clone that bug and
include this information in the installation guide or errata?

-- 
Steve Langasek   Give me a lever long enough and a Free OS
Debian Developer   to set it on, and I can move the world.
[EMAIL PROTECTED]   http://www.debian.org/





Bug#404148: kernel: data corruption with nvidia chipsets and IDE/SATA drives // memory hole mapping

2007-03-31 Thread dann frazier
On Sat, Mar 31, 2007 at 03:58:49AM -0700, Steve Langasek wrote:
> On Sat, Mar 31, 2007 at 01:29:04AM +0200, Christoph Anton Mitterer wrote:
> 
> > As I've told you in my email before I just tested your patch with the
> > following results (used linux-source-2.6.18 (2.6.18.dfsg.1-12) from
> > testing, of course on an amd64 system):
> 
> > - The patch applies without problems
> > - The kernel compiles with it without problems (at least with my config)
> > - It boots correctly
> > - and it automatically disables the hardware iommu (look at my dmesg below):
> 
> Thanks, that's great to hear.

Agreed - good work on the patch Steve, and thanks for testing Christoph.

> > I would say (although I'm by no means a kernel expert) that your
> > patch looks good and I _strongly_ recommend including it in etch r0 (!!)...
> > You're the release manager,... so you should be able to manage this :-)
> 
> It wouldn't be appropriate for me to push this without the consent of the
> rest of the kernel team just because I'm the release manager; I'm not even
> an amd64 porter, this should be signed off on by the folks who are actually
> responsible for the amd64 kernel first.

I see no reason not to include it in r1, at least until upstream finds
something better. Does anyone disagree?

-- 
dann frazier






Bug#404148: kernel: data corruption with nvidia chipsets and IDE/SATA drives // memory hole mapping

2007-03-31 Thread Sven Luther
On Sat, Mar 31, 2007 at 08:18:26PM +0200, Andreas Barth wrote:
> * Sven Luther ([EMAIL PROTECTED]) [070331 16:03]:
> > The ideal would have been a framework where we could build new kernels and
> > have it integrated within a few days only. I gave a speech about this at
> > FOSDEM, of how we could use the initramfs incremental nature, to separate
> > fully the kernel module .udebs from the rest of d-i, and have actual d-i
> > images which are daily built, and usable independently of the kernel used.
> 
> Sven, sorry but this doesn't have anything to do with the installer now.
> But that we refrain from making new uploads of the kernel less than a
> week prior to release - for the simple reason the kernel *is* a central
> component.

So what? The reality is that all progress in this direction was stopped cold
one year ago, with the consequences that we know, and that we face again the
exact same situation we had for sarge, which was released with a known
security hole.

Hurt,

Sven Luther





Bug#404148: kernel: data corruption with nvidia chipsets and IDE/SATA drives // memory hole mapping

2007-03-31 Thread Andreas Barth
* Sven Luther ([EMAIL PROTECTED]) [070331 16:03]:
> The ideal would have been a framework where we could build new kernels and
> have it integrated within a few days only. I gave a speech about this at
> FOSDEM, of how we could use the initramfs incremental nature, to separate
> fully the kernel module .udebs from the rest of d-i, and have actual d-i
> images which are daily built, and usable independently of the kernel used.

Sven, sorry but this doesn't have anything to do with the installer now.
But that we refrain from making new uploads of the kernel less than a
week prior to release - for the simple reason the kernel *is* a central
component.


Cheers,
Andi
-- 
  http://home.arcor.de/andreas-barth/





Bug#404148: kernel: data corruption with nvidia chipsets and IDE/SATA drives // memory hole mapping

2007-03-31 Thread Christoph Anton Mitterer
Andreas Barth wrote:
> BTW, we intended to have frequent kernel uploads to proposed-updates,
> and frankly speaking, I personally don't mind to already have a newer
> kernel in proposed-updates during the release, but that's something I
> want to have signed-off by Martin.
The main problem with the whole release-notes-only-issue is,... that
data corruption might already occur during installation. So even if the
user reads the release notes (I assume this happens mostly (if at all)
after installation) he might already have some corruptions.

Chris.



Bug#404148: kernel: data corruption with nvidia chipsets and IDE/SATA drives // memory hole mapping

2007-03-31 Thread Sven Luther
On Sat, Mar 31, 2007 at 03:11:09PM +0200, Andreas Barth wrote:
> * Steve Langasek ([EMAIL PROTECTED]) [070331 12:59]:
> > On Sat, Mar 31, 2007 at 01:29:04AM +0200, Christoph Anton Mitterer wrote:
> > > I would say (although I'm by no means a kernel expert) that your
> > > patch looks good and I _strongly_ recommend including it in etch r0
> > > (!!)...
> > > You're the release manager,... so you should be able to manage this :-)
> > 
> > It wouldn't be appropriate for me to push this without the consent of the
> > rest of the kernel team just because I'm the release manager; I'm not even
> > an amd64 porter, this should be signed off on by the folks who are actually
> > responsible for the amd64 kernel first.  But regardless, there are no plans
> > for another kernel update before etch r0, and including one is likely to
> > delay the release.  I'm of the opinion that this bug does not justify a
> > delay at this point.
> > 
> > With the consent of the kernel team and the stable release managers, I'm
> > happy to commit this patch to the queue for the next kernel update though,
> > so it can be included in etch r1.
> 
> 
> BTW, we intended to have frequent kernel uploads to proposed-updates,
> and frankly speaking, I personally don't mind to already have a newer
> kernel in proposed-updates during the release, but that's something I
> want to have signed-off by Martin.

The ideal would have been a framework where we could build new kernels and
have it integrated within a few days only. I gave a speech about this at
FOSDEM, of how we could use the initramfs incremental nature, to separate
fully the kernel module .udebs from the rest of d-i, and have actual d-i
images which are daily built, and usable independently of the kernel used.

This is already the second release where such problems happen, so let's hope
that people get more reasonable about trying to solve this through the
available technical solution for lenny.

Because *IT IS POSSIBLE* :)

Friendly,

Sven Luther





Bug#404148: kernel: data corruption with nvidia chipsets and IDE/SATA drives // memory hole mapping

2007-03-31 Thread Andreas Barth
* Steve Langasek ([EMAIL PROTECTED]) [070331 12:59]:
> On Sat, Mar 31, 2007 at 01:29:04AM +0200, Christoph Anton Mitterer wrote:
> > I would say (although I'm by no means a kernel expert) that your
> > patch looks good and I _strongly_ recommend including it in etch r0 (!!)...
> > You're the release manager,... so you should be able to manage this :-)
> 
> It wouldn't be appropriate for me to push this without the consent of the
> rest of the kernel team just because I'm the release manager; I'm not even
> an amd64 porter, this should be signed off on by the folks who are actually
> responsible for the amd64 kernel first.  But regardless, there are no plans
> for another kernel update before etch r0, and including one is likely to
> delay the release.  I'm of the opinion that this bug does not justify a
> delay at this point.
> 
> With the consent of the kernel team and the stable release managers, I'm
> happy to commit this patch to the queue for the next kernel update though,
> so it can be included in etch r1.


BTW, we intended to have frequent kernel uploads to proposed-updates,
and frankly speaking, I personally don't mind to already have a newer
kernel in proposed-updates during the release, but that's something I
want to have signed-off by Martin.


Cheers,
Andi





Bug#404148: kernel: data corruption with nvidia chipsets and IDE/SATA drives // memory hole mapping

2007-03-31 Thread Christoph Anton Mitterer
Steve Langasek wrote:
> But regardless, there are no plans
> for another kernel update before etch r0, and including one is likely to
> delay the release.  I'm of the opinion that this bug does not justify a
> delay at this point.
>   
Uhm, sad to hear this...

> With the consent of the kernel team and the stable release managers, I'm
> happy to commit this patch to the queue for the next kernel update though,
> so it can be included in etch r1.
>   
k,... perhaps we will have a real solution in the meantime. It seems
like AMD makes some progress.


>> But I would say that you should add some notes to the release notes.
>> 
> Yes, that's now bug #416374, which includes a suggested text.
k,.. at least something ;-)


Best wishes,
Chris.



Bug#404148: kernel: data corruption with nvidia chipsets and IDE/SATA drives // memory hole mapping

2007-03-31 Thread Steve Langasek
On Sat, Mar 31, 2007 at 01:29:04AM +0200, Christoph Anton Mitterer wrote:

> As I've told you in my email before I just tested your patch with the
> following results (used linux-source-2.6.18 (2.6.18.dfsg.1-12) from
> testing, of course on an amd64 system):

> - The patch applies without problems
> - The kernel compiles with it without problems (at least with my config)
> - It boots correctly
> - and it automatically disables the hardware iommu (look at my dmesg below):

Thanks, that's great to hear.

> I would say (although I'm by no means a kernel expert) that your
> patch looks good and I _strongly_ recommend including it in etch r0 (!!)...
> You're the release manager,... so you should be able to manage this :-)

It wouldn't be appropriate for me to push this without the consent of the
rest of the kernel team just because I'm the release manager; I'm not even
an amd64 porter, this should be signed off on by the folks who are actually
responsible for the amd64 kernel first.  But regardless, there are no plans
for another kernel update before etch r0, and including one is likely to
delay the release.  I'm of the opinion that this bug does not justify a
delay at this point.

With the consent of the kernel team and the stable release managers, I'm
happy to commit this patch to the queue for the next kernel update though,
so it can be included in etch r1.

> But I would say that you should add some notes to the release notes.

Yes, that's now bug #416374, which includes a suggested text.

Thanks,
-- 
Steve Langasek   Give me a lever long enough and a Free OS
Debian Developer   to set it on, and I can move the world.
[EMAIL PROTECTED]   http://www.debian.org/





Bug#404148: kernel: data corruption with nvidia chipsets and IDE/SATA drives // memory hole mapping

2007-03-30 Thread Christoph Anton Mitterer
Hi Steve.

As I've told you in my email before I just tested your patch with the
following results (used linux-source-2.6.18 (2.6.18.dfsg.1-12) from
testing, of course on an amd64 system):

- The patch applies without problems
- The kernel compiles with it without problems (at least with my config)
- It boots correctly
- and it automatically disables the hardware iommu (look at my dmesg below):

Bootdata ok (command line is root=/dev/sda1 ro snd-ice1724.index=0
snd-intel8x0.index=1 )
Linux version 2.6.18debtest (Version:) ([EMAIL PROTECTED]) (gcc version
4.1.2 20061115 (prerelease) (Debian 4.1.1-21)) #1 SMP PREEMPT Sat Mar 31
00:42:51 CEST 2007
BIOS-provided physical RAM map:


  Normal zone: 387840 pages, LIFO batch:31
Nvidia board detected. Ignoring ACPI timer override.
Looks like an nvidia chipset. Disabling HW IOMMU. Override with
"iommu=allowed"
ACPI: PM-Timer IO Port: 0x8008


CPU 0: aperture @ ac00 size 64 MB
CPU 1: aperture @ ac00 size 64 MB
PCI-DMA: Using software bounce buffering for IO (SWIOTLB)
Placing software IO TLB between 0x165 - 0x565
Memory: 4036552k/5767168k available (3007k kernel code, 156324k
reserved, 1245k data, 216k init)
Calibrating delay using timer specific routine.. 4422.28 BogoMIPS
(lpj=2211140)
Security Framework v1.0.0 initialized



So you see later on the kernel correctly reports to use the swiotlb.
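
The same check can be scripted for other machines. The following is only a minimal sketch and not part of the patch: the function name `dma_mode_from_dmesg` is hypothetical, and the matched strings are copied from the 2.6.18 output quoted above, so other kernel versions may word these messages differently.

```python
def dma_mode_from_dmesg(log_text):
    """Classify a kernel log by the DMA-related messages quoted above.

    Returns "swiotlb" if the software bounce-buffer message is present,
    "hw-iommu-disabled" if only the chipset message from the patch is
    present, and "unknown" otherwise.
    """
    # Message strings copied from the dmesg excerpt in this mail;
    # they are assumptions for any other kernel version.
    swiotlb_msg = "Using software bounce buffering for IO (SWIOTLB)"
    disabled_msg = "Disabling HW IOMMU"

    if swiotlb_msg in log_text:
        return "swiotlb"
    if disabled_msg in log_text:
        return "hw-iommu-disabled"
    return "unknown"


sample = (
    "Looks like an nvidia chipset. Disabling HW IOMMU. Override with\n"
    "PCI-DMA: Using software bounce buffering for IO (SWIOTLB)\n"
)
print(dma_mode_from_dmesg(sample))
```

To use it on a real system, one would feed it the saved output of `dmesg` instead of the sample string.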

I would say (although I'm by no means a kernel expert) that your
patch looks good and I _strongly_ recommend including it in etch r0 (!!)...
You're the release manager,... so you should be able to manage this :-)

But I would say that you should add some notes to the release notes.

btw: I've CC'ed the mail to Andy so if you don't have time to do this he
might... uh and for Andy: have you already signed the etch release key
and have you found some time to sign my personal key I gave you at
the last Stammtisch?!

Best wishes,
Chris.



Bug#404148: kernel: data corruption with nvidia chipsets and IDE/SATA drives // memory hole mapping related bug?!

2007-03-27 Thread Steve Langasek
clone 404148 -1
reassign -1 release-notes
tags 404148 etch-ignore
tags -1 -patch
thanks

Since no one was able to test the provided patch, linux-2.6 2.6.18.dfsg.1-12
has been uploaded to unstable without it, which means fixing this has missed
the last kernel upload for etch r0.

That leaves this as a documentation issue for the release notes.  Here is a
proposed description:

  Some amd64 systems which have nvidia chipsets and more than 3GB of RAM
  seem to have an issue with 32-bit PCI access that may result in sporadic
  data corruption when accessing disks.  Because this problem was still
  under investigation by the Linux kernel developers at the time of release,
  it was not possible to include a fix for this problem in the etch kernel
  packages.  As a workaround, users running etch on such a system are
  recommended to add 'iommu=soft' to their kernel boot options.  See Debian
  bug #404148 for full details.
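
Whether the workaround is actually in effect on a running system can be checked from the kernel command line. This is a minimal sketch under stated assumptions: the helper names `iommu_soft_requested` and `running_kernel_cmdline` are hypothetical, the command line is read from `/proc/cmdline`, and the `iommu=` value is assumed to be a comma-separated list of options.

```python
def iommu_soft_requested(cmdline):
    """Return True if the kernel command line contains iommu=...soft..."""
    for token in cmdline.split():
        if token.startswith("iommu="):
            # Assumption: iommu= takes a comma-separated option list.
            options = token[len("iommu="):].split(",")
            if "soft" in options:
                return True
    return False


def running_kernel_cmdline(path="/proc/cmdline"):
    """Read the running kernel's command line; return "" if unavailable."""
    try:
        with open(path) as f:
            return f.read().strip()
    except OSError:
        return ""


# Example command lines (illustrative only).
print(iommu_soft_requested("root=/dev/sda1 ro iommu=soft"))
print(iommu_soft_requested("root=/dev/sda1 ro"))
```

On an affected machine, `iommu_soft_requested(running_kernel_cmdline())` returning False would suggest the boot option has not been added yet.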

Thanks,
-- 
Steve Langasek   Give me a lever long enough and a Free OS
Debian Developer   to set it on, and I can move the world.
[EMAIL PROTECTED]   http://www.debian.org/





Bug#404148: kernel: data corruption with nvidia chipsets and IDE/SATA drives // memory hole mapping related bug?!

2007-03-12 Thread Sven Luther
On Mon, Mar 12, 2007 at 11:25:13PM -0700, Steve Langasek wrote:
> So regrettably, this bug went more or less unnoticed on the 'kernel'
> pseudopackage until now, and it does appear (based on the upstream
> discussion) to affect the etch kernels.  And in addition to it being noticed
> after the upload of 2.6.18.dfsg.1-11, there also doesn't seem to be a real
> upstream fix available for the problem yet.
> 
> There does seem to be a workaround available though, which is iommu=soft.
> At the moment, I'm doubtful that we could change the kernel to force this
> setting on only the nvidia chipsets in time for etch.  Should we instead tag
> this bug etch-ignore, and refer the iommu=soft workaround to the release
> notes?

Could this also be related to my #414580 problems ? Will try the iommu=soft
option now. Mmm, ...

No, iommu=soft doesn't seem to help there.

Friendly,

Sven Luther





Bug#404148: kernel: data corruption with nvidia chipsets and IDE/SATA drives // memory hole mapping related bug?!

2007-03-12 Thread Steve Langasek
So regrettably, this bug went more or less unnoticed on the 'kernel'
pseudopackage until now, and it does appear (based on the upstream
discussion) to affect the etch kernels.  And in addition to it being noticed
after the upload of 2.6.18.dfsg.1-11, there also doesn't seem to be a real
upstream fix available for the problem yet.

There does seem to be a workaround available though, which is iommu=soft.
At the moment, I'm doubtful that we could change the kernel to force this
setting on only the nvidia chipsets in time for etch.  Should we instead tag
this bug etch-ignore, and refer the iommu=soft workaround to the release
notes?

Thanks,
-- 
Steve Langasek   Give me a lever long enough and a Free OS
Debian Developer   to set it on, and I can move the world.
[EMAIL PROTECTED]   http://www.debian.org/





Bug#404148: kernel: data corruption with nvidia chipsets and IDE/SATA drives // memory hole mapping related bug?!

2006-12-21 Thread Christoph Anton Mitterer
Package: kernel
Severity: critical
Justification: causes serious data loss

Hi everybody.

I'm currently (together with others) investigating a severe data
corruption problem that many users might suffer from.

A short description: when you verify many GBs of data over and over with
md5sums (or another hash), errors are found.
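
A test of this kind can be sketched roughly as follows. This is only an illustration and not the exact procedure used on lkml: the function names `file_md5` and `find_mismatches` are hypothetical. Note that the data set should be larger than main memory, since otherwise repeated reads may be served from the page cache rather than from disk.

```python
import hashlib


def file_md5(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_mismatches(paths, rounds=3):
    """Hash every file `rounds` times in total.

    The first pass is the baseline; each later pass is compared against
    it. Returns a list of (path, pass_number, baseline, current) tuples
    for every digest that differs from the baseline.
    """
    baseline = {p: file_md5(p) for p in paths}
    mismatches = []
    for pass_number in range(2, rounds + 1):
        for p in paths:
            current = file_md5(p)
            if current != baseline[p]:
                mismatches.append((p, pass_number, baseline[p], current))
    return mismatches
```

On a healthy system `find_mismatches` should return an empty list for files that are not being modified; any entry means that the same file produced different digests on different reads.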

We do not yet know the real reason for the problems but it might relate
to Opteron (and perhaps Athlon) CPUs and/or Nvidia chipsets (mainboard).
So it might be a hardware design error (but even a kernel error could be
possible).
This is definitely not a single hardware issue of my system as many
other users on lkml reported the problem (and we all did very extensive
hardware tests).

The error occurs only if one has so much memory that the system uses
memory mapping (and the hardware iommu).
At lkml we have so far found two "solutions" (I consider them more
workarounds, as we don't know exactly why they solve the problem):
1) Disabling memory hole mapping in the system BIOS. The downside is
that there is no memory hole mapping at all, and the user loses much
of his main memory (in my case 1.5 GB).
2) Setting iommu=soft. The user keeps his full memory, and in all our
tests (at least as far as I am informed; we run a great many tests, as I
and someone else administer some big Linux clusters) the error did
_not_ occur.

Windows users generally do not suffer from this corruption, as Windows
(at least until Vista) was not able to make use of the hardware iommu,
and always uses the software iommu.
The Intel CPUs with EM64T/Intel64 don't suffer from that problem either,
as they don't have a hardware iommu (at least as far as I know).

We are not yet sure if this is a large scale problem or affects only
some special hardware combinations. We do however think that the issue
occurs only with PCI-DMA accesses. (Tests showed that when disabling
DMA, or at least using slower DMA modes on the disks, the issue disappeared.)
The problem is that vendors (at least Nvidia) do not help very much; they
didn't even answer my mails.
And most "normal" users won't notice this problem, as they don't have
enough main memory, and even if they have, the error occurs very rarely
(perhaps some 100 bytes every 30 GB <- only a very imprecise scale).

What I suggest now:
As this is a very grave issue, I suggest

- to configure all the default kernels for etch that may be affected (as
far as I know those are the amd64-k8 and amd64-generic kernels. Perhaps
the i386 packages too, have a look at lkml for this) to use iommu=soft.
- to update all packages in sarge and woody (as far as they might be
affected)
- to put some warnings in the packages where users might configure their
own kernel and the boot-loaders.

Have a look at this thread at lkml
http://marc.theaimsgroup.com/?t=11650212181&r=1&w=2 for in-depth
information.
It also contains links to some previous threads. There are also some
posts to lkml about this topic in separate threads (e.g. "amd64 iommu
causing corruption? (was Re: data corruption with nvidia chipsets and
IDE/SATA drives // memory hole mapping related bug?!)").

Best wishes,
Chris.

btw: please CC me as I'm off-list at the moment.
PS: I'll also write this to the debian-kernel mailing list.



-- System Information:
Debian Release: 4.0
  APT prefers unstable
  APT policy: (500, 'unstable')
Architecture: amd64 (x86_64)
Shell:  /bin/sh linked to /bin/dash
Kernel: Linux 2.6.18
Locale: [EMAIL PROTECTED], [EMAIL PROTECTED] (charmap=UTF-8)

