Re: Linux 2.6.39-rc3

2011-05-06 Thread Linus Torvalds
On Wednesday, April 13, 2011, Linus Torvalds
torva...@linux-foundation.org wrote:
 On Wednesday, April 13, 2011, H. Peter Anvin h...@zytor.com wrote:

 Yes.  However, even if we *do* revert (and the time is running short on
 not reverting) I would like to understand this particular one, simply
 because I think it may very well be a problem that is manifesting itself
 in other ways on other systems.

 sorry, fingerfart. Anyway, I agree 100%.

 we definitely want to also understand the reason for things not
working, even if we do revert..

Linus

___
dri-devel mailing list
dri-devel@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/dri-devel


Re: Linux 2.6.39-rc3

2011-04-18 Thread Alex Deucher
On Sun, Apr 17, 2011 at 10:09 AM, Joerg Roedel j...@8bytes.org wrote:
 On Sat, Apr 16, 2011 at 02:54:04PM -0400, Jerome Glisse wrote:

 If you want to go the printk way you can add printk before each test
 ring_test, ib_test in r600.c this 2 functions are the own that might
 trigger the first GPU gart activities.

 Okay, I found the place in source that triggers this. It happens in the
 function r600_ib_test. The interesting thing is that not the ib-command
 itself is responsible but the fence that is emitted afterwards (proved
 by removing the fence command, where the problem went away).
 I don't know enough about the command semantics to make a guess what
 goes wrong there. But maybe you GPU folks have an idea?


I can't think of anything offhand.  It might be worth disabling the
call to r600_ib_test() in r600_init() and then seeing whether you get
errors when the fences are used later on, when X starts, or only at
that point in the module load sequence.  What's odd is that you got the
same behavior with radeon.no_wb=1: that option disables shadowing of
fence writes to GPU GART memory, so the GPU wouldn't be writing to
memory in that case.

Alex

        Joerg



Re: Linux 2.6.39-rc3

2011-04-18 Thread Jerome Glisse
On Mon, Apr 18, 2011 at 11:23 AM, Alex Deucher alexdeuc...@gmail.com wrote:
 On Sun, Apr 17, 2011 at 10:09 AM, Joerg Roedel j...@8bytes.org wrote:
 On Sat, Apr 16, 2011 at 02:54:04PM -0400, Jerome Glisse wrote:

 If you want to go the printk way you can add printk before each test
 ring_test, ib_test in r600.c this 2 functions are the own that might
 trigger the first GPU gart activities.

 Okay, I found the place in source that triggers this. It happens in the
 function r600_ib_test. The interesting thing is that not the ib-command
 itself is responsible but the fence that is emitted afterwards (proved
 by removing the fence command, where the problem went away).
 I don't know enough about the command semantics to make a guess what
 goes wrong there. But maybe you GPU folks have an idea?


 I can't think of anything off hand.  It might be worth disabling the
 call to r600_ib_test() in r600_init() and then seeing if you get any
 errors when the fences are used later on when X starts or just at that
 point in the module load sequence.  What's odd is that when you tested
 radeon.no_wb=1 you got the same behavior as that disables shadowing of
 fence writes to gpu gart mem, so it wouldn't be writing to memory in
 that case.

 Alex


It might be the irq ring write that is faulty.

Cheers,
Jerome


Re: Linux 2.6.39-rc3

2011-04-18 Thread Jerome Glisse
On Mon, Apr 18, 2011 at 11:33 AM, Alex Deucher alexdeuc...@gmail.com wrote:
 On Mon, Apr 18, 2011 at 11:29 AM, Jerome Glisse j.gli...@gmail.com wrote:
 On Mon, Apr 18, 2011 at 11:23 AM, Alex Deucher alexdeuc...@gmail.com wrote:
 On Sun, Apr 17, 2011 at 10:09 AM, Joerg Roedel j...@8bytes.org wrote:
 On Sat, Apr 16, 2011 at 02:54:04PM -0400, Jerome Glisse wrote:

 If you want to go the printk way you can add printk before each test
 ring_test, ib_test in r600.c this 2 functions are the own that might
 trigger the first GPU gart activities.

 Okay, I found the place in source that triggers this. It happens in the
 function r600_ib_test. The interesting thing is that not the ib-command
 itself is responsible but the fence that is emitted afterwards (proved
 by removing the fence command, where the problem went away).
 I don't know enough about the command semantics to make a guess what
 goes wrong there. But maybe you GPU folks have an idea?


 I can't think of anything off hand.  It might be worth disabling the
 call to r600_ib_test() in r600_init() and then seeing if you get any
 errors when the fences are used later on when X starts or just at that
 point in the module load sequence.  What's odd is that when you tested
 radeon.no_wb=1 you got the same behavior as that disables shadowing of
 fence writes to gpu gart mem, so it wouldn't be writing to memory in
 that case.

 Alex


 It might be the irq ring write that is faulty.

 That's disabled with no_wb=1 as well.

 Alex


I mean the IRQ (interrupt) ring; I don't see it being disabled when no_wb=1.

Cheers,
Jerome


Re: Linux 2.6.39-rc3

2011-04-18 Thread Alex Deucher
On Mon, Apr 18, 2011 at 11:59 AM, Jerome Glisse j.gli...@gmail.com wrote:
 On Mon, Apr 18, 2011 at 11:33 AM, Alex Deucher alexdeuc...@gmail.com wrote:
 On Mon, Apr 18, 2011 at 11:29 AM, Jerome Glisse j.gli...@gmail.com wrote:
 On Mon, Apr 18, 2011 at 11:23 AM, Alex Deucher alexdeuc...@gmail.com 
 wrote:
 On Sun, Apr 17, 2011 at 10:09 AM, Joerg Roedel j...@8bytes.org wrote:
 On Sat, Apr 16, 2011 at 02:54:04PM -0400, Jerome Glisse wrote:

 If you want to go the printk way you can add printk before each test
 ring_test, ib_test in r600.c this 2 functions are the own that might
 trigger the first GPU gart activities.

 Okay, I found the place in source that triggers this. It happens in the
 function r600_ib_test. The interesting thing is that not the ib-command
 itself is responsible but the fence that is emitted afterwards (proved
 by removing the fence command, where the problem went away).
 I don't know enough about the command semantics to make a guess what
 goes wrong there. But maybe you GPU folks have an idea?


 I can't think of anything off hand.  It might be worth disabling the
 call to r600_ib_test() in r600_init() and then seeing if you get any
 errors when the fences are used later on when X starts or just at that
 point in the module load sequence.  What's odd is that when you tested
 radeon.no_wb=1 you got the same behavior as that disables shadowing of
 fence writes to gpu gart mem, so it wouldn't be writing to memory in
 that case.

 Alex


 It might be the irq ring write that is faulty.

 That's disabled with no_wb=1 as well.

 Alex


 I mean the irq interrupt ring, i don't see this being disabled when no_wb=1

I meant the IH ring pointer writeback.  The IH ring itself is still in
GART memory.

Alex


 Cheers,
 Jerome



Re: Linux 2.6.39-rc3

2011-04-17 Thread Joerg Roedel
On Sat, Apr 16, 2011 at 02:54:04PM -0400, Jerome Glisse wrote:

 If you want to go the printk way you can add printk before each test
 ring_test, ib_test in r600.c this 2 functions are the own that might
 trigger the first GPU gart activities.

Okay, I found the place in the source that triggers this. It happens in the
function r600_ib_test. The interesting thing is that it is not the IB command
itself that is responsible but the fence that is emitted afterwards (confirmed
by removing the fence command, after which the problem went away).
I don't know enough about the command semantics to guess what
goes wrong there. But maybe you GPU folks have an idea?

Joerg



Re: Linux 2.6.39-rc3

2011-04-17 Thread Jerome Glisse
On Sun, Apr 17, 2011 at 10:09 AM, Joerg Roedel j...@8bytes.org wrote:
 On Sat, Apr 16, 2011 at 02:54:04PM -0400, Jerome Glisse wrote:

 If you want to go the printk way you can add printk before each test
 ring_test, ib_test in r600.c this 2 functions are the own that might
 trigger the first GPU gart activities.

 Okay, I found the place in source that triggers this. It happens in the
 function r600_ib_test. The interesting thing is that not the ib-command
 itself is responsible but the fence that is emitted afterwards (proved
 by removing the fence command, where the problem went away).
 I don't know enough about the command semantics to make a guess what
 goes wrong there. But maybe you GPU folks have an idea?

        Joerg



I can't think of any theory; at that point the wb, irq ring, cp buffer
& ib pool are all allocated and pinned into gtt, so they all have a valid
entry backed by a real page. Maybe the GART flush & update is
seriously buggy, but I expect we would have been hurt sooner by such
things. Maybe there is a bug in the hw... wouldn't be surprised. Will
try to think of a crazy theory.

Cheers,
Jerome


Re: Linux 2.6.39-rc3

2011-04-16 Thread Joerg Roedel
On Fri, Apr 15, 2011 at 09:06:41PM +0200, Ingo Molnar wrote:
 
 * Alexandre Demers alexandre.f.dem...@gmail.com wrote:
 
  On 11-04-15 10:27 AM, Joerg Roedel wrote:
   On Fri, Apr 15, 2011 at 10:16:59AM -0400, Alexandre Demers wrote:
   Ok, I'll test it today. Should I apply it on a clean rc3 without any of
   the other patches?
   Yes, apply it just on -rc3 without any other patch.
  
   BTW, may I suggest adding the info under bug 33012 in kernel bugzilla?
   This could be useful in the future.
   Cool, thanks
  
  
 Joerg
  The patch was applied and tested. It looks fine, I'm able to boot
  without problem.
 
 Joerg, mind submitting it with a changelog that includes everything we 
 learned 
 about this bug and all the Tested-by's in place?

Looks like I am too late, it is already applied. But the changelog
contains a link to the korg-bugzilla which has all information too. So
the information is not lost.

Joerg



Re: Linux 2.6.39-rc3

2011-04-16 Thread Joerg Roedel
On Fri, Apr 15, 2011 at 12:18:02PM -0700, Yinghai Lu wrote:
 On 04/15/2011 12:06 PM, Ingo Molnar wrote:
 
  
  Joerg, mind submitting it with a changelog that includes everything we 
  learned 
  about this bug and all the Tested-by's in place?
  
  Is anyone of the opinion that we should try to revert the allocation 
  order/alignment changes in addition to this fix?
 
 We should figure out what is written to 0xa0001000 (main memory) by GPU 
 before internal GART is setup.
 
 Joerg,
 can you insert some dump code in the drm/radon code to find out which
 function cause the problem?

I am not a GPU expert, but I will see what I can find out.

Joerg



Re: Linux 2.6.39-rc3

2011-04-16 Thread Ingo Molnar

* Joerg Roedel j...@8bytes.org wrote:

 On Fri, Apr 15, 2011 at 09:06:41PM +0200, Ingo Molnar wrote:
  
  * Alexandre Demers alexandre.f.dem...@gmail.com wrote:
  
   On 11-04-15 10:27 AM, Joerg Roedel wrote:
On Fri, Apr 15, 2011 at 10:16:59AM -0400, Alexandre Demers wrote:
Ok, I'll test it today. Should I apply it on a clean rc3 without any of
the other patches?
Yes, apply it just on -rc3 without any other patch.
   
BTW, may I suggest adding the info under bug 33012 in kernel bugzilla?
This could be useful in the future.
Cool, thanks
   
   
Joerg
   The patch was applied and tested. It looks fine, I'm able to boot
   without problem.
  
  Joerg, mind submitting it with a changelog that includes everything we 
  learned 
  about this bug and all the Tested-by's in place?
 
 Looks like I am too late, it is already applied. But the changelog
 contains a link to the korg-bugzilla which has all information too. So
 the information is not lost.

Yeah. In this case getting the fix into -rc4 in a timely manner looked more 
important than waiting for an updated changelog :-)

Thanks,

Ingo


Re: Linux 2.6.39-rc3

2011-04-16 Thread Joerg Roedel
On Fri, Apr 15, 2011 at 12:11:28PM -0400, Jerome Glisse wrote:
 Do you also got the write if you load radeon with radeon.no_wb=1 ?
 I think at this address it's the wb page, or maybe the cp as wb likely
 take only one page

radeon.no_wb=1 makes no difference. The box still reboots.

Joerg



Re: Linux 2.6.39-rc3

2011-04-16 Thread Jerome Glisse
On Sat, Apr 16, 2011 at 12:35 PM, Joerg Roedel j...@8bytes.org wrote:
 On Fri, Apr 15, 2011 at 12:11:28PM -0400, Jerome Glisse wrote:
 Do you also got the write if you load radeon with radeon.no_wb=1 ?
 I think at this address it's the wb page, or maybe the cp as wb likely
 take only one page

 radeon.no_wb=1 makes no difference. The box still reboots.

        Joerg



If you want to go the printk way, you can add a printk before each test
(ring_test, ib_test) in r600.c; these 2 functions are the ones that might
trigger the first GPU GART activity.
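
The approach described above can be sketched in user space. This is only an
illustration: stub_ring_test() and stub_ib_test() are hypothetical stand-ins
for the driver's self-tests, and fprintf()/fflush() stand in for printk(). The
idea is that the last marker seen before the machine resets points at the step
that triggered it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Number of markers emitted so far (only used to observe the sketch). */
int trace_count;

/* Print a marker and flush it, so it is visible even if the next step
 * never returns. */
void trace_mark(const char *what)
{
	assert(what != NULL);
	trace_count++;
	fprintf(stderr, "about to run: %s\n", what);
	fflush(stderr);
}

/* Emit a marker naming the call, then perform the call. */
#define TRACE_STEP(call) (trace_mark(#call), (call))

/* Hypothetical stand-ins for the ring test and the IB test. */
int stub_ring_test(void) { return 0; }
int stub_ib_test(void)   { return 0; }

/* Run both tests in order, stopping at the first failure. */
int run_selftests(void)
{
	int r = TRACE_STEP(stub_ring_test());

	if (r)
		return r;
	return TRACE_STEP(stub_ib_test());
}
```

In the real driver the markers would go directly in front of the calls in
r600.c; the macro is just a convenience for this sketch.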

Cheers,
Jerome


Re: Linux 2.6.39-rc3

2011-04-15 Thread Joerg Roedel
On Thu, Apr 14, 2011 at 05:34:46PM -0400, Alex Deucher wrote:
 On Thu, Apr 14, 2011 at 5:09 PM, Joerg Roedel j...@8bytes.org wrote:

  Actually, the nb gart is part of the cpu. It is part of the cpu north
  bridge and can translate io and cpu accesses. In fact, it is a remapper
  of physical memory addresses.
 
 I know what it's for.  In the IGP graphics chip is also part of the
 north bridge, but it may not be related at all.

Okay, just wanted to make clear that it is part of the CPU and not of
the chipset :)

  The problem seems to be related to specific gpu chips. On another
  notebook with an hd3000 card gtt and the nb gart aperture are both on
  0xa000 too but the box works fine. I haven't tested with an hd5000
  yet. The failing notebook has an hd4200 mobility.
 
 What exact model is the hd3000?   Is it IGP GPU or a discrete GPU?  If
 it's an IGP, it's identical to the hd4200 programming-wise.

It is an IGP card, an 

ATI Technologies Inc RS780M/RS780MN [Radeon HD 3200 Graphics]

according to lspci.

  Btw. what happens if the gpu accesses an unmapped address in the gtt
  range?
 
 It's redirected to a dummy page.

So there should be no issue there either; this is a very weird bug.

Joerg



Re: Linux 2.6.39-rc3

2011-04-15 Thread Michel Dänzer
On Don, 2011-04-14 at 23:09 +0200, Joerg Roedel wrote: 
 On Thu, Apr 14, 2011 at 10:28:43AM -0400, Alex Deucher wrote:
  On Thu, Apr 14, 2011 at 4:56 AM, Joerg Roedel j...@8bytes.org wrote:
   And this makes a difference, with this change on-top of -rc3 the box boots
   fine. So there seems to be some dependency between the GART base and the 
   GTT
   base even when they are in different address spaces.
  
   Alex, can you comment on this?
  
  As Dave said, they are completely different addresses spaces.  You
  could put the GPU aperture at 0 if you wanted (in fact we do on some
  chips).  Perhaps there's some strange interaction with the nb gart
  since the nb gart on that chipset was designed to be used for graphics
  and the rs780/880 can be configured to use an agp aperture.
  Unfortunately, I'm not that familiar with the nb gart.
 
 Actually, the nb gart is part of the cpu. It is part of the cpu north
 bridge and can translate io and cpu accesses. In fact, it is a remapper
 of physical memory addresses.
 
 The problem seems to be related to specific gpu chips. On another
 notebook with an hd3000 card gtt and the nb gart aperture are both on
 0xa000 too but the box works fine.

Wasn't the working theory that the problem occurs if those two values
aren't the same?


-- 
Earthling Michel Dänzer   |http://www.vmware.com
Libre software enthusiast |  Debian, X and DRI developer


Re: Linux 2.6.39-rc3

2011-04-15 Thread Joerg Roedel
On Fri, Apr 15, 2011 at 10:26:34AM +0200, Michel Dänzer wrote:
 On Don, 2011-04-14 at 23:09 +0200, Joerg Roedel wrote: 
  On Thu, Apr 14, 2011 at 10:28:43AM -0400, Alex Deucher wrote:
   On Thu, Apr 14, 2011 at 4:56 AM, Joerg Roedel j...@8bytes.org wrote:
And this makes a difference, with this change on-top of -rc3 the box 
boots
fine. So there seems to be some dependency between the GART base and 
the GTT
base even when they are in different address spaces.
   
Alex, can you comment on this?
   
   As Dave said, they are completely different addresses spaces.  You
   could put the GPU aperture at 0 if you wanted (in fact we do on some
   chips).  Perhaps there's some strange interaction with the nb gart
   since the nb gart on that chipset was designed to be used for graphics
   and the rs780/880 can be configured to use an agp aperture.
   Unfortunately, I'm not that familiar with the nb gart.
  
  Actually, the nb gart is part of the cpu. It is part of the cpu north
  bridge and can translate io and cpu accesses. In fact, it is a remapper
  of physical memory addresses.
  
  The problem seems to be related to specific gpu chips. On another
  notebook with an hd3000 card gtt and the nb gart aperture are both on
  0xa000 too but the box works fine.
 
 Wasn't the working theory that the problem occurs if those two values
 aren't the same?

Yes it is, but this doesn't seem to be problematic on all radeon GPU
chips.

Joerg



Re: Linux 2.6.39-rc3

2011-04-15 Thread Joerg Roedel
On Wed, Apr 13, 2011 at 07:33:40PM -0700, Linus Torvalds wrote:
  we definitely want to also understand the reason for things not
 working, even if we do revert..

Okay, here it is.

After experimenting with different configurations for the north-bridge,
it turned out that a GART related MCE fires at the time the machine
reboots. BIOSes configure the machine to sync-flood in that case, which
causes a reboot.

After decoding the MCE it turned out to be a GART TBL Wlk Error. Such
errors can happen if devices (speculatively) access GART ranges that are
mapped invalid. The AMD BKDG for Fam10h CPUs recommends disabling these
errors altogether. But unfortunately some BIOSes (including the one on my
laptop) forget to do this.

Below is a patch which disables these errors if the BIOS didn't do it.
It fixes the problem on my side.

Alexandre, can you try this patch on your machine too, please?

Regards,

Joerg

From aaacff8db50b6ed4345e337ecbe53e505699c7e5 Mon Sep 17 00:00:00 2001
From: Joerg Roedel joerg.roe...@amd.com
Date: Fri, 15 Apr 2011 14:47:40 +0200
Subject: [PATCH] x86/amd: Disable GartTlbWlkErr when BIOS forgets it

This patch disables GartTlbWlk errors on AMD Fam10h CPUs if
the BIOS forgets to do it (or is just too old). Leaving
these errors enabled can cause a sync-flood on the CPU
causing a reboot.

This patch is the fix for

https://bugzilla.kernel.org/show_bug.cgi?id=33012

on my machine.

Signed-off-by: Joerg Roedel joerg.roe...@amd.com
---
 arch/x86/include/asm/msr-index.h |    4 ++++
 arch/x86/kernel/cpu/amd.c        |   19 +++++++++++++++++++
 2 files changed, 23 insertions(+), 0 deletions(-)

diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index fd5a1f3..3cce714 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -96,11 +96,15 @@
 #define MSR_IA32_MC0_ADDR		0x0402
 #define MSR_IA32_MC0_MISC		0x0403
 
+#define MSR_AMD64_MC0_MASK		0xc0010044
+
 #define MSR_IA32_MCx_CTL(x)		(MSR_IA32_MC0_CTL + 4*(x))
 #define MSR_IA32_MCx_STATUS(x)		(MSR_IA32_MC0_STATUS + 4*(x))
 #define MSR_IA32_MCx_ADDR(x)		(MSR_IA32_MC0_ADDR + 4*(x))
 #define MSR_IA32_MCx_MISC(x)		(MSR_IA32_MC0_MISC + 4*(x))
 
+#define MSR_AMD64_MCx_MASK(x)		(MSR_AMD64_MC0_MASK + (x))
+
 /* These are consecutive and not in the normal 4er MCE bank block */
 #define MSR_IA32_MC0_CTL2		0x0280
 #define MSR_IA32_MCx_CTL2(x)		(MSR_IA32_MC0_CTL2 + (x))
diff --git a/arch/x86/kernel/cpu/amd.c b/arch/x86/kernel/cpu/amd.c
index 3ecece0..3532d3b 100644
--- a/arch/x86/kernel/cpu/amd.c
+++ b/arch/x86/kernel/cpu/amd.c
@@ -615,6 +615,25 @@ static void __cpuinit init_amd(struct cpuinfo_x86 *c)
 	/* As a rule processors have APIC timer running in deep C states */
 	if (c->x86 >= 0xf && !cpu_has_amd_erratum(amd_erratum_400))
 		set_cpu_cap(c, X86_FEATURE_ARAT);
+
+	/*
+	 * Disable GART TLB Walk Errors on Fam10h. We do this here
+	 * because this is always needed when GART is enabled, even in a
+	 * kernel which has no MCE support built in.
+	 */
+	if (c->x86 == 0x10) {
+		/*
+		 * BIOS should disable GartTlbWlk Errors themself. If
+		 * it doesn't do it here as suggested by the BKDG.
+		 *
+		 * Fixes: https://bugzilla.kernel.org/show_bug.cgi?id=33012
+		 */
+		u64 mask;
+
+		rdmsrl(MSR_AMD64_MCx_MASK(4), mask);
+		mask |= (1 << 10);
+		wrmsrl(MSR_AMD64_MCx_MASK(4), mask);
+	}
 }
 
 #ifdef CONFIG_X86_32
-- 
1.7.1
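
The arithmetic the patch relies on can be checked in user space. This is a
sketch only: the MSR macros are copied from the patch, set_mask_bit() and
GART_WALK_ERR_BIT are names made up for illustration, and nothing here touches
a real MSR (the kernel does that with rdmsrl()/wrmsrl()).

```c
#include <assert.h>
#include <stdint.h>

/* Copied from the patch: per-bank mask MSRs are consecutive. */
#define MSR_AMD64_MC0_MASK	0xc0010044u
#define MSR_AMD64_MCx_MASK(x)	(MSR_AMD64_MC0_MASK + (x))

/* Bit the patch sets in the bank 4 mask; the macro name is hypothetical. */
#define GART_WALK_ERR_BIT	10

/* Return mask with the given bit set; other bits are left untouched. */
uint64_t set_mask_bit(uint64_t mask, unsigned int bit)
{
	assert(bit < 64);
	return mask | (1ULL << bit);
}
```

Bank 4 therefore maps to MSR 0xc0010048, and setting bit 10 ORs 0x400 into
whatever the BIOS left in the register, so the operation is harmless if the
BIOS already set the bit.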



Re: Linux 2.6.39-rc3

2011-04-15 Thread Ingo Molnar

* Joerg Roedel j...@8bytes.org wrote:

 On Wed, Apr 13, 2011 at 07:33:40PM -0700, Linus Torvalds wrote:
   we definitely want to also understand the reason for things not
  working, even if we do revert..
 
 Okay, here it is.
 
 After experimenting with different configurations for the north-bridge
 it turned out that a GART related MCE fires at the time the machine
 reboots. BIOSes configure the machine to sync-flood in that case which
 causes a reboot.
 
 After decoding the MCE it turned out to be a GART TBL Wlk Error. Such
 errors can happen if devices (speculativly) access GART ranges mapped
 invalid. The AMD BKDG for Fam10h CPUs recommends to disable these errors
 at all. But unfortunatly some BIOSes (including the one on my laptop)
 forget to do this.
 
 Below is a patch which disables these errors if the BIOS didn't do it.
 It fixes the problem on my site.

Ok, but how did the allocation changes start triggering this error in 
v2.6.39-rc1? There must still be some layout specific thing here, right?
Do we understand the details of that as well?

Thanks,

Ingo


Re: Linux 2.6.39-rc3

2011-04-15 Thread Joerg Roedel
On Fri, Apr 15, 2011 at 04:04:45PM +0200, Andreas Herrmann wrote:
 What about tagging this patch for stable/longterm releases?
 
 Potentially there are other cases where certain combinations of
 hardware(GPUs)/drivers/whatsoever might trigger a GartTlbWlkErr. If
 the BIOS doesn't follow the BKDG recommendation to mask these errors,
 the system will hang/reboot. Thus I think having this quirk in .32 and
 .38 (at least) is useful.

Right, that's certainly a good idea. The problem is not specific to GPUs;
any other device can trigger this too.

Joerg



Re: Linux 2.6.39-rc3

2011-04-15 Thread Joerg Roedel
On Fri, Apr 15, 2011 at 03:16:50PM +0200, Ingo Molnar wrote:
 Ok, but how did the allocation changes start triggering this error in 
 v2.6.39-rc1? There must still be some layout specific thing here, right?
 Do we understand the details of that as well?

Well, thinking again about this, the GPU likely generated this DMA
request before too (which has an address in the range configured for the
GTT on the card), but nobody noticed because they just hit main memory.
And with the allocation changes in 39-rc1 the GART aperture started to
be on the same address as the GTT (in their respective address spaces)
so that the DMA request hit the GART. This caused the MCE and the
sync-flood.
The open question is why the GPU generates a DMA request with an address
that is configured as the GTT base (+1 page) on the card.
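
The overlap described above comes down to a range check. A minimal sketch,
assuming 4 KiB pages: struct aperture and both helpers are hypothetical, and
the aperture size used in any example call is a guess, since the thread never
states it. The faulting address 0xa0001000 reported elsewhere in this thread
is described as the second page of the aperture, which puts the base at
0xa0000000 under that page-size assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PAGE_SIZE_4K 0x1000ULL	/* assumed page size */

/* Hypothetical description of an address window such as the GART aperture. */
struct aperture {
	uint64_t base;
	uint64_t size;
};

/* True if addr falls inside [base, base + size). */
bool aperture_contains(const struct aperture *ap, uint64_t addr)
{
	return addr >= ap->base && addr - ap->base < ap->size;
}

/* Zero-based page index of addr within the aperture. */
uint64_t aperture_page_index(const struct aperture *ap, uint64_t addr)
{
	assert(aperture_contains(ap, addr));
	return (addr - ap->base) / PAGE_SIZE_4K;
}
```

With a base of 0xa0000000, the address 0xa0001000 lands at page index 1, which
matches the "+1 page" observation. The same address would simply have hit
ordinary memory if the aperture had been placed elsewhere.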

Joerg



Re: Linux 2.6.39-rc3

2011-04-15 Thread Alex Deucher
On Fri, Apr 15, 2011 at 10:33 AM, Joerg Roedel j...@8bytes.org wrote:
 On Fri, Apr 15, 2011 at 03:16:50PM +0200, Ingo Molnar wrote:
 Ok, but how did the allocation changes start triggering this error in
 v2.6.39-rc1? There must still be some layout specific thing here, right?
 Do we understand the details of that as well?

 No, I must admit that I lack enough knowledge about the GPU hardware to
 make a guess at how this translation request happened. All I can tell you
 is the address that was reported in the MCE: 0xa0001000 (== the second
 page of the GART aperture).

 Maybe Alex can help here. Alex, may it be possible that the GPU
 generates DMA requests in the GTT area before the GTT is activated (or
 the activation is completed)? Or can you imagine any other reason?

It shouldn't.  The driver binds a dummy page to all entries in the
table at init time and whenever the actual pages are unbound.
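
The dummy-page scheme can be modelled with a toy table. This is not the radeon
driver's code: struct gart_table, GART_ENTRIES and the gart_* helpers are
hypothetical names for a simplified sketch in which every entry always points
at some valid page.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define GART_ENTRIES 8	/* hypothetical; tiny for illustration */

struct gart_table {
	uint64_t dummy_page;		/* page that idle entries point at */
	uint64_t entry[GART_ENTRIES];
};

/* At init time every entry is bound to the dummy page. */
void gart_init(struct gart_table *t, uint64_t dummy_page)
{
	t->dummy_page = dummy_page;
	for (size_t i = 0; i < GART_ENTRIES; i++)
		t->entry[i] = dummy_page;
}

/* Bind a real page to one entry. */
void gart_bind(struct gart_table *t, size_t idx, uint64_t page)
{
	assert(idx < GART_ENTRIES);
	t->entry[idx] = page;
}

/* Unbinding falls back to the dummy page instead of leaving a hole. */
void gart_unbind(struct gart_table *t, size_t idx)
{
	assert(idx < GART_ENTRIES);
	t->entry[idx] = t->dummy_page;
}

/* Look up the page an entry currently points at. */
uint64_t gart_translate(const struct gart_table *t, size_t idx)
{
	assert(idx < GART_ENTRIES);
	return t->entry[idx];
}
```

Under this model a stray access through the GPU's own table is redirected to
the dummy page rather than faulting, which is why the GPU-side table alone
should not explain the error.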

Alex


Re: Linux 2.6.39-rc3

2011-04-15 Thread Jerome Glisse
On Fri, Apr 15, 2011 at 11:46 AM, Joerg Roedel j...@8bytes.org wrote:
 On Fri, Apr 15, 2011 at 03:16:50PM +0200, Ingo Molnar wrote:
 Ok, but how did the allocation changes start triggering this error in
 v2.6.39-rc1? There must still be some layout specific thing here, right?
 Do we understand the details of that as well?

 Well, thinking again about this, the GPU likely generated this DMA
 request before too (which has an address in the range configured for the
 GTT on the card), but nobody noticed because they just hit main memory.
 And with the allocation changes in 39-rc1 the GART aperture started to
 be on the same address as the GTT (in their respective address spaces)
 so that the DMA request hit the GART. This caused the MCE and the
 sync-flood.
 The open question is why the GPU generates a DMA request with an address
 that is configured as the GTT base (+1 page) on the card.

        Joerg


Do you also get the write if you load radeon with radeon.no_wb=1?
I think at this address it's the wb page, or maybe the cp, as the wb
likely takes only one page.

Cheers,
Jerome


Re: Linux 2.6.39-rc3

2011-04-15 Thread Andreas Herrmann
On Fri, Apr 15, 2011 at 03:11:52PM +0200, Joerg Roedel wrote:
 On Wed, Apr 13, 2011 at 07:33:40PM -0700, Linus Torvalds wrote:
   we definitely want to also understand the reason for things not
  working, even if we do revert..
 
 Okay, here it is.
 
 After experimenting with different configurations for the north-bridge
 it turned out that a GART related MCE fires at the time the machine
 reboots. BIOSes configure the machine to sync-flood in that case which
 causes a reboot.
 
 After decoding the MCE it turned out to be a GART TBL Wlk Error. Such
 errors can happen if devices (speculativly) access GART ranges mapped
 invalid. The AMD BKDG for Fam10h CPUs recommends to disable these errors
 at all. But unfortunatly some BIOSes (including the one on my laptop)
 forget to do this.
 
 Below is a patch which disables these errors if the BIOS didn't do it.
 It fixes the problem on my site.
 
 Alexandre, can you try this patch on your machine too, please?
 
 Regards,
 
   Joerg
 
 From aaacff8db50b6ed4345e337ecbe53e505699c7e5 Mon Sep 17 00:00:00 2001
 From: Joerg Roedel joerg.roe...@amd.com
 Date: Fri, 15 Apr 2011 14:47:40 +0200
 Subject: [PATCH] x86/amd: Disable GartTlbWlkErr when BIOS forgets it
 
 This patch disables GartTlbWlk errors on AMD Fam10h CPUs if
 the BIOS forgets to do is (or is just too old). Letting
 these errors enabled can cause a sync-flood on the CPU
 causing a reboot.
 
 This patch is the fix for
 
   https://bugzilla.kernel.org/show_bug.cgi?id=33012
 
 on my machine.
 
 Signed-off-by: Joerg Roedel joerg.roe...@amd.com


Joerg,

What about tagging this patch for stable/longterm releases?

Potentially there are other cases where certain combinations of
hardware(GPUs)/drivers/whatsoever might trigger a GartTlbWlkErr. If
the BIOS doesn't follow the BKDG recommendation to mask these errors,
the system will hang/reboot. Thus I think having this quirk in .32 and
.38 (at least) is useful.


Andreas


Re: Linux 2.6.39-rc3

2011-04-15 Thread Andreas Herrmann
On Thu, Apr 14, 2011 at 05:34:46PM -0400, Alex Deucher wrote:
 On Thu, Apr 14, 2011 at 5:09 PM, Joerg Roedel j...@8bytes.org wrote:
  On Thu, Apr 14, 2011 at 10:28:43AM -0400, Alex Deucher wrote:
  On Thu, Apr 14, 2011 at 4:56 AM, Joerg Roedel j...@8bytes.org wrote:
   And this makes a difference, with this change on-top of -rc3 the box 
   boots
   fine. So there seems to be some dependency between the GART base and the 
   GTT
   base even when they are in different address spaces.
  
   Alex, can you comment on this?
 
  As Dave said, they are completely different addresses spaces.  You
  could put the GPU aperture at 0 if you wanted (in fact we do on some
  chips).  Perhaps there's some strange interaction with the nb gart
  since the nb gart on that chipset was designed to be used for graphics
  and the rs780/880 can be configured to use an agp aperture.
  Unfortunately, I'm not that familiar with the nb gart.
 
  Actually, the nb gart is part of the cpu. It is part of the cpu north
  bridge and can translate io and cpu accesses. In fact, it is a remapper
  of physical memory addresses.
 
 I know what it's for.  In the IGP graphics chip is also part of the
 north bridge, but it may not be related at all.
 
 
  The problem seems to be related to specific gpu chips. On another
  notebook with an hd3000 card gtt and the nb gart aperture are both on
  0xa000 too but the box works fine. I haven't tested with an hd5000
  yet. The failing notebook has an hd4200 mobility.
 
 What exact model is the hd3000?   Is it IGP GPU or a discrete GPU?  If
 it's an IGP, it's identical to the hd4200 programming-wise.

BTW, first of all the other notebook had a different CPU (it's family
0fh and Joerg's is family 10h). So different CPUs different GARTs
different issues ;-)

(Furthermore for CPU family 0fh reporting of GartTblWalk errors is
already switched off in arch/x86/kernel/cpu/mcheck/mce.c.)


Andreas


Re: Linux 2.6.39-rc3

2011-04-15 Thread Alexandre Demers
On 11-04-15 10:27 AM, Joerg Roedel wrote:
 On Fri, Apr 15, 2011 at 10:16:59AM -0400, Alexandre Demers wrote:
 Ok, I'll test it today. Should I apply it on a clean rc3 without any of
 the other patches?
 Yes, apply it just on -rc3 without any other patch.

 BTW, may I suggest adding the info under bug 33012 in kernel bugzilla?
 This could be useful in the future.
 Cool, thanks


   Joerg
The patch was applied and tested. It looks fine, I'm able to boot
without problem.

-- 
Alexandre Demers



Re: Linux 2.6.39-rc3

2011-04-15 Thread Ingo Molnar

* Alexandre Demers alexandre.f.dem...@gmail.com wrote:

 On 11-04-15 10:27 AM, Joerg Roedel wrote:
  On Fri, Apr 15, 2011 at 10:16:59AM -0400, Alexandre Demers wrote:
  Ok, I'll test it today. Should I apply it on a clean rc3 without any of
  the other patches?
  Yes, apply it just on -rc3 without any other patch.
 
  BTW, may I suggest adding the info under bug 33012 in kernel bugzilla?
  This could be useful in the future.
  Cool, thanks
 
 
  Joerg
 The patch was applied and tested. It looks fine, I'm able to boot
 without problem.

Joerg, mind submitting it with a changelog that includes everything we learned 
about this bug and all the Tested-by's in place?

Is anyone of the opinion that we should try to revert the allocation 
order/alignment changes in addition to this fix?

Thanks,

Ingo


Re: Linux 2.6.39-rc3

2011-04-15 Thread Yinghai Lu
On 04/15/2011 12:06 PM, Ingo Molnar wrote:

 
 Joerg, mind submitting it with a changelog that includes everything we 
 learned 
 about this bug and all the Tested-by's in place?
 
 Is anyone of the opinion that we should try to revert the allocation 
 order/alignment changes in addition to this fix?

We should figure out what is written to 0xa0001000 (main memory) by the GPU before 
the internal GART is set up.

Joerg,
can you insert some dump code in the drm/radeon code to find out which function 
causes the problem?

Thanks

Yinghai


Re: Linux 2.6.39-rc3

2011-04-14 Thread H. Peter Anvin
On 04/13/2011 07:07 PM, Dave Airlie wrote:

 Okay, staring at this, it definitely seems toxic to overlay the GART
 over memory areas reserved by the BIOS.  If I were to guess, I would say
 that the problem here seems to be that the kernel thinks it is
 overlaying 64 MiB of memory, but the actual GART is in fact 512 MiB in
 size -- 131072 CPU pages -- which now overlaps the BIOS reserved areas.

 Alex D., could you comment on the num cpu pages bit?
 
 These are not CPU addresses. I think we've stated that already. Not the
 droids.
 
 the num cpu pages is how many CPU pages would be needed to fill the GPU
 GTT, for those crazy cases where CPU pagesize != GPU pagesize.
 

OK, well, something is still weird.

-hpa

-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.
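
As a quick check of the page count discussed above: 131072 is simply the number of 4 KiB pages needed to cover a 512 MiB GTT. A minimal sketch, assuming 4 KiB pages on both the CPU and the GPU side; pages_to_cover is a hypothetical helper, not something from the radeon driver:

```c
#include <stdint.h>

/* Hypothetical helper: number of pages of page_size bytes needed to
 * cover an aperture of gtt_size bytes. Both values are assumed to be
 * powers of two with gtt_size >= page_size. */
static uint64_t pages_to_cover(uint64_t gtt_size, uint64_t page_size)
{
	return gtt_size / page_size;
}
```

For a 64 MiB aperture the same calculation gives 16384 pages, so the 131072 figure describes the GTT, not the 64 MiB aperture the kernel maps over RAM.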



Re: Linux 2.6.39-rc3

2011-04-14 Thread Alan Cox
On Wed, 13 Apr 2011 19:33:40 -0700
Linus Torvalds torva...@linux-foundation.org wrote:

 On Wednesday, April 13, 2011, Linus Torvalds
 torva...@linux-foundation.org wrote:
  On Wednesday, April 13, 2011, H. Peter Anvin h...@zytor.com wrote:
 
  Yes.  However, even if we *do* revert (and the time is running short on
  not reverting) I would like to understand this particular one, simply
  because I think it may very well be a problem that is manifesting itself
  in other ways on other systems.
 
  sorry, fingerfart. Anyway, I agree 100%.
 
  we definitely want to also understand the reason for things not
 working, even if we do revert..

Definitely because if it fails when the magic involves the GART base it
starts to sound like something may be hitting the wrong address space or
not flushing properly.



Re: Linux 2.6.39-rc3

2011-04-14 Thread Joerg Roedel
On Wed, Apr 13, 2011 at 06:58:46PM -0700, H. Peter Anvin wrote:
 On 04/13/2011 12:14 PM, Yinghai Lu wrote:
  
  so looks bios program wrong address to the radon card?
  
 
 Okay, staring at this, it definitely seems toxic to overlay the GART
 over memory areas reserved by the BIOS.  If I were to guess, I would say
 that the problem here seems to be that the kernel thinks it is
 overlaying 64 MiB of memory, but the actual GART is in fact 512 MiB in
 size -- 131072 CPU pages -- which now overlaps the BIOS reserved areas.
 
 Alex D., could you comment on the num cpu pages bit?

Okay, I tried the debug-patch from Yinghai (posted to the bugzilla):

--- a/drivers/gpu/drm/radeon/radeon_device.c
+++ b/drivers/gpu/drm/radeon/radeon_device.c
@@ -325,6 +325,8 @@ void radeon_gtt_location(struct radeon_device *rdev, struct radeon_mc *mc)
 			mc->gtt_size = size_bf;
 		}
 		mc->gtt_start = (mc->vram_start & ~mc->gtt_base_align) - mc->gtt_size;
+		if (mc->gtt_start == 0xa000)
+			mc->gtt_start = 0x8000;
 	} else {
 		if (mc->gtt_size > size_af) {
 			dev_warn(rdev->dev, "limiting GTT\n");

And this makes a difference, with this change on-top of -rc3 the box boots
fine. So there seems to be some dependency between the GART base and the GTT
base even when they are in different address spaces.

Alex, can you comment on this?

Regards,

Joerg
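
To see why the GTT ends up exactly at the address the debug patch tests for, here is a simplified user-space model of the placement logic. Only the lines visible in the diff context are taken from the patch; the gap computation (size_af/size_bf) and the else branch are a reconstruction and may not match the driver exactly. The struct and function names are hypothetical, and the full 32-bit addresses used in the usage note are assumed from the 320M/512M sizes in the boot log, since the archive shows them truncated.

```c
#include <stdint.h>

/* Simplified stand-in for the driver's struct radeon_mc (assumption). */
struct mc_model {
	uint64_t vram_start;
	uint64_t vram_end;
	uint64_t gtt_size;
	uint64_t gtt_base_align;	/* alignment mask, 0 = no extra alignment */
	uint64_t gtt_start;
};

/* Place the GTT in the larger gap of a 32-bit GPU address space:
 * either below VRAM or above it. */
static void gtt_location_model(struct mc_model *mc)
{
	uint64_t size_af = ((0xFFFFFFFFULL - mc->vram_end) + mc->gtt_base_align)
			   & ~mc->gtt_base_align;
	uint64_t size_bf = mc->vram_start & ~mc->gtt_base_align;

	if (size_bf > size_af) {
		if (mc->gtt_size > size_bf)
			mc->gtt_size = size_bf;	/* "limiting GTT" */
		mc->gtt_start = (mc->vram_start & ~mc->gtt_base_align) - mc->gtt_size;
	} else {
		if (mc->gtt_size > size_af)
			mc->gtt_size = size_af;	/* "limiting GTT" */
		mc->gtt_start = (mc->vram_end + 1 + mc->gtt_base_align)
				& ~mc->gtt_base_align;
	}
}
```

With VRAM assumed at 0xC0000000-0xD3FFFFFF and a 512M GTT, the gap below VRAM is the larger one, so the GTT is placed directly below VRAM at 0xA0000000. That is a GPU address, which happens to have the same numeric value as the relocated GART aperture in CPU physical space.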



Re: Linux 2.6.39-rc3

2011-04-14 Thread Dave Airlie
On Thu, Apr 14, 2011 at 6:56 PM, Joerg Roedel j...@8bytes.org wrote:
 On Wed, Apr 13, 2011 at 06:58:46PM -0700, H. Peter Anvin wrote:
 On 04/13/2011 12:14 PM, Yinghai Lu wrote:
 
  so looks bios program wrong address to the radon card?
 

 Okay, staring at this, it definitely seems toxic to overlay the GART
 over memory areas reserved by the BIOS.  If I were to guess, I would say
 that the problem here seems to be that the kernel thinks it is
 overlaying 64 MiB of memory, but the actual GART is in fact 512 MiB in
 size -- 131072 CPU pages -- which now overlaps the BIOS reserved areas.

 Alex D., could you comment on the num cpu pages bit?

 Okay, I tried the debug-patch from Yinghai (posted to the bugzilla):

 --- a/drivers/gpu/drm/radeon/radeon_device.c
 +++ b/drivers/gpu/drm/radeon/radeon_device.c
 @@ -325,6 +325,8 @@ void radeon_gtt_location(struct radeon_device *rdev, struct radeon_mc *mc)
  			mc->gtt_size = size_bf;
  		}
  		mc->gtt_start = (mc->vram_start & ~mc->gtt_base_align) - mc->gtt_size;
 +		if (mc->gtt_start == 0xa000)
 +			mc->gtt_start = 0x8000;
  	} else {
  		if (mc->gtt_size > size_af) {
  			dev_warn(rdev->dev, "limiting GTT\n");

 And this makes a difference, with this change on-top of -rc3 the box boots
 fine. So there seems to be some dependency between the GART base and the GTT
 base even when they are in different address spaces.

 Alex, can you comment on this?

Weird, either a hw bug or some access to the GTT is leaking out before
things are set up properly.

I think the RS780/880 docs are on the website, but generally the
address spaces are completely separate so anything getting through is
very unusual.

Dave.


Re: Linux 2.6.39-rc3

2011-04-14 Thread Joerg Roedel
On Thu, Apr 14, 2011 at 01:03:37PM +0900, Tejun Heo wrote:
 Hello,
 
 On Wed, Apr 13, 2011 at 07:33:40PM -0700, Linus Torvalds wrote:
  On Wednesday, April 13, 2011, Linus Torvalds
  torva...@linux-foundation.org wrote:
   On Wednesday, April 13, 2011, H. Peter Anvin h...@zytor.com wrote:
  
   Yes.  However, even if we *do* revert (and the time is running short on
   not reverting) I would like to understand this particular one, simply
   because I think it may very well be a problem that is manifesting itself
   in other ways on other systems.
  
   sorry, fingerfart. Anyway, I agree 100%.
  
   we definitely want to also understand the reason for things not
  working, even if we do revert..
 
 There were (and still are) places where memblock callers implemented
 ad-hoc top-down allocation by stepping down start limit until
 allocation succeeds.  Several of them have been removed since top-down
 became the default behavior, so simply reverting the commit is likely
 to cause subtle issues.  Maybe the best approach is introducing
 @topdown parameter and use it selectively for pure memory allocations.

Wouldn't it be better to provide a separate memblock allocation
function which operates top-down and use this one in the places that
need it? This way it wouldn't break code that relies on bottom-up.

Joerg
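
To illustrate the difference between the two allocation orders discussed here, a toy sketch with two separate lookup functions over a single free region. This is not the memblock API; the names, the region bounds, and the 512 MiB alignment in the usage note are assumptions chosen for illustration.

```c
#include <stdint.h>

/* Toy model of one free region [start, end); not the memblock API. */
struct free_region {
	uint64_t start;
	uint64_t end;
};

/* Bottom-up: lowest aligned base that fits, or 0 on failure.
 * align must be a power of two. */
static uint64_t find_bottom_up(const struct free_region *r,
			       uint64_t size, uint64_t align)
{
	uint64_t base = (r->start + align - 1) & ~(align - 1);

	return (base + size <= r->end) ? base : 0;
}

/* Top-down: highest aligned base that fits, or 0 on failure.
 * align must be a power of two. */
static uint64_t find_top_down(const struct free_region *r,
			      uint64_t size, uint64_t align)
{
	uint64_t base;

	if (r->end < size)
		return 0;
	base = (r->end - size) & ~(align - 1);
	return (base >= r->start) ? base : 0;
}
```

For a free region assumed to span 0x80000000 to 0xacb8d000, a 64 MiB request with 512 MiB alignment lands at 0x80000000 bottom-up and at 0xa0000000 top-down. Callers that depend on one order would call the matching function explicitly instead of relying on a global default.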



Re: Linux 2.6.39-rc3

2011-04-14 Thread Alex Deucher
On Thu, Apr 14, 2011 at 4:56 AM, Joerg Roedel j...@8bytes.org wrote:
 On Wed, Apr 13, 2011 at 06:58:46PM -0700, H. Peter Anvin wrote:
 On 04/13/2011 12:14 PM, Yinghai Lu wrote:
 
  so looks bios program wrong address to the radon card?
 

 Okay, staring at this, it definitely seems toxic to overlay the GART
 over memory areas reserved by the BIOS.  If I were to guess, I would say
 that the problem here seems to be that the kernel thinks it is
 overlaying 64 MiB of memory, but the actual GART is in fact 512 MiB in
 size -- 131072 CPU pages -- which now overlaps the BIOS reserved areas.

 Alex D., could you comment on the num cpu pages bit?

 Okay, I tried the debug-patch from Yinghai (posted to the bugzilla):

 --- a/drivers/gpu/drm/radeon/radeon_device.c
 +++ b/drivers/gpu/drm/radeon/radeon_device.c
 @@ -325,6 +325,8 @@ void radeon_gtt_location(struct radeon_device *rdev, struct radeon_mc *mc)
  			mc->gtt_size = size_bf;
  		}
  		mc->gtt_start = (mc->vram_start & ~mc->gtt_base_align) - mc->gtt_size;
 +		if (mc->gtt_start == 0xa000)
 +			mc->gtt_start = 0x8000;
  	} else {
  		if (mc->gtt_size > size_af) {
  			dev_warn(rdev->dev, "limiting GTT\n");

 And this makes a difference, with this change on-top of -rc3 the box boots
 fine. So there seems to be some dependency between the GART base and the GTT
 base even when they are in different address spaces.

 Alex, can you comment on this?

As Dave said, they are completely different address spaces.  You
could put the GPU aperture at 0 if you wanted (in fact we do on some
chips).  Perhaps there's some strange interaction with the nb gart
since the nb gart on that chipset was designed to be used for graphics
and the rs780/880 can be configured to use an agp aperture.
Unfortunately, I'm not that familiar with the nb gart.

Alex


 Regards,

        Joerg




Re: Linux 2.6.39-rc3

2011-04-14 Thread H. Peter Anvin
On 04/14/2011 02:11 AM, Ingo Molnar wrote:
 
 I'd strongly suggest we revert back to the old and proven allocation order, 
 as 
 long as it results in valid layouts. Even if we figure out this particular 
 GART/GTT assumption there might be a dozen others in other types of hardware.
 

Yes, but we might also be hiding a real bug which bites other hardware.
 We have found real and very serious bugs in the kernel this way before
-- things where drivers scribble over random memory and allocation order
exposed the failure in a predictable way, as opposed to random crashes.

-hpa

-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.



Re: Linux 2.6.39-rc3

2011-04-14 Thread Joerg Roedel
On Thu, Apr 14, 2011 at 10:28:43AM -0400, Alex Deucher wrote:
 On Thu, Apr 14, 2011 at 4:56 AM, Joerg Roedel j...@8bytes.org wrote:
  And this makes a difference, with this change on-top of -rc3 the box boots
  fine. So there seems to be some dependency between the GART base and the GTT
  base even when they are in different address spaces.
 
  Alex, can you comment on this?
 
 As Dave said, they are completely different addresses spaces.  You
 could put the GPU aperture at 0 if you wanted (in fact we do on some
 chips).  Perhaps there's some strange interaction with the nb gart
 since the nb gart on that chipset was designed to be used for graphics
 and the rs780/880 can be configured to use an agp aperture.
 Unfortunately, I'm not that familiar with the nb gart.

Actually, the nb gart is part of the cpu. It is part of the cpu north
bridge and can translate io and cpu accesses. In fact, it is a remapper
of physical memory addresses.

The problem seems to be related to specific gpu chips. On another
notebook with an hd3000 card gtt and the nb gart aperture are both on
0xa000 too but the box works fine. I haven't tested with an hd5000
yet. The failing notebook has an hd4200 mobility.

Btw. what happens if the gpu accesses an unmapped address in the gtt
range?

Regards,

Joerg



Re: Linux 2.6.39-rc3

2011-04-14 Thread Alex Deucher
On Thu, Apr 14, 2011 at 5:09 PM, Joerg Roedel j...@8bytes.org wrote:
 On Thu, Apr 14, 2011 at 10:28:43AM -0400, Alex Deucher wrote:
 On Thu, Apr 14, 2011 at 4:56 AM, Joerg Roedel j...@8bytes.org wrote:
  And this makes a difference, with this change on-top of -rc3 the box boots
  fine. So there seems to be some dependency between the GART base and the 
  GTT
  base even when they are in different address spaces.
 
  Alex, can you comment on this?

 As Dave said, they are completely different addresses spaces.  You
 could put the GPU aperture at 0 if you wanted (in fact we do on some
 chips).  Perhaps there's some strange interaction with the nb gart
 since the nb gart on that chipset was designed to be used for graphics
 and the rs780/880 can be configured to use an agp aperture.
 Unfortunately, I'm not that familiar with the nb gart.

 Actually, the nb gart is part of the cpu. It is part of the cpu north
 bridge and can translate io and cpu accesses. In fact, it is a remapper
 of physical memory addresses.

I know what it's for.  In the IGP case, the graphics chip is also part of the
north bridge, but it may not be related at all.


 The problem seems to be related to specific gpu chips. On another
 notebook with an hd3000 card gtt and the nb gart aperture are both on
 0xa000 too but the box works fine. I havn't tested with an hd5000
 yet. The failing notebook has an hd4200 mobility.

What exact model is the hd3000?   Is it an IGP GPU or a discrete GPU?  If
it's an IGP, it's identical to the hd4200 programming-wise.


 Btw. what happens if the gpu accesses an unmapped address in the gtt
 range?

It's redirected to a dummy page.

Alex
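
A toy model of what "redirected to a dummy page" means for a lookup in the GTT range. How the hardware or the driver actually implements the redirect is not described in this thread; the struct, the zero-means-unbound convention, and the 4 KiB page size are assumptions for illustration only.

```c
#include <stddef.h>
#include <stdint.h>

#define GTT_PAGE_SHIFT 12	/* assumed 4 KiB GPU pages */

/* Toy GART table: one entry per GPU page, 0 means "never bound". */
struct gart_model {
	uint64_t gtt_start;		/* aperture base in GPU address space */
	size_t num_pages;
	const uint64_t *entries;	/* bus address per page, 0 if unbound */
	uint64_t dummy_page;		/* fallback target for unbound pages */
};

/* Translate a GPU address inside the aperture to a bus address.
 * Unbound or out-of-range pages resolve to the dummy page. */
static uint64_t gart_translate(const struct gart_model *g, uint64_t gpu_addr)
{
	uint64_t idx = (gpu_addr - g->gtt_start) >> GTT_PAGE_SHIFT;
	uint64_t offset = gpu_addr & ((1ULL << GTT_PAGE_SHIFT) - 1);
	uint64_t page = g->dummy_page;

	if (idx < g->num_pages && g->entries[idx] != 0)
		page = g->entries[idx];
	return page + offset;
}
```

In this model a stray access never reaches an arbitrary bus address; it always ends up either in a bound page or in the dummy page.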


Re: Linux 2.6.39-rc3

2011-04-13 Thread Ingo Molnar

* Joerg Roedel j...@8bytes.org wrote:

   The problem does not happen with 2.6.38. I try to bisect this further 
   down to a commit. Alex, please let me know if you need any further 
   information.
  
  If you can bisect it, that would be great.  Thanks,
 
 Bisecting actually gave a very weird result. It points to
 
   d2137d5af4259f50c19addb8246a186c9ffac325
 
 which is a merge-commit in the x86 tree. Even more weird is that this
 notebook is the only machine with these symptoms, all my other boxes are
 fine.

 During the bisect I tested commits from Yinghai which were good. It seems 
 like the problem appeared with the merge.

There's a similar looking bug being debugged here:

  https://bugzilla.kernel.org/show_bug.cgi?id=33012

Could you please send the before/after bootlog (in particular all memory init 
messages included) and your .config?

 before:  f005fe12b90c: x86-64: Move out cleanup higmap [_brk_end, _end) out of init_memory_mapping()
  after:  d2137d5af425: Merge branch 'linus' into x86/bootmem

I've Cc:-ed more people who might have an idea about it.

Thanks,

Ingo


Re: Linux 2.6.39-rc3

2011-04-13 Thread Joerg Roedel
On Wed, Apr 13, 2011 at 08:46:09AM +0200, Ingo Molnar wrote:
 Could you please send the before/after bootlog (in particular all memory init 
 messages included) and your .config?
 
  before:  f005fe12b90c: x86-64: Move out cleanup higmap [_brk_end, _end) out 
 of init_memory_mapping()
   after:  d2137d5af425: Merge branch 'linus' into x86/bootmem
 
 I've Cc:-ed more people who might have an idea about it.

Okay, I have done some more bisecting and debugging today.

First of all, I bisected between v2.6.37-rc2..f005fe12b90c which were
only a couple of patches and merged v2.6.38-rc4 in at every step. There
was no failure found.
Then I tried this again, but this time I merged v2.6.38-rc5 at every
step and was successful. The bad commit in this branch turned out to be

1a4a678b12c84db9ae5dce424e0e97f0559bb57c

which is related to memblock.

Then I tried to find out which change between 2.6.38-rc4 and 2.6.38-rc5
is needed to trigger the failure, so I used f005fe12b90c as a base,
bisected between v2.6.38-rc4..v2.6.38-rc5 and merged every bisect step
into the base and tested. Here the bad commit turned out to be

e6d2e2b2b1e1455df16d68a78f4a3874c7b3ad20

which is related to gart. It turned out that the gart aperture on that
box is at a different position with these patches. Before it was at
0xa400 and now it is at 0xa000. It seems like this has something
to do with the root-cause.

Reverting commit 1a4a678b12c84db9ae5dce424e0e97f0559bb57c fixes the
problem btw. and booting with iommu=soft also works, but I have no idea
yet why the aperture at that address is a problem (with the patch
reverted the aperture lands at 0x8000).

I have put some debug-data online. There is my .config and two
dmesg-files for good (==2.6.39-rc3 + revert) and bad (==2.6.39-rc3)
I also created these dmesg-files again with memblock=debug, maybe that
helps to find the problem. The files are at

http://www.8bytes.org/~joro/debug/

Or maybe someone else has an idea about the issue...

Joerg



Re: Linux 2.6.39-rc3

2011-04-13 Thread H. Peter Anvin
On 04/13/2011 10:21 AM, Joerg Roedel wrote:
 
 First of all, I bisected between v2.6.37-rc2..f005fe12b90c which where
 only a couple of patches and merged v2.6.38-rc4 in at every step. There
 was no failure found.
 Then I tried this again, but this time I merged v2.6.38-rc5 at every
 step and was successful. The bad commit in this branch turned out to be
 
   1a4a678b12c84db9ae5dce424e0e97f0559bb57c
 
 which is related to memblock.
 
 Then I tried to find out which change between 2.6.38-rc4 and 2.6.38-rc5
 is needed to trigger the failure, so I used f005fe12b90c as a base,
 bisected between v2.6.38-rc4..v2.6.38-rc5 and merged every bisect step
 into the base and tested. Here the bad commit turned out to be
 
   e6d2e2b2b1e1455df16d68a78f4a3874c7b3ad20
 
 which is related to gart. It turned out that the gart aperture on that
 box is on another position with these patches. Before it was as
 0xa400 and now it is at 0xa000. It seems like this has something
 to do with the root-cause.
 
 Reverting commit 1a4a678b12c84db9ae5dce424e0e97f0559bb57c fixes the
 problem btw. and booting with iommu=soft also works, but I have no idea
 yet why the aperture at that address is a problem (with the patch
 reverted the aperture lands at 0x8000).
 

Does reverting e6d2e2b2b1e1455df16d68a78f4a3874c7b3ad20 solve the
problem for you?

1a4a678b12c84db9ae5dce424e0e97f0559bb57c is a memory-allocation-order
patch, and those have a nasty tendency to unmask bugs elsewhere in the
kernel.  However, e6d2e2b2b1e1455df16d68a78f4a3874c7b3ad20 looks
positively strange (and it doesn't exactly help that the description is
written in Yinghai-ese and is therefore nearly impossible to decode,
never mind tell if it is remotely correct.)

-hpa




Re: Linux 2.6.39-rc3

2011-04-13 Thread Yinghai Lu
On 04/13/2011 10:21 AM, Joerg Roedel wrote:
 On Wed, Apr 13, 2011 at 08:46:09AM +0200, Ingo Molnar wrote:
 First of all, I bisected between v2.6.37-rc2..f005fe12b90c which where
 only a couple of patches and merged v2.6.38-rc4 in at every step. There
 was no failure found.
 Then I tried this again, but this time I merged v2.6.38-rc5 at every
 step and was successful. The bad commit in this branch turned out to be
 
   1a4a678b12c84db9ae5dce424e0e97f0559bb57c
 
 which is related to memblock.
 
 Then I tried to find out which change between 2.6.38-rc4 and 2.6.38-rc5
 is needed to trigger the failure, so I used f005fe12b90c as a base,
 bisected between v2.6.38-rc4..v2.6.38-rc5 and merged every bisect step
 into the base and tested. Here the bad commit turned out to be
 
   e6d2e2b2b1e1455df16d68a78f4a3874c7b3ad20
 
 which is related to gart. It turned out that the gart aperture on that
 box is on another position with these patches. Before it was as
 0xa400 and now it is at 0xa000. It seems like this has something
 to do with the root-cause.
 
 Reverting commit 1a4a678b12c84db9ae5dce424e0e97f0559bb57c fixes the
 problem btw. and booting with iommu=soft also works, but I have no idea
 yet why the aperture at that address is a problem (with the patch
 reverted the aperture lands at 0x8000).
 
 I have put some debug-data online. There is my .config and two
 dmesg-files for good (==2.6.39-rc3 + revert) and bad (==2.6.39-rc3)
 I also created these dmesg-files again with memblock=debug, maybe that
 helps to find the problem. The files are at
 
   http://www.8bytes.org/~joro/debug/

thanks for the bisecting...

so those two patches uncover some problems.

[0.00] Checking aperture...
[0.00] No AGP bridge found
[0.00] Node 0: aperture @ a000 size 32 MB
[0.00] Aperture pointing to e820 RAM. Ignoring.
[0.00] Your BIOS doesn't leave a aperture memory hole
[0.00] Please enable the IOMMU option in the BIOS setup
[0.00] This costs you 64 MB of RAM
[0.00] memblock_x86_reserve_range: [0xa000-0xa3ff] aperture64
[0.00] Mapping aperture over 65536 KB of RAM @ a000

So the kernel tries to reallocate the aperture, because the one the BIOS
allocated points to RAM or its size is too small.

But your radeon does use [0xa000, 0xbfff)

[4.281993] radeon :01:05.0: VRAM: 320M 0xC000 - 0xD3FF (320M used)
[4.290672] radeon :01:05.0: GTT: 512M 0xA000 - 0xBFFF
[4.298550] [drm] Detected VRAM RAM=320M, BAR=256M
[4.309857] [drm] RAM width 32bits DDR
[4.313748] [TTM] Zone  kernel: Available graphics memory: 1896524 kiB.
[4.320379] [TTM] Initializing pool allocator.
[4.324948] [drm] radeon: 320M of VRAM memory ready
[4.329832] [drm] radeon: 512M of GTT memory ready.

and the one that seems to be working:

[0.00] Checking aperture...
[0.00] No AGP bridge found
[0.00] Node 0: aperture @ a000 size 32 MB
[0.00] Aperture pointing to e820 RAM. Ignoring.
[0.00] Your BIOS doesn't leave a aperture memory hole
[0.00] Please enable the IOMMU option in the BIOS setup
[0.00] This costs you 64 MB of RAM
[0.00] memblock_x86_reserve_range: [0x8000-0x83ff] aperture64
[0.00] Mapping aperture over 65536 KB of RAM @ 8000
[0.00] memblock_x86_reserve_range: [0xacb6bdc0-0xacb6bddf] BOOTMEM

It will use a different position...

[4.250159] radeon :01:05.0: VRAM: 320M 0xC000 - 0xD3FF (320M used)
[4.258830] radeon :01:05.0: GTT: 512M 0xA000 - 0xBFFF
[4.266742] [drm] Detected VRAM RAM=320M, BAR=256M
[4.271549] [drm] RAM width 32bits DDR
[4.275435] [TTM] Zone  kernel: Available graphics memory: 1896526 kiB.
[4.282066] [TTM] Initializing pool allocator.
[4.282085] usb 7-2: new full speed USB device number 2 using ohci_hcd
[4.293076] [drm] radeon: 320M of VRAM memory ready
[4.298277] [drm] radeon: 512M of GTT memory ready.
[4.303218] [drm] Supports vblank timestamp caching Rev 1 (10.10.2010).
[4.309854] [drm] Driver supports precise vblank timestamp query.
[4.315970] [drm] radeon: irq initialized.
[4.320094] [drm] GART: num cpu pages 131072, num gpu pages 131072

So the question is why radeon is using the address range [0xa000 - 0xc00], which
is RAM in E820:

[0.00]  BIOS-e820: 0010 - acb8d000 (usable)
[0.00]  BIOS-e820: acb8d000 - acb8f000 (reserved)
[0.00]  BIOS-e820: acb8f000 - afce9000 (usable)
[0.00]  BIOS-e820: afce9000 - afd21000 (reserved)
[0.00]  BIOS-e820: afd21000 - afd4f000 (usable)
[0.00]  BIOS-e820: afd4f000 - afdcf000 (reserved)
[0.00]  BIOS-e820: afdcf000 - afecf000 (ACPI NVS)
[  

Re: Linux 2.6.39-rc3

2011-04-13 Thread H. Peter Anvin
On 04/13/2011 10:21 AM, Joerg Roedel wrote:
 On Wed, Apr 13, 2011 at 08:46:09AM +0200, Ingo Molnar wrote:
 Could you please send the before/after bootlog (in particular all memory 
 init 
 messages included) and your .config?

  before:  f005fe12b90c: x86-64: Move out cleanup higmap [_brk_end, _end) out 
 of init_memory_mapping()
   after:  d2137d5af425: Merge branch 'linus' into x86/bootmem

 I've Cc:-ed more people who might have an idea about it.
 
 Okay, I have done some more bisecting and debugging today.
 

First of all, *huge* thanks for this effort.  At least we need to track
down the bits that need to be reverted -- it is past rc3, and it's time
to see what we should revert and tell the submitter to try again next cycle.

This looks to be the same issue as in bugzilla 33012:

https://bugzilla.kernel.org/show_bug.cgi?id=33012

... so it would be good if we could keep the information in there.

-hpa


Re: Linux 2.6.39-rc3

2011-04-13 Thread Joerg Roedel
On Wed, Apr 13, 2011 at 11:51:39AM -0700, H. Peter Anvin wrote:
 On 04/13/2011 10:21 AM, Joerg Roedel wrote:
  
  First of all, I bisected between v2.6.37-rc2..f005fe12b90c which where
  only a couple of patches and merged v2.6.38-rc4 in at every step. There
  was no failure found.
  Then I tried this again, but this time I merged v2.6.38-rc5 at every
  step and was successful. The bad commit in this branch turned out to be
  
  1a4a678b12c84db9ae5dce424e0e97f0559bb57c
  
  which is related to memblock.
  
  Then I tried to find out which change between 2.6.38-rc4 and 2.6.38-rc5
  is needed to trigger the failure, so I used f005fe12b90c as a base,
  bisected between v2.6.38-rc4..v2.6.38-rc5 and merged every bisect step
  into the base and tested. Here the bad commit turned out to be
  
  e6d2e2b2b1e1455df16d68a78f4a3874c7b3ad20
  
  which is related to gart. It turned out that the gart aperture on that
  box is on another position with these patches. Before it was as
  0xa400 and now it is at 0xa000. It seems like this has something
  to do with the root-cause.
  
  Reverting commit 1a4a678b12c84db9ae5dce424e0e97f0559bb57c fixes the
  problem btw. and booting with iommu=soft also works, but I have no idea
  yet why the aperture at that address is a problem (with the patch
  reverted the aperture lands at 0x8000).
  
 
 Does reverting e6d2e2b2b1e1455df16d68a78f4a3874c7b3ad20 solve the
 problem for you?

No, reverting that patch doesn't make the problem go away (and the gart
aperture is still at 0xa000). I tested this on 39-rc3; I haven't
tested whether it makes a difference on the original bisect commit from Ingo,
probably it does (don't know if that matters).
What is strange about this commit is that it fixes an x86 gart aperture
allocation bug in generic memblock code.

 1a4a678b12c84db9ae5dce424e0e97f0559bb57c is a memory-allocation-order
 patch, which have a nasty tendency to unmask bugs elsewhere in the
 kernel.  However, e6d2e2b2b1e1455df16d68a78f4a3874c7b3ad20 looks
 positively strange (and it doesn't exactly help that the description is
 written in Yinghai-ese and is therefore nearly impossible to decode,
 never mind tell if it is remotely correct.)

I think that the two commits are okay and the bug is somewhere else, but
I have no idea yet where to look next. I spent some time looking at
radeon code and talking to Alex about it (because it seemed suspicious
that the GTT is at 0xa000 too, but as Alex explained to me this is an
address in the GPU address space and shouldn't matter).

Regards,

   Joerg



Re: Linux 2.6.39-rc3

2011-04-13 Thread Joerg Roedel
On Wed, Apr 13, 2011 at 11:39:29AM -0700, H. Peter Anvin wrote:
 On 04/13/2011 10:21 AM, Joerg Roedel wrote:
  On Wed, Apr 13, 2011 at 08:46:09AM +0200, Ingo Molnar wrote:
  Could you please send the before/after bootlog (in particular all memory 
  init 
  messages included) and your .config?
 
   before:  f005fe12b90c: x86-64: Move out cleanup higmap [_brk_end, _end) 
  out of init_memory_mapping()
after:  d2137d5af425: Merge branch 'linus' into x86/bootmem
 
  I've Cc:-ed more people who might have an idea about it.
  
  Okay, I have done some more bisecting and debugging today.
  
 
 First of all, *huge* thanks for this effort.  At least we need to track
 down the bits that need to be reverted -- it is past rc3, and it's time
 to see what we should revert and tell the submitter to try again next cycle.
 
 This looks to be the same issue as in bugzilla 33012:
 
   https://bugzilla.kernel.org/show_bug.cgi?id=33012
 
 ... so it would be good if we could keep the information in there.

Yes, I will try to find my korg bugzilla account again and drop the
information from this email there.

Joerg



Re: Linux 2.6.39-rc3

2011-04-13 Thread Joerg Roedel
On Wed, Apr 13, 2011 at 12:14:55PM -0700, Yinghai Lu wrote:
 thanks for the bisecting...
 
 so those two patches uncover some problems.
 
 [0.00] Checking aperture...
 [0.00] No AGP bridge found
 [0.00] Node 0: aperture @ a000 size 32 MB
 [0.00] Aperture pointing to e820 RAM. Ignoring.
 [0.00] Your BIOS doesn't leave a aperture memory hole
 [0.00] Please enable the IOMMU option in the BIOS setup
 [0.00] This costs you 64 MB of RAM
 [0.00] memblock_x86_reserve_range: [0xa000-0xa3ff]   
 aperture64
 [0.00] Mapping aperture over 65536 KB of RAM @ a000
 
 so kernel try to reallocate apperture. because BIOS allocated is pointed to 
 RAM or size is too small.

It is actually beyond 4GB on that machine, this value read here is from
the previous kernel-boot. The BIOS does not reset these values on a
reboot.

 but your radeon does use [0xa000, 0xbfff)

Yes, I suspected that too (and spent a few hours reading radeon code),
but then I talked to Alex Deucher and he explained that these addresses
which the driver prints for GTT and VRAM are in the GPU address space
and do not refer to system RAM. So this shouldn't be the problem.

Joerg
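
The argument above can be put as a small check: two ranges only conflict if they live in the same address space and their intervals intersect. A minimal sketch; the enum labels, struct, and function name are made up for illustration, and the full addresses in the usage note are assumed from the sizes in the boot log.

```c
#include <stdbool.h>
#include <stdint.h>

/* Which address space a range lives in (labels are illustrative). */
enum addr_space { SPACE_CPU_PHYS, SPACE_GPU };

struct addr_range {
	enum addr_space space;
	uint64_t start;
	uint64_t end;	/* inclusive */
};

/* Two ranges can only collide if they are in the same address space
 * and their [start, end] intervals intersect. */
static bool ranges_conflict(const struct addr_range *a,
			    const struct addr_range *b)
{
	if (a->space != b->space)
		return false;
	return a->start <= b->end && b->start <= a->end;
}
```

By this rule the 64 MB aperture at 0xA0000000 in CPU physical space does not conflict with the GTT at 0xA0000000-0xBFFFFFFF in GPU space, even though the numbers coincide. It does intersect the usable e820 RAM range, which is why the kernel reserves it.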



Re: Linux 2.6.39-rc3

2011-04-13 Thread Alex Deucher
On Wed, Apr 13, 2011 at 3:14 PM, Yinghai Lu ying...@kernel.org wrote:
 On 04/13/2011 10:21 AM, Joerg Roedel wrote:
 On Wed, Apr 13, 2011 at 08:46:09AM +0200, Ingo Molnar wrote:
 First of all, I bisected between v2.6.37-rc2..f005fe12b90c which where
 only a couple of patches and merged v2.6.38-rc4 in at every step. There
 was no failure found.
 Then I tried this again, but this time I merged v2.6.38-rc5 at every
 step and was successful. The bad commit in this branch turned out to be

       1a4a678b12c84db9ae5dce424e0e97f0559bb57c

 which is related to memblock.

 Then I tried to find out which change between 2.6.38-rc4 and 2.6.38-rc5
 is needed to trigger the failure, so I used f005fe12b90c as a base,
 bisected between v2.6.38-rc4..v2.6.38-rc5 and merged every bisect step
 into the base and tested. Here the bad commit turned out to be

       e6d2e2b2b1e1455df16d68a78f4a3874c7b3ad20

 which is related to gart. It turned out that the gart aperture on that
 box is at a different position with these patches. Before it was at
 0xa400 and now it is at 0xa000. It seems like this has something
 to do with the root cause.

 Reverting commit 1a4a678b12c84db9ae5dce424e0e97f0559bb57c fixes the
 problem btw. and booting with iommu=soft also works, but I have no idea
 yet why the aperture at that address is a problem (with the patch
 reverted the aperture lands at 0x8000).

 I have put some debug-data online. There is my .config and two
 dmesg-files for good (==2.6.39-rc3 + revert) and bad (==2.6.39-rc3)
 I also created these dmesg-files again with memblock=debug, maybe that
 helps to find the problem. The files are at

       http://www.8bytes.org/~joro/debug/

 thanks for the bisecting...

 so those two patches uncover some problems.

 [    0.00] Checking aperture...
 [    0.00] No AGP bridge found
 [    0.00] Node 0: aperture @ a000 size 32 MB
 [    0.00] Aperture pointing to e820 RAM. Ignoring.
 [    0.00] Your BIOS doesn't leave a aperture memory hole
 [    0.00] Please enable the IOMMU option in the BIOS setup
 [    0.00] This costs you 64 MB of RAM
 [    0.00]     memblock_x86_reserve_range: [0xa000-0xa3ff]       
 aperture64
 [    0.00] Mapping aperture over 65536 KB of RAM @ a000

 so kernel try to reallocate apperture. because BIOS allocated is pointed to 
 RAM or size is too small.

 but your radeon does use [0xa000, 0xbfff)

 [    4.281993] radeon :01:05.0: VRAM: 320M 0xC000 - 
 0xD3FF (320M used)
 [    4.290672] radeon :01:05.0: GTT: 512M 0xA000 - 
 0xBFFF
 [    4.298550] [drm] Detected VRAM RAM=320M, BAR=256M
 [    4.309857] [drm] RAM width 32bits DDR
 [    4.313748] [TTM] Zone  kernel: Available graphics memory: 1896524 kiB.
 [    4.320379] [TTM] Initializing pool allocator.
 [    4.324948] [drm] radeon: 320M of VRAM memory ready
 [    4.329832] [drm] radeon: 512M of GTT memory ready.

 and the one seems working:

 [    0.00] Checking aperture...
 [    0.00] No AGP bridge found
 [    0.00] Node 0: aperture @ a000 size 32 MB
 [    0.00] Aperture pointing to e820 RAM. Ignoring.
 [    0.00] Your BIOS doesn't leave a aperture memory hole
 [    0.00] Please enable the IOMMU option in the BIOS setup
 [    0.00] This costs you 64 MB of RAM
 [    0.00]     memblock_x86_reserve_range: [0x8000-0x83ff]       
 aperture64
 [    0.00] Mapping aperture over 65536 KB of RAM @ 8000
 [    0.00]     memblock_x86_reserve_range: [0xacb6bdc0-0xacb6bddf]        
   BOOTMEM

 will use different position...

 [    4.250159] radeon :01:05.0: VRAM: 320M 0xC000 - 
 0xD3FF (320M used)
 [    4.258830] radeon :01:05.0: GTT: 512M 0xA000 - 
 0xBFFF
 [    4.266742] [drm] Detected VRAM RAM=320M, BAR=256M
 [    4.271549] [drm] RAM width 32bits DDR
 [    4.275435] [TTM] Zone  kernel: Available graphics memory: 1896526 kiB.
 [    4.282066] [TTM] Initializing pool allocator.
 [    4.282085] usb 7-2: new full speed USB device number 2 using ohci_hcd
 [    4.293076] [drm] radeon: 320M of VRAM memory ready
 [    4.298277] [drm] radeon: 512M of GTT memory ready.
 [    4.303218] [drm] Supports vblank timestamp caching Rev 1 (10.10.2010).
 [    4.309854] [drm] Driver supports precise vblank timestamp query.
 [    4.315970] [drm] radeon: irq initialized.
 [    4.320094] [drm] GART: num cpu pages 131072, num gpu pages 131072

 So question is why radeon is using the address [0xa000 - 0xc00], and 
 in E820 it is RAM 

The VRAM and GTT addresses in the dmesg are internal GPU addresses not
system addresses.  The GPU has its own internal address space for
on-chip memory clients (texture samplers, render buffers, display
controllers, etc.).  The GPU sets up two apertures in its internal
address space and on-chip client requests are forwarded to the
appropriate place by the GPU's memory controller.  Addresses 

Re: Linux 2.6.39-rc3

2011-04-13 Thread Yinghai Lu
On 04/13/2011 12:34 PM, Joerg Roedel wrote:
 On Wed, Apr 13, 2011 at 12:14:55PM -0700, Yinghai Lu wrote:
 thanks for the bisecting...

 so those two patches uncover some problems.

 [0.00] Checking aperture...
 [0.00] No AGP bridge found
 [0.00] Node 0: aperture @ a000 size 32 MB
 [0.00] Aperture pointing to e820 RAM. Ignoring.
 [0.00] Your BIOS doesn't leave a aperture memory hole
 [0.00] Please enable the IOMMU option in the BIOS setup
 [0.00] This costs you 64 MB of RAM
 [0.00] memblock_x86_reserve_range: [0xa000-0xa3ff]   
 aperture64
 [0.00] Mapping aperture over 65536 KB of RAM @ a000

 so kernel try to reallocate apperture. because BIOS allocated is pointed to 
 RAM or size is too small.
 
 It is actually beyond 4GB on that machine, this value read here is from
 the previous kernel-boot. The BIOS does not reset these values on a
 reboot.
 
 but your radeon does use [0xa000, 0xbfff)
 
 Yes, I suspected that too (and spent a few hours reading radeon code),
 but then I talked to Alex Deucher and he explained that these addresses
 which the driver prints for GTT and VRAM are in the GPU address space
 and do not refer to system ram. So this shouldn't be the problem.


Can you try the following change? It will push the gart to 0x8000.

diff --git a/arch/x86/kernel/aperture_64.c b/arch/x86/kernel/aperture_64.c
index 86d1ad4..3b6a9d5 100644
--- a/arch/x86/kernel/aperture_64.c
+++ b/arch/x86/kernel/aperture_64.c
@@ -83,7 +83,7 @@ static u32 __init allocate_aperture(void)
 * so don't use 512M below as gart iommu, leave the space for kernel
 * code for safe
 */
-   addr = memblock_find_in_range(0, 1ULL<<32, aper_size, 512ULL<<20);
+   addr = memblock_find_in_range(0, 1ULL<<32, aper_size, 512ULL<<21);
    if (addr == MEMBLOCK_ERROR || addr + aper_size > 0x) {
        printk(KERN_ERR
            "Cannot allocate aperture memory hole (%lx,%uK)\n",


Re: Linux 2.6.39-rc3

2011-04-13 Thread Linus Torvalds
On Wed, Apr 13, 2011 at 1:48 PM, Yinghai Lu ying...@kernel.org wrote:

 can you try following change ? it will push gart to 0x8000

 diff --git a/arch/x86/kernel/aperture_64.c b/arch/x86/kernel/aperture_64.c
 index 86d1ad4..3b6a9d5 100644
 --- a/arch/x86/kernel/aperture_64.c
 +++ b/arch/x86/kernel/aperture_64.c
 @@ -83,7 +83,7 @@ static u32 __init allocate_aperture(void)
         * so don't use 512M below as gart iommu, leave the space for kernel
         * code for safe
         */
 -       addr = memblock_find_in_range(0, 1ULL<<32, aper_size, 512ULL<<20);
 +       addr = memblock_find_in_range(0, 1ULL<<32, aper_size, 512ULL<<21);

What are all the magic numbers, and why would 0x8000 be special?

Why don't we write code that just works?

Or absent a "just works" set of patches, why don't we revert to code
that has years of testing?

This kind of "I broke things, so now I will jiggle things randomly
until they unbreak" is not acceptable.

Either explain why that fixes a real BUG (and why the magic constants
need to be what they are), or just revert the patch that caused the
problem, and go back to the allocation patterns that have years of
experience.

Guys, we've had this discussion before, in PCI allocation. We don't do
this. We tried switching the PCI region allocations to top-down, and
IT WAS A FAILURE. We reverted it to what we had years of testing with.

Don't just make random changes. There really are only two acceptable
models of development: "think and analyze" or "years and years of
testing on thousands of machines". Those two really do work.

   Linus


Re: Linux 2.6.39-rc3

2011-04-13 Thread Yinghai Lu
On 04/13/2011 01:54 PM, Linus Torvalds wrote:
 On Wed, Apr 13, 2011 at 1:48 PM, Yinghai Lu ying...@kernel.org wrote:

 can you try following change ? it will push gart to 0x8000

 diff --git a/arch/x86/kernel/aperture_64.c b/arch/x86/kernel/aperture_64.c
 index 86d1ad4..3b6a9d5 100644
 --- a/arch/x86/kernel/aperture_64.c
 +++ b/arch/x86/kernel/aperture_64.c
 @@ -83,7 +83,7 @@ static u32 __init allocate_aperture(void)
 * so don't use 512M below as gart iommu, leave the space for kernel
 * code for safe
 */
 -   addr = memblock_find_in_range(0, 1ULL<<32, aper_size, 512ULL<<20);
 +   addr = memblock_find_in_range(0, 1ULL<<32, aper_size, 512ULL<<21);
 
 What are all the magic numbers, and why would 0x8000 be special?

that is the old value when kernel was doing bottom-up bootmem allocation.

 
 Why don't we write code that just works?
 
 Or absent a just works set of patches, why don't we revert to code
 that has years of testing?
 
 This kind of I broke things, so now I will jiggle things randomly
 until they unbreak is not acceptable.
 
 Either explain why that fixes a real BUG (and why the magic constants
 need to be what they are), or just revert the patch that caused the
 problem, and go back to the allocation patterns that have years of
 experience.
 
 Guys, we've had this discussion before, in PCI allocation. We don't do
 this. We tried switching the PCI region allocations to top-down, and
 IT WAS A FAILURE. We reverted it to what we had years of testing with.
 
 Don't just make random changes. There really are only two acceptable
 models of development: think and analyze or years and years of
 testing on thousands of machines. Those two really do work.

We did do the analysis, and the only difference seems to be:
the good one is using 0x8000
and the bad one is using 0xa000.

We are trying to figure out whether it needs a low address and only happened
to work because the kernel was doing bottom-up allocation.

Thanks

Yinghai


Re: Linux 2.6.39-rc3

2011-04-13 Thread Joerg Roedel
On Wed, Apr 13, 2011 at 01:48:48PM -0700, Yinghai Lu wrote:
 - addr = memblock_find_in_range(0, 1ULL<<32, aper_size, 512ULL<<20);
 + addr = memblock_find_in_range(0, 1ULL<<32, aper_size, 512ULL<<21);

Btw, while looking at this code I wondered why the 512M goal is enforced
by the alignment. Start could be set to 512M instead and the alignment
can be aper_size as it should. Any reason for such a big alignment?
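
To make the numbers concrete, here is a minimal userspace sketch of the
arithmetic, not the kernel's memblock code. It assumes a simplified top-down
search over a single free range that ignores reserved regions, that the
truncated addresses in the logs stand for 0xa0000000 and 0x80000000, and that
usable RAM below 4 GiB ends at 0xacb8d000 (taken from the e820 map quoted
elsewhere in this thread). The names align_down(), find_top_down(), MiB() and
NO_FIT are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define MiB(x)  ((uint64_t)(x) << 20)
#define NO_FIT  UINT64_MAX	/* hypothetical "nothing found" marker */

/* Round addr down to a multiple of align; align must be a power of two. */
uint64_t align_down(uint64_t addr, uint64_t align)
{
	assert(align != 0 && (align & (align - 1)) == 0);
	return addr & ~(align - 1);
}

/*
 * Toy top-down search in one free range [start, end): take the highest
 * address at which size bytes still fit, then round it down to align.
 * Assumption: no reserved regions and only one memory bank.
 */
uint64_t find_top_down(uint64_t start, uint64_t end,
		       uint64_t size, uint64_t align)
{
	uint64_t base;

	if (end < size || end - size < start)
		return NO_FIT;
	base = align_down(end - size, align);
	return base >= start ? base : NO_FIT;
}
```

In this model, find_top_down(0, 0xacb8d000, MiB(64), 512ULL << 20) gives
0xa0000000 and the same call with 512ULL << 21 gives 0x80000000, which would
match the two aperture positions in the logs if the assumptions hold. With
start set to MiB(512) and MiB(64) as alignment the result is 0xa8000000.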

Joerg

P.S.: The box is still in the office, I will try this debug-patch
  tomorrow.



Re: Linux 2.6.39-rc3

2011-04-13 Thread Yinghai Lu
On 04/13/2011 02:50 PM, Joerg Roedel wrote:
 On Wed, Apr 13, 2011 at 01:48:48PM -0700, Yinghai Lu wrote:
 -addr = memblock_find_in_range(0, 1ULL<<32, aper_size, 512ULL<<20);
 +addr = memblock_find_in_range(0, 1ULL<<32, aper_size, 512ULL<<21);
 
 Btw, while looking at this code I wondered why the 512M goal is enforced
 by the alignment. Start could be set to 512M instead and the alignment
 can be aper_size as it should. Any reason for such a big alignment?
 

When using bootmem, we used a big alignment (512M) so that we could avoid
taking a RAM range below 512M.

commit 7677b2ef6c0c4fddc84f6473f3863f40eb71821b
Author: Yinghai Lu yhlu.kernel.s...@gmail.com
Date:   Mon Apr 14 20:40:37 2008 -0700

x86_64: allocate gart aperture from 512M

because we try to reserve dma32 early, so we have chance to get aperture
from 64M.

with some sequence aperture allocated from RAM, could become E820_RESERVED.

and then if doing a kexec with a big kernel that uncompressed size is above
64M we could have a range conflict with still using gart.

So allocate gart aperture from 512M instead.

Also change the fallback_aper_order to 5, because we don't have chance to get
2G or 4G aperture.

We can change it back to 32M or make it equal to size.

 
 P.S.: The box is still in the office, I will try this debug-patch
   tomorrow.

Alexandre's system works at 0xa400 with 2.6.38.2.

So it is not a low-address problem. There could be another reason, for
example some other code could need a lower address.

Thanks

Yinghai


Re: Linux 2.6.39-rc3

2011-04-13 Thread H. Peter Anvin
On 04/13/2011 02:50 PM, Joerg Roedel wrote:
 On Wed, Apr 13, 2011 at 01:48:48PM -0700, Yinghai Lu wrote:
 -addr = memblock_find_in_range(0, 1ULL<<32, aper_size, 512ULL<<20);
 +addr = memblock_find_in_range(0, 1ULL<<32, aper_size, 512ULL<<21);
 
 Btw, while looking at this code I wondered why the 512M goal is enforced
 by the alignment. Start could be set to 512M instead and the alignment
 can be aper_size as it should. Any reason for such a big alignment?
 
   Joerg
 
 P.S.: The box is still in the office, I will try this debug-patch
   tomorrow.

The only reason that I can think of is that the aperture itself can be
huge, and perhaps 512 MiB is the biggest such known.  512ULL<<21 is of
course a particularly moronic way to write 1 GiB, but it was a debug patch.

The value 512 MiB apparently comes from
7677b2ef6c0c4fddc84f6473f3863f40eb71821b, which is apparently totally ad
hoc; effectively it tries to prevent a collision with kexec by
hardcoding the kdump allocation as it sat at that point in time in the
GART assignment rules.

Yeah.  Brilliant.

-hpa



Re: Linux 2.6.39-rc3

2011-04-13 Thread H. Peter Anvin
On 04/13/2011 02:59 PM, Yinghai Lu wrote:
 On 04/13/2011 02:50 PM, Joerg Roedel wrote:
 On Wed, Apr 13, 2011 at 01:48:48PM -0700, Yinghai Lu wrote:
 -   addr = memblock_find_in_range(0, 1ULL<<32, aper_size, 512ULL<<20);
 +   addr = memblock_find_in_range(0, 1ULL<<32, aper_size, 512ULL<<21);

 Btw, while looking at this code I wondered why the 512M goal is enforced
 by the alignment. Start could be set to 512M instead and the alignment
 can be aper_size as it should. Any reason for such a big alignment?

 
 when using bootmem, try to use big alignment (512M ), so we could avoid take 
 ram range below 512M.
 

Yes, his question was why on Earth are you using 0 as start if that is
the purpose.

On top of that, where the hell does the magic 512 MiB come from?  It
looks like it is either completely ad hoc, or it has something to do with
where the kexec kernel was allocated once upon a time.

-hpa


Re: Linux 2.6.39-rc3

2011-04-13 Thread Joerg Roedel
On Wed, Apr 13, 2011 at 03:01:10PM -0700, H. Peter Anvin wrote:
 On 04/13/2011 02:50 PM, Joerg Roedel wrote:
  On Wed, Apr 13, 2011 at 01:48:48PM -0700, Yinghai Lu wrote:
  -  addr = memblock_find_in_range(0, 1ULL<<32, aper_size, 512ULL<<20);
  +  addr = memblock_find_in_range(0, 1ULL<<32, aper_size, 512ULL<<21);
  
  Btw, while looking at this code I wondered why the 512M goal is enforced
  by the alignment. Start could be set to 512M instead and the alignment
  can be aper_size as it should. Any reason for such a big alignment?
  
  Joerg
  
  P.S.: The box is still in the office, I will try this debug-patch
tomorrow.
 
 The only reason that I can think of is that the aperture itself can be
 huge, and perhaps 512 MiB is the biggest such known. 

Well, that would work as well by just using aper_size as alignment, the
aperture needs to be aligned on its size anyway. This code only runs
when Linux allocates the aperture itself and, if I am not mistaken, it
always uses 64MB when doing this.

Joerg



Re: Linux 2.6.39-rc3

2011-04-13 Thread H. Peter Anvin
On 04/13/2011 03:22 PM, Joerg Roedel wrote:
 On Wed, Apr 13, 2011 at 03:01:10PM -0700, H. Peter Anvin wrote:
 On 04/13/2011 02:50 PM, Joerg Roedel wrote:
 On Wed, Apr 13, 2011 at 01:48:48PM -0700, Yinghai Lu wrote:
 -  addr = memblock_find_in_range(0, 1ULL<<32, aper_size, 512ULL<<20);
 +  addr = memblock_find_in_range(0, 1ULL<<32, aper_size, 512ULL<<21);

 Btw, while looking at this code I wondered why the 512M goal is enforced
 by the alignment. Start could be set to 512M instead and the alignment
 can be aper_size as it should. Any reason for such a big alignment?

 Joerg

 P.S.: The box is still in the office, I will try this debug-patch
   tomorrow.

 The only reason that I can think of is that the aperture itself can be
 huge, and perhaps 512 MiB is the biggest such known. 
 
 Well, that would work as well by just using aper_size as alignment, the
 aperture needs to be aligned on its size anyway. This code only runs
 when Linux allocates the aperture itself and, if I am not mistaken, it
 always uses 64MB when doing this.

Yes, I would agree with that.  The sane thing would be to set the base
to whatever address needs to be guarded against (WHICH SHOULD BE
MOTIVATED), and use aper_size as alignment, *unless* we are only using
the initial portion of a much larger hardware structure that needs
natural alignment (which isn't clear to me, I do know we sometimes use
only a fraction of the GART, but that doesn't mean we need to
naturally-align the entire thing, nor that 512 MiB is sufficient to do so.)

-hpa




Re: Linux 2.6.39-rc3

2011-04-13 Thread Linus Torvalds
On Wed, Apr 13, 2011 at 2:23 PM, Yinghai Lu ying...@kernel.org wrote:

 What are all the magic numbers, and why would 0x8000 be special?

 that is the old value when kernel was doing bottom-up bootmem allocation.

I understand, BUT THAT IS STILL A TOTALLY MAGIC NUMBER!

It makes it come out the same ON THAT ONE MACHINE.  So no, it's not
the old value. It's a random value that gets the old value in one
specific case.

 Why don't we write code that just works?

 Or absent a just works set of patches, why don't we revert to code
 that has years of testing?

 This kind of I broke things, so now I will jiggle things randomly
 until they unbreak is not acceptable.

 Either explain why that fixes a real BUG (and why the magic constants
 need to be what they are), or just revert the patch that caused the
 problem, and go back to the allocation patterns that have years of
 experience.

 Guys, we've had this discussion before, in PCI allocation. We don't do
 this. We tried switching the PCI region allocations to top-down, and
 IT WAS A FAILURE. We reverted it to what we had years of testing with.

 Don't just make random changes. There really are only two acceptable
 models of development: think and analyze or years and years of
 testing on thousands of machines. Those two really do work.

 We did do the analyzing, and only difference seems to be:

No.

Yinghai, we have had this discussion before, and dammit, you need to
understand the difference between "understanding the problem" and "put
in random values until it works on one machine".

There was absolutely _zero_ analysis done. You do not actually
understand WHY the numbers matter. You just look at two random
numbers, and one works, the other does not. That's not analyzing.
That's just random number games.

If you cannot see and understand the difference between an actual
analytical solution where you _understand_ what the code is doing and
why, and random numbers that happen to work on one machine, I don't
know what to tell you.

 good one is using 0x8000
 and bad one is using 0xa000.

 We try to figure out if it needs low address and it happen to work
 because kernel was doing bottom up allocation.

No.

Let me repeat my point one more time.

You have TWO choices. Not more, not less:

 - choice #1: go back to the old allocation model. It's tested. It
doesn't regress. Admittedly we may not know exactly _why_ it works,
and it might not work on all machines, but it doesn't cause
regressions (ie the machines it doesn't work on it _never_ worked on).

   And this doesn't mean old value for that _one_ machine. It means
old value for _every_ machine. So it means we revert the whole
bottom-down thing entirely. Not just change one random number so that
the totally different allocation pattern happens to give the same
result on one particular machine.

   Quite frankly, I don't see the point of doing top-to-bottom anyway,
so I think we should do this regardless. Just revert the whole
"allocate from top". It didn't work for PCI, it's not working for this
case either. Stop doing it.

 - Choice #2: understand exactly _what_ goes wrong, and fix it
analytically (ie by _understanding_ the problem, and being able to
solve it exactly, and in a way you can argue about without having to
resort to "magic happens").

Now, the whole analytic approach (aka "computer sciency" approach),
where you can actually think about the problem without having any
pesky reality impact the solution is obviously the one we tend to
prefer. Sadly, it's seldom the one we can use in reality when it comes
to things like resource allocation, since we end up starting off with
often buggy approximations of what the actual hardware is all about
(ie broken firmware tables).

So I'd love to know exactly why one random number works, and why
another one doesn't. But as long as we do _not_ know the Why of it,
we will have to revert.

It really is that simple. It's _always_ that simple.

So the numbers shouldn't be magic, they should have real
explanations. And in the absence of real explanation, the model that
works is "this is what we've always done". Including, very much, the
whole allocation order. Not just one random number on one random
machine.

Linus


Re: Linux 2.6.39-rc3

2011-04-13 Thread Yinghai Lu
On 04/13/2011 04:39 PM, Linus Torvalds wrote:
 On Wed, Apr 13, 2011 at 2:23 PM, Yinghai Lu ying...@kernel.org wrote:

 What are all the magic numbers, and why would 0x8000 be special?

 that is the old value when kernel was doing bottom-up bootmem allocation.
 
 I understand, BUT THAT IS STILL A TOTALLY MAGIC NUMBER!
 
 It makes it come out the same ON THAT ONE MACHINE.  So no, it's not
 the old value. It's a random value that gets the old value in one
 specific case.

Alexandre's system works with 2.6.38.2, where the kernel allocates from 0xa400.
Joerg's system works with 2.6.39-rc3 when the top-down bootmem patch
1a4a678b12c84db9ae5dce424e0e97f0559bb57c is reverted
and the kernel allocates at 0x8000.
Alexandre's system works when the alignment is increased to 1g, which makes
the kernel allocate 0x8000 for the gart.

They do not work if the kernel allocates from 0xa000.

The 0xa000 looks like the same value as the radeon GTT.


[4.250159] radeon :01:05.0: VRAM: 320M 0xC000 - 
0xD3FF (320M used)
[4.258830] radeon :01:05.0: GTT: 512M 0xA000 - 
0xBFFF
[4.266742] [drm] Detected VRAM RAM=320M, BAR=256M
[4.271549] [drm] RAM width 32bits DDR
[4.275435] [TTM] Zone  kernel: Available graphics memory: 1896526 kiB.
[4.282066] [TTM] Initializing pool allocator.
[4.282085] usb 7-2: new full speed USB device number 2 using ohci_hcd
[4.293076] [drm] radeon: 320M of VRAM memory ready
[4.298277] [drm] radeon: 512M of GTT memory ready.
[4.303218] [drm] Supports vblank timestamp caching Rev 1 (10.10.2010).
[4.309854] [drm] Driver supports precise vblank timestamp query.
[4.315970] [drm] radeon: irq initialized.
[4.320094] [drm] GART: num cpu pages 131072, num gpu pages 131072

Alex said that 0xa000 is ok and is from GPU address space
---
The VRAM and GTT addresses in the dmesg are internal GPU addresses not
system addresses.  The GPU has its own internal address space for
on-chip memory clients (texture samplers, render buffers, display
controllers, etc.).  The GPU sets up two apertures in its internal
address space and on-chip client requests are forwarded to the
appropriate place by the GPU's memory controller.  Addresses in the
GPU's VRAM aperture go to local vram on discrete cards, or to the
stolen memory at the top of system memory for IGP cards.  Addresses in
the GPU's GTT aperture hit a page table and get forwarded to the
appropriate dma pages.
---
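
The routing described in the quoted explanation can be pictured with a small
toy model. This is only a sketch, not radeon driver code: the type and
function names, the flat one-level page table and the 4 KiB page size are all
assumptions made for illustration, and the aperture bases in the usage note
assume that the truncated log values stand for 0xA0000000 (GTT) and
0xC0000000 (VRAM).

```c
#include <assert.h>
#include <stdint.h>

#define GPU_PAGE_SIZE 4096ULL	/* assumed page size */

enum gpu_target { GPU_TARGET_NONE, GPU_TARGET_VRAM, GPU_TARGET_GTT };

/* One aperture in the GPU's internal address space. */
struct gpu_aperture {
	uint64_t base;
	uint64_t size;
};

/* True if gpu_addr falls inside [base, base + size). */
int in_aperture(struct gpu_aperture ap, uint64_t gpu_addr)
{
	return gpu_addr >= ap.base && gpu_addr - ap.base < ap.size;
}

/* Decide which aperture, if any, a GPU-internal address belongs to. */
enum gpu_target gpu_route(struct gpu_aperture vram, struct gpu_aperture gtt,
			  uint64_t gpu_addr)
{
	if (in_aperture(vram, gpu_addr))
		return GPU_TARGET_VRAM;
	if (in_aperture(gtt, gpu_addr))
		return GPU_TARGET_GTT;
	return GPU_TARGET_NONE;
}

/*
 * Translate a GPU address inside the GTT aperture into the bus address of
 * the backing DMA page, using a flat table with one entry per GPU page.
 */
uint64_t gtt_translate(struct gpu_aperture gtt, const uint64_t *page_table,
		       uint64_t gpu_addr)
{
	uint64_t offset;

	assert(in_aperture(gtt, gpu_addr));
	offset = gpu_addr - gtt.base;
	return page_table[offset / GPU_PAGE_SIZE] + offset % GPU_PAGE_SIZE;
}
```

The point of the model is that a GPU address such as 0xA0001010 is only an
index into the GTT page table; the system address it ends up at depends on
the table entry, not on the numeric value of the GPU address.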

 
 Why don't we write code that just works?

 Or absent a just works set of patches, why don't we revert to code
 that has years of testing?

 This kind of I broke things, so now I will jiggle things randomly
 until they unbreak is not acceptable.

 Either explain why that fixes a real BUG (and why the magic constants
 need to be what they are), or just revert the patch that caused the
 problem, and go back to the allocation patterns that have years of
 experience.

 Guys, we've had this discussion before, in PCI allocation. We don't do
 this. We tried switching the PCI region allocations to top-down, and
 IT WAS A FAILURE. We reverted it to what we had years of testing with.

 Don't just make random changes. There really are only two acceptable
 models of development: think and analyze or years and years of
 testing on thousands of machines. Those two really do work.

 We did do the analyzing, and only difference seems to be:
 
 No.
 
 Yinghai, we have had this discussion before, and dammit, you need to
 understand the difference between understanding the problem and put
 in random values until it works on one machine.
 
 There was absolutely _zero_ analysis done. You do not actually
 understand WHY the numbers matter. You just look at two random
 numbers, and one works, the other does not. That's not analyzing.
 That's just random number games.
 
 If you cannot see and understand the difference between an actual
 analytical solution where you _understand_ what the code is doing and
 why, and random numbers that happen to work on one machine, I don't
 know what to tell you.
 
 good one is using 0x8000
 and bad one is using 0xa000.

 We try to figure out if it needs low address and it happen to work
 because kernel was doing bottom up allocation.
 
 No.
 
 Let me repeat my point one more time.
 
 You have TWO choices. Not more, not less:
 
  - choice #1: go back to the old allocation model. It's tested. It
 doesn't regress. Admittedly we may not know exactly _why_ it works,
 and it might not work on all machines, but it doesn't cause
 regressions (ie the machines it doesn't work on it _never_ worked on).
 
And this doesn't mean old value for that _one_ machine. It means
 old value for _every_ machine. So it means we revert the whole
 bottom-down thing entirely. Not just change one random number so that
 the totally different allocation pattern happens to give the same
 result on one particular machine.
 
Quite frankly, I don't see the point of doing 

Re: Linux 2.6.39-rc3

2011-04-13 Thread H. Peter Anvin
On 04/13/2011 12:14 PM, Yinghai Lu wrote:
 
 so those two patches uncover some problems.
 
 [0.00] Checking aperture...
 [0.00] No AGP bridge found
 [0.00] Node 0: aperture @ a000 size 32 MB
 [0.00] Aperture pointing to e820 RAM. Ignoring.
 [0.00] Your BIOS doesn't leave a aperture memory hole
 [0.00] Please enable the IOMMU option in the BIOS setup
 [0.00] This costs you 64 MB of RAM
 [0.00] memblock_x86_reserve_range: [0xa000-0xa3ff]   
 aperture64
 [0.00] Mapping aperture over 65536 KB of RAM @ a000
 
 so kernel try to reallocate apperture. because BIOS allocated is pointed to 
 RAM or size is too small.
 
 but your radeon does use [0xa000, 0xbfff)
 
 [4.281993] radeon :01:05.0: VRAM: 320M 0xC000 - 
 0xD3FF (320M used)
 [4.290672] radeon :01:05.0: GTT: 512M 0xA000 - 
 0xBFFF
 [4.298550] [drm] Detected VRAM RAM=320M, BAR=256M
 [4.309857] [drm] RAM width 32bits DDR
 [4.313748] [TTM] Zone  kernel: Available graphics memory: 1896524 kiB.
 [4.320379] [TTM] Initializing pool allocator.
 [4.324948] [drm] radeon: 320M of VRAM memory ready
 [4.329832] [drm] radeon: 512M of GTT memory ready.
 
 and the one seems working:
 
 [0.00] Checking aperture...
 [0.00] No AGP bridge found
 [0.00] Node 0: aperture @ a000 size 32 MB
 [0.00] Aperture pointing to e820 RAM. Ignoring.
 [0.00] Your BIOS doesn't leave a aperture memory hole
 [0.00] Please enable the IOMMU option in the BIOS setup
 [0.00] This costs you 64 MB of RAM
 [0.00] memblock_x86_reserve_range: [0x8000-0x83ff]   
 aperture64
 [0.00] Mapping aperture over 65536 KB of RAM @ 8000
 [0.00] memblock_x86_reserve_range: [0xacb6bdc0-0xacb6bddf]
   BOOTMEM
 
 will use different position...
 
 [4.250159] radeon :01:05.0: VRAM: 320M 0xC000 - 
 0xD3FF (320M used)
 [4.258830] radeon :01:05.0: GTT: 512M 0xA000 - 
 0xBFFF
 [4.266742] [drm] Detected VRAM RAM=320M, BAR=256M
 [4.271549] [drm] RAM width 32bits DDR
 [4.275435] [TTM] Zone  kernel: Available graphics memory: 1896526 kiB.
 [4.282066] [TTM] Initializing pool allocator.
 [4.282085] usb 7-2: new full speed USB device number 2 using ohci_hcd
 [4.293076] [drm] radeon: 320M of VRAM memory ready
 [4.298277] [drm] radeon: 512M of GTT memory ready.
 [4.303218] [drm] Supports vblank timestamp caching Rev 1 (10.10.2010).
 [4.309854] [drm] Driver supports precise vblank timestamp query.
 [4.315970] [drm] radeon: irq initialized.
 [4.320094] [drm] GART: num cpu pages 131072, num gpu pages 131072
 
 So question is why radeon is using the address [0xa000 - 0xc00], and 
 in E820 it is RAM 
 
 [0.00]  BIOS-e820: 0010 - acb8d000 (usable)
 [0.00]  BIOS-e820: acb8d000 - acb8f000 (reserved)
 [0.00]  BIOS-e820: acb8f000 - afce9000 (usable)
 [0.00]  BIOS-e820: afce9000 - afd21000 (reserved)
 [0.00]  BIOS-e820: afd21000 - afd4f000 (usable)
 [0.00]  BIOS-e820: afd4f000 - afdcf000 (reserved)
 [0.00]  BIOS-e820: afdcf000 - afecf000 (ACPI NVS)
 [0.00]  BIOS-e820: afecf000 - afeff000 (ACPI data)
 [0.00]  BIOS-e820: afeff000 - aff0 (usable)
 
 so it looks like the BIOS programs a wrong address into the radeon card?
 

Okay, staring at this, it definitely seems toxic to overlay the GART
over memory areas reserved by the BIOS.  If I were to guess, I would say
that the problem here seems to be that the kernel thinks it is
overlaying 64 MiB of memory, but the actual GART is in fact 512 MiB in
size -- 131072 CPU pages -- which now overlaps the BIOS reserved areas.
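
The arithmetic behind this guess can be checked with a short sketch. It
assumes 4 KiB pages, that the aperture starts at 0xa0000000 (the full form of
the truncated log value), and it uses the reserved e820 range acb8d000 -
acb8f000 from the map above. The names range_overlap() and gart_bytes() are
hypothetical. Note that this only illustrates the guess: according to the
explanation quoted elsewhere in this thread, the GTT range printed by the
driver is a GPU-internal address range, so the overlap may not exist in system
address space at all.

```c
#include <assert.h>
#include <stdint.h>

/* Half-open address range [start, end). */
struct range {
	uint64_t start;
	uint64_t end;
};

/* Number of bytes shared by two half-open ranges, 0 if they are disjoint. */
uint64_t range_overlap(struct range a, struct range b)
{
	uint64_t lo, hi;

	assert(a.start <= a.end && b.start <= b.end);
	lo = a.start > b.start ? a.start : b.start;
	hi = a.end < b.end ? a.end : b.end;
	return hi > lo ? hi - lo : 0;
}

/* Bytes covered by a GART that maps num_pages pages of page_size bytes. */
uint64_t gart_bytes(uint64_t num_pages, uint64_t page_size)
{
	return num_pages * page_size;
}
```

With these assumptions, gart_bytes(131072, 4096) is 512 MiB. A 64 MiB window
at 0xa0000000 ends at 0xa4000000 and stays clear of the reserved range, while
a 512 MiB window from the same base ends at 0xc0000000 and would cover it.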

Alex D., could you comment on the num cpu pages bit?

-hpa

-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.



Re: Linux 2.6.39-rc3

2011-04-13 Thread H. Peter Anvin
On 04/13/2011 04:39 PM, Linus Torvalds wrote:
 
  - Choice #2: understand exactly _what_ goes wrong, and fix it
 analytically (ie by _understanding_ the problem, and being able to
 solve it exactly, and in a way you can argue about without having to
 resort to magic happens).
 
 Now, the whole analytic approach (aka computer sciency approach),
 where you can actually think about the problem without having any
 pesky reality impact the solution is obviously the one we tend to
 prefer. Sadly, it's seldom the one we can use in reality when it comes
 to things like resource allocation, since we end up starting off with
 often buggy approximations of what the actual hardware is all about
 (ie broken firmware tables).
 
 So I'd love to know exactly why one random number works, and why
 another one doesn't. But as long as we do _not_ know the Why of it,
 we will have to revert.
 

Yes.  However, even if we *do* revert (and the time is running short on
not reverting) I would like to understand this particular one, simply
because I think it may very well be a problem that is manifesting itself
in other ways on other systems.

The other thing that this has uncovered is that we already have a bunch
of complete b*llsh*t magic numbers in this path, some of which are
trivially shown to be wrong or at least completely arbitrary, so there
are more issues here :(

-hpa

-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.


Re: Linux 2.6.39-rc3

2011-04-13 Thread Dave Airlie
On Wed, 2011-04-13 at 18:58 -0700, H. Peter Anvin wrote:
 On 04/13/2011 12:14 PM, Yinghai Lu wrote:
  
  so those two patches uncover some problems.
  
  [0.00] Checking aperture...
  [0.00] No AGP bridge found
  [0.00] Node 0: aperture @ a000 size 32 MB
  [0.00] Aperture pointing to e820 RAM. Ignoring.
  [0.00] Your BIOS doesn't leave a aperture memory hole
  [0.00] Please enable the IOMMU option in the BIOS setup
  [0.00] This costs you 64 MB of RAM
  [0.00] memblock_x86_reserve_range: [0xa000-0xa3ff]   aperture64
  [0.00] Mapping aperture over 65536 KB of RAM @ a000
  
  so the kernel tries to reallocate the aperture, because the one the BIOS
  allocated points to RAM or its size is too small.
  
  but your radeon does use [0xa000, 0xbfff)
  
  [4.281993] radeon :01:05.0: VRAM: 320M 0xC000 - 0xD3FF (320M used)
  [4.290672] radeon :01:05.0: GTT: 512M 0xA000 - 0xBFFF
  [4.298550] [drm] Detected VRAM RAM=320M, BAR=256M
  [4.309857] [drm] RAM width 32bits DDR
  [4.313748] [TTM] Zone  kernel: Available graphics memory: 1896524 kiB.
  [4.320379] [TTM] Initializing pool allocator.
  [4.324948] [drm] radeon: 320M of VRAM memory ready
  [4.329832] [drm] radeon: 512M of GTT memory ready.
  
  and the one that seems to work:
  
  [0.00] Checking aperture...
  [0.00] No AGP bridge found
  [0.00] Node 0: aperture @ a000 size 32 MB
  [0.00] Aperture pointing to e820 RAM. Ignoring.
  [0.00] Your BIOS doesn't leave a aperture memory hole
  [0.00] Please enable the IOMMU option in the BIOS setup
  [0.00] This costs you 64 MB of RAM
  [0.00] memblock_x86_reserve_range: [0x8000-0x83ff]   aperture64
  [0.00] Mapping aperture over 65536 KB of RAM @ 8000
  [0.00] memblock_x86_reserve_range: [0xacb6bdc0-0xacb6bddf]   BOOTMEM
  
  will use different position...
  
  [4.250159] radeon :01:05.0: VRAM: 320M 0xC000 - 0xD3FF (320M used)
  [4.258830] radeon :01:05.0: GTT: 512M 0xA000 - 0xBFFF
  [4.266742] [drm] Detected VRAM RAM=320M, BAR=256M
  [4.271549] [drm] RAM width 32bits DDR
  [4.275435] [TTM] Zone  kernel: Available graphics memory: 1896526 kiB.
  [4.282066] [TTM] Initializing pool allocator.
  [4.282085] usb 7-2: new full speed USB device number 2 using ohci_hcd
  [4.293076] [drm] radeon: 320M of VRAM memory ready
  [4.298277] [drm] radeon: 512M of GTT memory ready.
  [4.303218] [drm] Supports vblank timestamp caching Rev 1 (10.10.2010).
  [4.309854] [drm] Driver supports precise vblank timestamp query.
  [4.315970] [drm] radeon: irq initialized.
  [4.320094] [drm] GART: num cpu pages 131072, num gpu pages 131072
  
  So the question is why radeon is using the address range [0xa000 - 0xc00],
  which E820 reports as RAM
  
  [0.00]  BIOS-e820: 0010 - acb8d000 (usable)
  [0.00]  BIOS-e820: acb8d000 - acb8f000 (reserved)
  [0.00]  BIOS-e820: acb8f000 - afce9000 (usable)
  [0.00]  BIOS-e820: afce9000 - afd21000 (reserved)
  [0.00]  BIOS-e820: afd21000 - afd4f000 (usable)
  [0.00]  BIOS-e820: afd4f000 - afdcf000 (reserved)
  [0.00]  BIOS-e820: afdcf000 - afecf000 (ACPI NVS)
  [0.00]  BIOS-e820: afecf000 - afeff000 (ACPI data)
  [0.00]  BIOS-e820: afeff000 - aff0 (usable)
  
  so it looks like the BIOS programs the wrong address into the radeon card?
  
 
 Okay, staring at this, it definitely seems toxic to overlay the GART
 over memory areas reserved by the BIOS.  If I were to guess, I would say
 that the problem here seems to be that the kernel thinks it is
 overlaying 64 MiB of memory, but the actual GART is in fact 512 MiB in
 size -- 131072 CPU pages -- which now overlaps the BIOS reserved areas.
 
 Alex D., could you comment on the num cpu pages bit?

These are not CPU addresses. I think we've stated that already. Not the
droids.

The num cpu pages value is how many CPU pages would be needed to fill the
GPU GTT, for those crazy cases where CPU pagesize != GPU pagesize.
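
As a rough illustration of that arithmetic (a sketch, not the radeon
driver's code): the helper name pages_to_cover is made up here, and the
4 KiB page size is an assumption suggested by the two counts in the log
being equal. A 512 MiB GTT divided into 4 KiB pages gives the 131072
pages printed by the driver.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: how many pages of page_size bytes are needed to
 * cover an aperture of aperture_size bytes, rounding up.
 */
uint64_t pages_to_cover(uint64_t aperture_size, uint64_t page_size)
{
    assert(page_size != 0);
    return (aperture_size + page_size - 1) / page_size;
}
```

With a 512 MiB GTT, pages_to_cover(512ULL << 20, 4096) yields 131072 for
both sides when CPU and GPU pages are 4 KiB. With a larger CPU page
size, say 64 KiB, the CPU count would drop to 8192 while the GPU count
stays the same, which is the mismatch case described above.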

Dave.




Re: Linux 2.6.39-rc3

2011-04-13 Thread Linus Torvalds
On Wednesday, April 13, 2011, H. Peter Anvin h...@zytor.com wrote:

 Yes.  However, even if we *do* revert (and the time is running short on
 not reverting) I would like to understand this particular one, simply
 because I think it may very well be a problem that is manifesting itself
 in other ways on other systems.

 The other thing that this has uncovered is that we already have a bunch
 of complete b*llsh*t magic numbers in this


Re: Linux 2.6.39-rc3

2011-04-13 Thread Tejun Heo
Hello,

On Wed, Apr 13, 2011 at 07:33:40PM -0700, Linus Torvalds wrote:
 On Wednesday, April 13, 2011, Linus Torvalds
 torva...@linux-foundation.org wrote:
  On Wednesday, April 13, 2011, H. Peter Anvin h...@zytor.com wrote:
 
  Yes.  However, even if we *do* revert (and the time is running short on
  not reverting) I would like to understand this particular one, simply
  because I think it may very well be a problem that is manifesting itself
  in other ways on other systems.
 
  sorry, fingerfart. Anyway, I agree 100%.
 
  we definitely want to also understand the reason for things not
 working, even if we do revert..

There were (and still are) places where memblock callers implemented
ad-hoc top-down allocation by stepping down the start limit until the
allocation succeeds.  Several of them have been removed since top-down
became the default behavior, so simply reverting the commit is likely
to cause subtle issues.  Maybe the best approach is to introduce a
@topdown parameter and use it selectively for pure memory allocations.
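
To make the stepping-down pattern concrete, here is a toy model in
plain C.  It is not the memblock API: struct region, find_bottom_up and
find_stepping_down are names invented for this sketch, and the free
list is assumed to be sorted by ascending base address.  The caller has
only a bottom-up finder, so it starts with a search window just below
the upper limit and lowers the start of that window until something
fits.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ALLOC_FAIL UINT64_MAX

/* One free range: [base, base + size). Hypothetical type for this sketch. */
struct region {
    uint64_t base;
    uint64_t size;
};

/*
 * Bottom-up first fit: lowest address in [start, end) where `size` bytes
 * fit inside a single free region, or ALLOC_FAIL.  Assumes free_list is
 * sorted by ascending base.
 */
uint64_t find_bottom_up(const struct region *free_list, size_t n,
                        uint64_t start, uint64_t end, uint64_t size)
{
    for (size_t i = 0; i < n; i++) {
        uint64_t r_end = free_list[i].base + free_list[i].size;
        uint64_t lo = free_list[i].base > start ? free_list[i].base : start;
        uint64_t hi = r_end < end ? r_end : end;

        if (hi > lo && hi - lo >= size)
            return lo;
    }
    return ALLOC_FAIL;
}

/*
 * Ad-hoc top-down built on the bottom-up finder: begin with the window
 * [end - size, end) and lower its start by `step` until a fit is found
 * or the start reaches zero.
 */
uint64_t find_stepping_down(const struct region *free_list, size_t n,
                            uint64_t end, uint64_t size, uint64_t step)
{
    assert(step != 0 && size != 0 && size <= end);

    uint64_t start = end - size;
    for (;;) {
        uint64_t addr = find_bottom_up(free_list, n, start, end, size);

        if (addr != ALLOC_FAIL)
            return addr;
        if (start == 0)
            return ALLOC_FAIL;
        start = start > step ? start - step : 0;
    }
}
```

With free ranges at [0x1000, 0x2000) and [0x8000, 0xa000), a 0x1000-byte
request below 0x10000 lands at 0x9000 through the stepping loop, while
the plain bottom-up finder would return 0x1000.  Once the allocator
itself searches top-down, such loops become redundant, which is why
removing them and then reverting the default would change where these
callers end up.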

Thanks.

-- 
tejun


Re: Linux 2.6.39-rc3

2011-04-12 Thread Joerg Roedel
On Mon, Apr 11, 2011 at 05:40:11PM -0700, Linus Torvalds wrote:
 Let's hope the release cycle continues like this. I _like_ it when
 people really seem to follow the whole big changes during the merge
 window rules.

Sorry for disturbing the silence, but radeon seems to have issues. I
tested -rc3 (and after that -rc1 which also has the issue) on my Laptop
and it just reboots after (or while?) GFX initialization. The last lines
of dmesg are:

 Freeing unused kernel memory: 624k freed
 Write protecting the kernel read-only data: 8192k
 Freeing unused kernel memory: 1456k freed
 Freeing unused kernel memory: 16k freed
 udev: starting version 151
 udevd (62): /proc/62/oom_adj is deprecated, please use /proc/62/oom_score_adj instead.
 [drm] Initialized drm 1.1.0 20060810
 [drm] radeon defaulting to kernel modesetting.
 [drm] radeon kernel modesetting enabled.
 radeon :01:05.0: PCI INT A - GSI 18 (level, low) - IRQ 18
 [drm] initializing kernel modesetting (RS880 0x1002:0x9712).
 [drm] register mmio base: 0xD640
 [drm] register mmio size: 65536
 ATOM BIOS: HP_TAG
 radeon :01:05.0: VRAM: 320M 0xC000 - 0xD3FF (320M used)
 radeon :01:05.0: GTT: 512M 0xA000 - 0xBFFF
 [drm] Detected VRAM RAM=320M, BAR=256M
 [drm] RAM width 32bits DDR
 [TTM] Zone  kernel: Available graphics memory: 1896512 kiB.
 usb 7-2: new full speed USB device number 2 using ohci_hcd
 [TTM] Initializing pool allocator.
 [drm] radeon: 320M of VRAM memory ready
 [drm] radeon: 512M of GTT memory ready.
 [drm] Supports vblank timestamp caching Rev 1 (10.10.2010).
 [drm] Driver supports precise vblank timestamp query.
 [drm] radeon: irq initialized.
 [drm] GART: num cpu pages 131072, num gpu pages 131072
 [drm] Loading RS780 Microcode
 radeon :01:05.0: WB enabled
 [drm] ring test succeeded in 1 usecs
 [drm] radeon: ib pool ready.

The card is a Radeon Mobility 4200:

01:05.0 VGA compatible controller: ATI Technologies Inc M880G [Mobility Radeon HD 4200]
        Subsystem: Hewlett-Packard Company Device 307e
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast TAbort- TAbort- MAbort- SERR- PERR- INTx-
        Latency: 0, Cache Line Size: 64 bytes
        Interrupt: pin A routed to IRQ 18
        Region 0: Memory at c000 (32-bit, prefetchable) [size=256M]
        Region 1: I/O ports at 6000 [size=256]
        Region 2: Memory at d640 (32-bit, non-prefetchable) [size=64K]
        Region 5: Memory at d630 (32-bit, non-prefetchable) [size=1M]
        Capabilities: [50] Power Management version 3
                Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
                Status: D0 PME-Enable- DSel=0 DScale=0 PME-
        Capabilities: [a0] Message Signalled Interrupts: Mask- 64bit+ Queue=0/0 Enable-
                Address:   Data: 
        Kernel driver in use: radeon
        Kernel modules: radeon

The problem does not happen with 2.6.38. I will try to bisect this down to a
commit. Alex, please let me know if you need any further information.

Joerg



Re: Linux 2.6.39-rc3

2011-04-12 Thread Alex Deucher
On Tue, Apr 12, 2011 at 5:02 AM, Joerg Roedel j...@8bytes.org wrote:
 On Mon, Apr 11, 2011 at 05:40:11PM -0700, Linus Torvalds wrote:
 Let's hope the release cycle continues like this. I _like_ it when
 people really seem to follow the whole big changes during the merge
 window rules.

 Sorry for disturbing the silence, but radeon seems to have issues. I
 tested -rc3 (and after that -rc1 which also has the issue) on my Laptop
 and it just reboots after (or while?) GFX initialization. The last lines
 of dmesg are:

  Freeing unused kernel memory: 624k freed
  Write protecting the kernel read-only data: 8192k
  Freeing unused kernel memory: 1456k freed
  Freeing unused kernel memory: 16k freed
  udev: starting version 151
  udevd (62): /proc/62/oom_adj is deprecated, please use 
 /proc/62/oom_score_adj instead.
  [drm] Initialized drm 1.1.0 20060810
  [drm] radeon defaulting to kernel modesetting.
  [drm] radeon kernel modesetting enabled.
  radeon :01:05.0: PCI INT A - GSI 18 (level, low) - IRQ 18
  [drm] initializing kernel modesetting (RS880 0x1002:0x9712).
  [drm] register mmio base: 0xD640
  [drm] register mmio size: 65536
  ATOM BIOS: HP_TAG
  radeon :01:05.0: VRAM: 320M 0xC000 - 0xD3FF 
 (320M used)
  radeon :01:05.0: GTT: 512M 0xA000 - 0xBFFF
  [drm] Detected VRAM RAM=320M, BAR=256M
  [drm] RAM width 32bits DDR
  [TTM] Zone  kernel: Available graphics memory: 1896512 kiB.
  usb 7-2: new full speed USB device number 2 using ohci_hcd
  [TTM] Initializing pool allocator.
  [drm] radeon: 320M of VRAM memory ready
  [drm] radeon: 512M of GTT memory ready.
  [drm] Supports vblank timestamp caching Rev 1 (10.10.2010).
  [drm] Driver supports precise vblank timestamp query.
  [drm] radeon: irq initialized.
  [drm] GART: num cpu pages 131072, num gpu pages 131072
  [drm] Loading RS780 Microcode
  radeon :01:05.0: WB enabled
  [drm] ring test succeeded in 1 usecs
  [drm] radeon: ib pool ready.

 The card is a Radeon Mobility 4200:

 01:05.0 VGA compatible controller: ATI Technologies Inc M880G [Mobility 
 Radeon HD 4200]
        Subsystem: Hewlett-Packard Company Device 307e
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- 
 Stepping- SERR- FastB2B- DisINTx-
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast TAbort- 
 TAbort- MAbort- SERR- PERR- INTx-
        Latency: 0, Cache Line Size: 64 bytes
        Interrupt: pin A routed to IRQ 18
        Region 0: Memory at c000 (32-bit, prefetchable) [size=256M]
        Region 1: I/O ports at 6000 [size=256]
        Region 2: Memory at d640 (32-bit, non-prefetchable) [size=64K]
        Region 5: Memory at d630 (32-bit, non-prefetchable) [size=1M]
        Capabilities: [50] Power Management version 3
                Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=0mA 
 PME(D0-,D1-,D2-,D3hot-,D3cold-)
                Status: D0 PME-Enable- DSel=0 DScale=0 PME-
        Capabilities: [a0] Message Signalled Interrupts: Mask- 64bit+ 
 Queue=0/0 Enable-
                Address:   Data: 
        Kernel driver in use: radeon
        Kernel modules: radeon

 The problem does not happen with 2.6.38. I will try to bisect this down to a
 commit. Alex, please let me know if you need any further information.

If you can bisect it, that would be great.  Thanks,

Alex


        Joerg




Re: Linux 2.6.39-rc3

2011-04-12 Thread Joerg Roedel
On Tue, Apr 12, 2011 at 10:15:11AM -0400, Alex Deucher wrote:
 On Tue, Apr 12, 2011 at 5:02 AM, Joerg Roedel j...@8bytes.org wrote:
  On Mon, Apr 11, 2011 at 05:40:11PM -0700, Linus Torvalds wrote:
  Let's hope the release cycle continues like this. I _like_ it when
  people really seem to follow the whole big changes during the merge
  window rules.
 
  Sorry for disturbing the silence, but radeon seems to have issues. I
  tested -rc3 (and after that -rc1 which also has the issue) on my Laptop
  and it just reboots after (or while?) GFX initialization. The last lines
  of dmesg are:
 
   Freeing unused kernel memory: 624k freed
   Write protecting the kernel read-only data: 8192k
   Freeing unused kernel memory: 1456k freed
   Freeing unused kernel memory: 16k freed
   udev: starting version 151
   udevd (62): /proc/62/oom_adj is deprecated, please use 
  /proc/62/oom_score_adj instead.
   [drm] Initialized drm 1.1.0 20060810
   [drm] radeon defaulting to kernel modesetting.
   [drm] radeon kernel modesetting enabled.
   radeon :01:05.0: PCI INT A - GSI 18 (level, low) - IRQ 18
   [drm] initializing kernel modesetting (RS880 0x1002:0x9712).
   [drm] register mmio base: 0xD640
   [drm] register mmio size: 65536
   ATOM BIOS: HP_TAG
   radeon :01:05.0: VRAM: 320M 0xC000 - 0xD3FF 
  (320M used)
   radeon :01:05.0: GTT: 512M 0xA000 - 0xBFFF
   [drm] Detected VRAM RAM=320M, BAR=256M
   [drm] RAM width 32bits DDR
   [TTM] Zone  kernel: Available graphics memory: 1896512 kiB.
   usb 7-2: new full speed USB device number 2 using ohci_hcd
   [TTM] Initializing pool allocator.
   [drm] radeon: 320M of VRAM memory ready
   [drm] radeon: 512M of GTT memory ready.
   [drm] Supports vblank timestamp caching Rev 1 (10.10.2010).
   [drm] Driver supports precise vblank timestamp query.
   [drm] radeon: irq initialized.
   [drm] GART: num cpu pages 131072, num gpu pages 131072
   [drm] Loading RS780 Microcode
   radeon :01:05.0: WB enabled
   [drm] ring test succeeded in 1 usecs
   [drm] radeon: ib pool ready.
 
  The card is a Radeon Mobility 4200:
 
  01:05.0 VGA compatible controller: ATI Technologies Inc M880G [Mobility 
  Radeon HD 4200]
         Subsystem: Hewlett-Packard Company Device 307e
         Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- 
  Stepping- SERR- FastB2B- DisINTx-
         Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast TAbort- 
  TAbort- MAbort- SERR- PERR- INTx-
         Latency: 0, Cache Line Size: 64 bytes
         Interrupt: pin A routed to IRQ 18
         Region 0: Memory at c000 (32-bit, prefetchable) [size=256M]
         Region 1: I/O ports at 6000 [size=256]
         Region 2: Memory at d640 (32-bit, non-prefetchable) [size=64K]
         Region 5: Memory at d630 (32-bit, non-prefetchable) [size=1M]
         Capabilities: [50] Power Management version 3
                 Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=0mA 
  PME(D0-,D1-,D2-,D3hot-,D3cold-)
                 Status: D0 PME-Enable- DSel=0 DScale=0 PME-
         Capabilities: [a0] Message Signalled Interrupts: Mask- 64bit+ 
  Queue=0/0 Enable-
                 Address:   Data: 
         Kernel driver in use: radeon
         Kernel modules: radeon
 
  The problem does not happen with 2.6.38. I will try to bisect this down to a
  commit. Alex, please let me know if you need any further information.
 
 If you can bisect it, that would be great.  Thanks,

Bisecting actually gave a very weird result. It points to

d2137d5af4259f50c19addb8246a186c9ffac325

which is a merge-commit in the x86 tree. Even more weird is that this
notebook is the only machine with these symptoms, all my other boxes are
fine.
During the bisect I tested commits from Yinghai which were good. It
seems like the problem appeared with the merge.

Joerg



Re: Linux 2.6.39-rc3

2011-04-12 Thread Alexandre Demers
Already tracking it here: https://bugzilla.kernel.org/show_bug.cgi?id=33012

Same problem, same culprit commit.

-- 
Alexandre Demers



Re: Linux 2.6.39-rc3

2011-04-12 Thread David Rientjes
On Tue, 12 Apr 2011, Joerg Roedel wrote:

 Bisecting actually gave a very weird result. It points to
 
   d2137d5af4259f50c19addb8246a186c9ffac325
 
 which is a merge-commit in the x86 tree. Even more weird is that this
 notebook is the only machine with these symptoms, all my other boxes are
 fine.
 During the bisect I tested commits from Yinghai which were good. It
 seems like the problem appeared with the merge.
 

Alexandre Demers (cc'd) reports a boot failure bisected to the same merge 
on a 64-bit AMD tricore in 
https://bugzilla.kernel.org/show_bug.cgi?id=33012.  We're awaiting 
earlyprintk= output from that kernel, if possible, and Yinghai asked for 
his .config and dmesg output from the last known working kernel.