[pci] WARNING: CPU: 0 PID: 1 at drivers/gpu/drm/drm_crtc.c:94 drm_warn_on_modeset_not_all_locked()

2014-03-24 Thread Bjorn Helgaas
On Sun, Mar 23, 2014 at 8:53 AM, Fengguang Wu wrote:
> Hi Bjorn,
>
> On Fri, Mar 21, 2014 at 12:42:33PM -0600, Bjorn Helgaas wrote:
>> On Thu, Mar 20, 2014 at 8:09 PM, Fengguang Wu wrote:
>> > // CC Stephane for RAPL related bug
>> >
>> > Bjorn, sorry, this bug report is mis-titled. The only new bug that shows
>> > up in aa11fc58dc is in rapl_pmu_init. It showed up only once, so
>> > it's hard to reproduce and the bisect is likely not accurate.  I'll
>> > retry the bisect with a higher repeat count. Sorry for the disturbance!
>>
>> This testing is potentially very useful, but only if we don't have
>> many false positives.  I spent a lot of time trying to figure this
>> out, and it turned out not to be a problem at all.
>
> I'm sorry for the false report! I'll be careful and improve the
> process. Currently there are many false positives in our internal
> boot error bisects, and we rely on human review to pick the good
> bisects out of the noise. In this case both the script and I made
> mistakes, which led to the wrong report.
>
>> As a procedural question, can you help me figure out how to handle a
>> report like this?  What I *hoped* for would be:
>>
>>   - the config you used
>
> Yes.
>
>>   - the dmesg log from the newest good commit
>
> I'll attach it if the first bad commit's parent commit(s) have some
> noise errors. In that case it may help decide whether the bisect is
> wrong: sometimes one bug hides another, or the bug message
> changes from one form to another.
>
>>   - the dmesg log from the oldest bad commit (the one you bisected to)
>
> OK, I've fixed the script to attach it (rather than attaching the
> branch HEAD's dmesg).
>
>>   - maybe a hint about how I can reproduce the problem, e.g., the qemu
>> config I need
>
> OK, fixed the reporting script to include the QEMU commands for
> reproducing the problem.
>
>> You did supply the config, which is good.  But you only supplied one
>> dmesg log, and it doesn't seem to be from the oldest bad commit.  In
>> fact, it seems to be from some commit that isn't actually in either
>> Linus' tree or in linux-next.  So I don't know what the connection is
>> with the bad commit.
>
> Sorry, the dmesg file is from the internal merge-and-testing branch's
> HEAD -- where the bisect starts.  I'll attach the first bad commit's
> dmesg instead.
>
>> What should I do to try to debug a report like this?  Where should I start?
>
> Thank you very much for the suggestions!

Excellent, thanks!  I think these will make it much easier to figure
out where to start.

Bjorn


[pci] WARNING: CPU: 0 PID: 1 at drivers/gpu/drm/drm_crtc.c:94 drm_warn_on_modeset_not_all_locked()

2014-03-23 Thread Fengguang Wu
Hi Bjorn,

On Fri, Mar 21, 2014 at 12:42:33PM -0600, Bjorn Helgaas wrote:
> On Thu, Mar 20, 2014 at 8:09 PM, Fengguang Wu wrote:
> > // CC Stephane for RAPL related bug
> >
> > Bjorn, sorry, this bug report is mis-titled. The only new bug that shows
> > up in aa11fc58dc is in rapl_pmu_init. It showed up only once, so
> > it's hard to reproduce and the bisect is likely not accurate.  I'll
> > retry the bisect with a higher repeat count. Sorry for the disturbance!
> 
> This testing is potentially very useful, but only if we don't have
> many false positives.  I spent a lot of time trying to figure this
> out, and it turned out not to be a problem at all.

I'm sorry for the false report! I'll be careful and improve the
process. Currently there are many false positives in our internal
boot error bisects, and we rely on human review to pick the good
bisects out of the noise. In this case both the script and I made
mistakes, which led to the wrong report.

> As a procedural question, can you help me figure out how to handle a
> report like this?  What I *hoped* for would be:
> 
>   - the config you used

Yes.

>   - the dmesg log from the newest good commit

I'll attach it if the first bad commit's parent commit(s) have some
noise errors. In that case it may help decide whether the bisect is
wrong: sometimes one bug hides another, or the bug message
changes from one form to another.

>   - the dmesg log from the oldest bad commit (the one you bisected to)

OK, I've fixed the script to attach it (rather than attaching the
branch HEAD's dmesg).

>   - maybe a hint about how I can reproduce the problem, e.g., the qemu
> config I need

OK, fixed the reporting script to include the QEMU commands for
reproducing the problem.

> You did supply the config, which is good.  But you only supplied one
> dmesg log, and it doesn't seem to be from the oldest bad commit.  In
> fact, it seems to be from some commit that isn't actually in either
> Linus' tree or in linux-next.  So I don't know what the connection is
> with the bad commit.

Sorry, the dmesg file is from the internal merge-and-testing branch's
HEAD -- where the bisect starts.  I'll attach the first bad commit's
dmesg instead.

> What should I do to try to debug a report like this?  Where should I start?

Thank you very much for the suggestions!

Regards,
Fengguang

> Bjorn
> 
> > [2.812392] Unpacking initramfs...
> > [4.937582] Freeing initrd memory: 3276K (93cbd000 - 93ff)
> > [4.952113] BUG: unable to handle kernel NULL pointer dereference at 003c
> > [4.952871] IP: [<81c439fb>] rapl_pmu_init+0xed/0x165
> > [4.954190] *pde = 
> > [4.954619] Oops:  [#1]
> > [4.955440] CPU: 0 PID: 1 Comm: swapper Not tainted 3.14.0-rc1-00023-gaa11fc5 #1
> > [4.956050] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
> > [4.956672] task: 80030c20 ti: 80032000 task.ti: 80032000
> > [4.957295] EIP: 0060:[<81c439fb>] EFLAGS: 0246 CPU: 0
> > [4.957831] EIP is at rapl_pmu_init+0xed/0x165
> >
> > Full dmesg attached.
> >
> > Thanks,
> > Fengguang
> >
> > On Thu, Mar 20, 2014 at 04:50:08PM -0600, Bjorn Helgaas wrote:
> > >> On Thu, Mar 20, 2014 at 6:41 AM, Fengguang Wu wrote:
> >> > Greetings,
> >> >
> >> > I got the below dmesg and the first bad commit is
> >> >
> > >> > git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci.git pci/resource
> >> >
> >> > commit aa11fc58dc71c27701b1f9a529a36a38d4337722
> >> > Author: Bjorn Helgaas 
> >> > AuthorDate: Fri Mar 7 13:39:01 2014 -0700
> >> > Commit: Bjorn Helgaas 
> >> > CommitDate: Wed Mar 19 15:00:16 2014 -0600
> >> >
> >> > PCI: Check all IORESOURCE_TYPE_BITS in pci_bus_alloc_from_region()
> >> >
> > >> > When allocating space from a bus resource, i.e., from apertures leading to
> > >> > this bus, make sure the entire resource type matches.  The previous code
> > >> > assumed the IORESOURCE_TYPE_BITS field was a bitmask with only a single bit
> > >> > set, but this is not true.  IORESOURCE_TYPE_BITS is really an enumeration,
> > >> > and we have to check all the bits.
> >> >
> >> > See 72dcb1197228 ("resources: Add register address resource type").
> >> >
> >> > No functional change.  If we used 

[pci] WARNING: CPU: 0 PID: 1 at drivers/gpu/drm/drm_crtc.c:94 drm_warn_on_modeset_not_all_locked()

2014-03-21 Thread Bjorn Helgaas
On Thu, Mar 20, 2014 at 8:09 PM, Fengguang Wu wrote:
> // CC Stephane for RAPL related bug
>
> Bjorn, sorry, this bug report is mis-titled. The only new bug that shows
> up in aa11fc58dc is in rapl_pmu_init. It showed up only once, so
> it's hard to reproduce and the bisect is likely not accurate.  I'll
> retry the bisect with a higher repeat count. Sorry for the disturbance!

This testing is potentially very useful, but only if we don't have
many false positives.  I spent a lot of time trying to figure this
out, and it turned out not to be a problem at all.

As a procedural question, can you help me figure out how to handle a
report like this?  What I *hoped* for would be:

  - the config you used
  - the dmesg log from the newest good commit
  - the dmesg log from the oldest bad commit (the one you bisected to)
  - maybe a hint about how I can reproduce the problem, e.g., the qemu
config I need

You did supply the config, which is good.  But you only supplied one
dmesg log, and it doesn't seem to be from the oldest bad commit.  In
fact, it seems to be from some commit that isn't actually in either
Linus' tree or in linux-next.  So I don't know what the connection is
with the bad commit.

What should I do to try to debug a report like this?  Where should I start?

Bjorn

> [2.812392] Unpacking initramfs...
> [4.937582] Freeing initrd memory: 3276K (93cbd000 - 93ff)
> [4.952113] BUG: unable to handle kernel NULL pointer dereference at 003c
> [4.952871] IP: [<81c439fb>] rapl_pmu_init+0xed/0x165
> [4.954190] *pde = 
> [4.954619] Oops:  [#1]
> [4.955440] CPU: 0 PID: 1 Comm: swapper Not tainted 3.14.0-rc1-00023-gaa11fc5 #1
> [4.956050] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
> [4.956672] task: 80030c20 ti: 80032000 task.ti: 80032000
> [4.957295] EIP: 0060:[<81c439fb>] EFLAGS: 0246 CPU: 0
> [4.957831] EIP is at rapl_pmu_init+0xed/0x165
>
> Full dmesg attached.
>
> Thanks,
> Fengguang
>
> On Thu, Mar 20, 2014 at 04:50:08PM -0600, Bjorn Helgaas wrote:
>> On Thu, Mar 20, 2014 at 6:41 AM, Fengguang Wu wrote:
>> > Greetings,
>> >
>> > I got the below dmesg and the first bad commit is
>> >
>> > git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci.git pci/resource
>> >
>> > commit aa11fc58dc71c27701b1f9a529a36a38d4337722
>> > Author: Bjorn Helgaas 
>> > AuthorDate: Fri Mar 7 13:39:01 2014 -0700
>> > Commit: Bjorn Helgaas 
>> > CommitDate: Wed Mar 19 15:00:16 2014 -0600
>> >
>> > PCI: Check all IORESOURCE_TYPE_BITS in pci_bus_alloc_from_region()
>> >
>> > When allocating space from a bus resource, i.e., from apertures leading to
>> > this bus, make sure the entire resource type matches.  The previous code
>> > assumed the IORESOURCE_TYPE_BITS field was a bitmask with only a single bit
>> > set, but this is not true.  IORESOURCE_TYPE_BITS is really an enumeration,
>> > and we have to check all the bits.
>> >
>> > See 72dcb1197228 ("resources: Add register address resource type").
>> >
>> > No functional change.  If we used this path for allocating IRQs, DMA
>> > channels, or bus numbers, this would fix a bug because those types are
>> > indistinguishable when masked by IORESOURCE_IO | IORESOURCE_MEM.  But we
>> > don't, so this shouldn't make any difference.
>> >
>> > Signed-off-by: Bjorn Helgaas 
>>
>> Thanks (I think).  I'm afraid I'm going to need some more help to
>> debug this.  I built aa11fc58dc with the config you supplied and
>> booted it on qemu with no real issues (it didn't boot all the way
>> because the config doesn't include a driver for my root disk, but
>> that's to be expected).
>>
>> The dmesg you supplied is for some other commit 2d18516 that I don't
>> have, so I'm confused about why it's not from aa11fc58dc.
>>
>> I did reproduce what appears to be basically the same problem with
>> a654dc797f3e, which is the 20140320 linux-next tree.  I backed up to
>> 93ecdc077282, which is where pci/next was merged (this includes
>> aa11fc58dc), but I could not reproduce the problem there.
>>
>> So bottom line, I'm confused because your bisection doesn't match what
>> I'm seeing, and I don't want to spend more time flailing around.
>>
>> Bjorn
>>
>>
>> > 

[pci] WARNING: CPU: 0 PID: 1 at drivers/gpu/drm/drm_crtc.c:94 drm_warn_on_modeset_not_all_locked()

2014-03-21 Thread Fengguang Wu
// CC Stephane for RAPL related bug

Bjorn, sorry, this bug report is mis-titled. The only new bug that shows
up in aa11fc58dc is in rapl_pmu_init. It showed up only once, so
it's hard to reproduce and the bisect is likely not accurate.  I'll
retry the bisect with a higher repeat count. Sorry for the disturbance!
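
As a rough illustration of why the repeat count matters, the sketch below
estimates how often an intermittent boot failure would be missed. It assumes
boots fail independently with a fixed probability; the 1-in-20 rate is only an
assumption based on the single failure in 20 boots reported for aa11fc58dc,
and the function names are hypothetical.

```python
import math

def miss_probability(fail_rate, boots):
    """Chance that `boots` independent boots all succeed on a buggy commit."""
    return (1.0 - fail_rate) ** boots

def boots_needed(fail_rate, max_miss):
    """Smallest boot count whose miss probability is at most `max_miss`."""
    return math.ceil(math.log(max_miss) / math.log(1.0 - fail_rate))

# Assumed failure rate: 1 failure in 20 boots (for illustration only).
rate = 1 / 20
print(f"miss chance with 20 boots: {miss_probability(rate, 20):.3f}")
print(f"boots for <=1% miss chance: {boots_needed(rate, 0.01)}")
```

Under these assumptions, 20 clean boots would still label a bad commit as good
about 36% of the time, and roughly 90 boots per commit would be needed to push
that below 1%.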

[2.812392] Unpacking initramfs...
[4.937582] Freeing initrd memory: 3276K (93cbd000 - 93ff)
[4.952113] BUG: unable to handle kernel NULL pointer dereference at 003c
[4.952871] IP: [<81c439fb>] rapl_pmu_init+0xed/0x165
[4.954190] *pde = 
[4.954619] Oops:  [#1]
[4.955440] CPU: 0 PID: 1 Comm: swapper Not tainted 3.14.0-rc1-00023-gaa11fc5 #1
[4.956050] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[4.956672] task: 80030c20 ti: 80032000 task.ti: 80032000
[4.957295] EIP: 0060:[<81c439fb>] EFLAGS: 0246 CPU: 0
[4.957831] EIP is at rapl_pmu_init+0xed/0x165

Full dmesg attached.

Thanks,
Fengguang

On Thu, Mar 20, 2014 at 04:50:08PM -0600, Bjorn Helgaas wrote:
> On Thu, Mar 20, 2014 at 6:41 AM, Fengguang Wu wrote:
> > Greetings,
> >
> > I got the below dmesg and the first bad commit is
> >
> > git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci.git pci/resource
> >
> > commit aa11fc58dc71c27701b1f9a529a36a38d4337722
> > Author: Bjorn Helgaas 
> > AuthorDate: Fri Mar 7 13:39:01 2014 -0700
> > Commit: Bjorn Helgaas 
> > CommitDate: Wed Mar 19 15:00:16 2014 -0600
> >
> > PCI: Check all IORESOURCE_TYPE_BITS in pci_bus_alloc_from_region()
> >
> > When allocating space from a bus resource, i.e., from apertures leading to
> > this bus, make sure the entire resource type matches.  The previous code
> > assumed the IORESOURCE_TYPE_BITS field was a bitmask with only a single bit
> > set, but this is not true.  IORESOURCE_TYPE_BITS is really an enumeration,
> > and we have to check all the bits.
> >
> > See 72dcb1197228 ("resources: Add register address resource type").
> >
> > No functional change.  If we used this path for allocating IRQs, DMA
> > channels, or bus numbers, this would fix a bug because those types are
> > indistinguishable when masked by IORESOURCE_IO | IORESOURCE_MEM.  But we
> > don't, so this shouldn't make any difference.
> >
> > Signed-off-by: Bjorn Helgaas 
> 
> Thanks (I think).  I'm afraid I'm going to need some more help to
> debug this.  I built aa11fc58dc with the config you supplied and
> booted it on qemu with no real issues (it didn't boot all the way
> because the config doesn't include a driver for my root disk, but
> that's to be expected).
> 
> The dmesg you supplied is for some other commit 2d18516 that I don't
> have, so I'm confused about why it's not from aa11fc58dc.
> 
> I did reproduce what appears to be basically the same problem with
> a654dc797f3e, which is the 20140320 linux-next tree.  I backed up to
> 93ecdc077282, which is where pci/next was merged (this includes
> aa11fc58dc), but I could not reproduce the problem there.
> 
> So bottom line, I'm confused because your bisection doesn't match what
> I'm seeing, and I don't want to spend more time flailing around.
> 
> Bjorn
> 
> 
> > | | aa11fc58dc | 2d18516523 |
> > | boot_successes | 19 | 0 |
> > | boot_failures | 1 | 19 |
> > | BUG:unable_to_handle_kernel_NULL_pointer_dereference | 1 | 1 |
> > | Oops | 1 | 1 |
> > | EIP_is_at_rapl_pmu_init | 1 | 1 |
> > | 

[pci] WARNING: CPU: 0 PID: 1 at drivers/gpu/drm/drm_crtc.c:94 drm_warn_on_modeset_not_all_locked()

2014-03-20 Thread Bjorn Helgaas
On Thu, Mar 20, 2014 at 6:41 AM, Fengguang Wu wrote:
> Greetings,
>
> I got the below dmesg and the first bad commit is
>
> git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci.git pci/resource
>
> commit aa11fc58dc71c27701b1f9a529a36a38d4337722
> Author: Bjorn Helgaas 
> AuthorDate: Fri Mar 7 13:39:01 2014 -0700
> Commit: Bjorn Helgaas 
> CommitDate: Wed Mar 19 15:00:16 2014 -0600
>
> PCI: Check all IORESOURCE_TYPE_BITS in pci_bus_alloc_from_region()
>
> When allocating space from a bus resource, i.e., from apertures leading to
> this bus, make sure the entire resource type matches.  The previous code
> assumed the IORESOURCE_TYPE_BITS field was a bitmask with only a single bit
> set, but this is not true.  IORESOURCE_TYPE_BITS is really an enumeration,
> and we have to check all the bits.
>
> See 72dcb1197228 ("resources: Add register address resource type").
>
> No functional change.  If we used this path for allocating IRQs, DMA
> channels, or bus numbers, this would fix a bug because those types are
> indistinguishable when masked by IORESOURCE_IO | IORESOURCE_MEM.  But we
> don't, so this shouldn't make any difference.
>
> Signed-off-by: Bjorn Helgaas 
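
To illustrate the point the commit message makes, here is a minimal Python
sketch (not the kernel code) comparing a two-bit mask with a comparison of the
whole type field. The constant values are believed to match
include/linux/ioport.h of that era but should be treated as assumptions, and
the helper names are hypothetical.

```python
# Resource type encodings, assumed to match include/linux/ioport.h.
IORESOURCE_TYPE_BITS = 0x00001F00
IORESOURCE_IO = 0x00000100
IORESOURCE_MEM = 0x00000200
IORESOURCE_REG = 0x00000300
IORESOURCE_IRQ = 0x00000400
IORESOURCE_DMA = 0x00000800
IORESOURCE_BUS = 0x00001000

def same_type_two_bit_mask(a, b):
    """Compare only the IO and MEM bits (the single-bit assumption)."""
    return ((a ^ b) & (IORESOURCE_IO | IORESOURCE_MEM)) == 0

def same_type_all_bits(a, b):
    """Compare the whole type field, treating it as an enumeration."""
    return ((a ^ b) & IORESOURCE_TYPE_BITS) == 0

pairs = [
    ("IO vs MEM", IORESOURCE_IO, IORESOURCE_MEM),
    ("IRQ vs DMA", IORESOURCE_IRQ, IORESOURCE_DMA),
    ("DMA vs BUS", IORESOURCE_DMA, IORESOURCE_BUS),
]
for name, a, b in pairs:
    # Print: pair name, result with the two-bit mask, result with all bits.
    print(name, same_type_two_bit_mask(a, b), same_type_all_bits(a, b))
```

With the two-bit mask, IRQ, DMA, and bus-number types all reduce to zero and so
look identical; comparing every bit of the type field tells them apart. IO and
MEM are distinguished either way, which is consistent with the "no functional
change" note for the paths that only allocate those two types.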

Thanks (I think).  I'm afraid I'm going to need some more help to
debug this.  I built aa11fc58dc with the config you supplied and
booted it on qemu with no real issues (it didn't boot all the way
because the config doesn't include a driver for my root disk, but
that's to be expected).

The dmesg you supplied is for some other commit 2d18516 that I don't
have, so I'm confused about why it's not from aa11fc58dc.

I did reproduce what appears to be basically the same problem with
a654dc797f3e, which is the 20140320 linux-next tree.  I backed up to
93ecdc077282, which is where pci/next was merged (this includes
aa11fc58dc), but I could not reproduce the problem there.

So bottom line, I'm confused because your bisection doesn't match what
I'm seeing, and I don't want to spend more time flailing around.

Bjorn


> | | aa11fc58dc | 2d18516523 |
> | boot_successes | 19 | 0 |
> | boot_failures | 1 | 19 |
> | BUG:unable_to_handle_kernel_NULL_pointer_dereference | 1 | 1 |
> | Oops | 1 | 1 |
> | EIP_is_at_rapl_pmu_init | 1 | 1 |
> | Kernel_panic-not_syncing:Attempted_to_kill_init_exitcode= | 1 | 1 |
> | backtrace:rapl_pmu_init | 1 | 1 |
> | backtrace:kernel_init_freeable | 1 | 19 |
> | WARNING:CPU:PID:at_drivers/gpu/drm/drm_crtc.c:drm_warn_on_modeset_not_all_locked() | 0 | 18 |
> | WARNING:CPU:PID:at_drivers/gpu/drm/drm_crtc_helper.c:drm_helper_encoder_in_use() | 0 | 18 |
> | WARNING:CPU:PID:at_drivers/gpu/drm/drm_crtc_helper.c:drm_helper_crtc_in_use() | 0 | 18 |
> | WARNING:CPU:PID:at_drivers/gpu/drm/drm_crtc_helper.c:drm_helper_probe_single_connector_modes() | 0 | 18 |
> | WARNING:CPU:PID:at_drivers/gpu/drm/drm_modes.c:drm_mode_probed_add() | 0 | 18 |
> | WARNING:CPU:PID:at_drivers/gpu/drm/drm_modes.c:drm_mode_connector_list_update() | 0 | 18 |
> | backtrace:drm_helper_disable_unused_functions | 0 | 18 |
> | backtrace:cirrus_fbdev_init | 0 | 18 |
> | backtrace:cirrus_modeset_init | 0 | 18 |
> | backtrace:__pci_register_driver | 0 | 18 |
> | backtrace:drm_pci_init | 0 | 18 |
> | backtrace:cirrus_init