Bug#682007: linux-image-3.2.0-0.bpo.2-amd64: NULL pointer dereference in __fscache_read_or_alloc_pages

2012-07-20 Thread Brian Kroth

Ben Hutchings  2012-07-19 16:21:
> On Thu, Jul 19, 2012 at 09:03:26AM -0500, Brian Kroth wrote:
>> Ben Hutchings  2012-07-19 13:32:
>>> On Thu, 2012-07-19 at 13:37 +0200, Bastian Blank wrote:
>>>> On Wed, Jul 18, 2012 at 11:16:33AM -0500, Brian Kroth wrote:
>>>>> ** Tainted: PO (4097)
>>>>>  * Proprietary module has been loaded.
>>>>>  * Out-of-tree module has been loaded.
>>>>
>>>>> 21:04:00 kefka [187206.183487] Pid: 20810, comm: MATLAB Tainted: P
>>>>> O 3.2.0-0.bpo.2-amd64 #1
>>>>
>>>> We don't support proprietary stuff. Please remove and try again.
>>>
>>> To be clear, Bastian is referring to the proprietary kernel module
>>> (nvidia).
>>
>> Ok.  The driver is required for some third party engineering
>> software we have to run, but I can rig a spare machine to run some
>> of these other jobs without it for a bit.  I'll report back if/when
>> I have a new panic.
>>
>> I will note though that the driver comes from the debian provided
>> packages (albeit from backports instead of stable).
>
> I realise that, but it's not part of Debian proper and none of us
> signed up to debug drivers that don't come with source.

Fair enough.

I've attached a new set of kernel messages captured from some runs 
without the nvidia driver loaded, but with the rest of the setup the 
same.  It doesn't quite seem to be tickling the same code path: this 
time it's an invalid opcode message instead of a NULL pointer 
dereference.  I'll let it run for a while longer to see if I can get 
the same style of message back.  Unfortunately I don't know exactly 
how to reproduce it.


Thanks,
Brian
Jul 19 15:45:53 kefka [ 4289.632673] ------------[ cut here ]------------
Jul 19 15:45:53 kefka [ 4289.632711] kernel BUG at /build/buildd-linux_3.2.20-1~bpo60+1-amd64-tQMw4f/linux-3.2.20/fs/buffer.c:3088!
Jul 19 15:45:53 kefka [ 4289.632756] invalid opcode:  [#1] 
Jul 19 15:45:53 kefka SMP 
Jul 19 15:45:53 kefka 
Jul 19 15:45:53 kefka [ 4289.632784] CPU 3 
Jul 19 15:45:53 kefka 
Jul 19 15:45:53 kefka [ 4289.632792] Modules linked in:
Jul 19 15:45:53 kefka acpi_cpufreq
Jul 19 15:45:53 kefka mperf
Jul 19 15:45:53 kefka cpufreq_userspace
Jul 19 15:45:53 kefka cpufreq_powersave
Jul 19 15:45:53 kefka cpufreq_conservative
Jul 19 15:45:53 kefka cpufreq_stats
Jul 19 15:45:53 kefka autofs4
Jul 19 15:45:53 kefka cachefiles
Jul 19 15:45:53 kefka kvm_intel
Jul 19 15:45:53 kefka kvm
Jul 19 15:45:53 kefka binfmt_misc
Jul 19 15:45:53 kefka nfsd
Jul 19 15:45:53 kefka nfs
Jul 19 15:45:53 kefka lockd
Jul 19 15:45:53 kefka fscache
Jul 19 15:45:53 kefka auth_rpcgss
Jul 19 15:45:53 kefka nfs_acl
Jul 19 15:45:53 kefka sunrpc
Jul 19 15:45:53 kefka netconsole
Jul 19 15:45:53 kefka configfs
Jul 19 15:45:53 kefka ext3
Jul 19 15:45:53 kefka jbd
Jul 19 15:45:53 kefka coretemp
Jul 19 15:45:53 kefka ipmi_watchdog
Jul 19 15:45:53 kefka ipmi_devintf
Jul 19 15:45:53 kefka ipmi_si
Jul 19 15:45:53 kefka ipmi_msghandler
Jul 19 15:45:53 kefka fuse
Jul 19 15:45:53 kefka uhci_hcd
Jul 19 15:45:53 kefka ohci_hcd
Jul 19 15:45:53 kefka tpm_infineon
Jul 19 15:45:53 kefka snd_hda_codec_realtek
Jul 19 15:45:53 kefka snd_hda_intel
Jul 19 15:45:53 kefka snd_hda_codec
Jul 19 15:45:53 kefka snd_hwdep
Jul 19 15:45:53 kefka snd_pcm_oss
Jul 19 15:45:53 kefka snd_mixer_oss
Jul 19 15:45:53 kefka snd_pcm
Jul 19 15:45:53 kefka snd_seq_midi
Jul 19 15:45:53 kefka snd_rawmidi
Jul 19 15:45:53 kefka snd_seq_midi_event
Jul 19 15:45:53 kefka snd_seq
Jul 19 15:45:53 kefka snd_timer
Jul 19 15:45:53 kefka snd_seq_device
Jul 19 15:45:53 kefka snd
Jul 19 15:45:53 kefka i2c_i801
Jul 19 15:45:53 kefka tpm_tis
Jul 19 15:45:53 kefka tpm
Jul 19 15:45:53 kefka processor
Jul 19 15:45:53 kefka soundcore
Jul 19 15:45:53 kefka hp_wmi
Jul 19 15:45:53 kefka sparse_keymap
Jul 19 15:45:53 kefka rfkill
Jul 19 15:45:53 kefka tpm_bios
Jul 19 15:45:53 kefka snd_page_alloc
Jul 19 15:45:53 kefka thermal_sys
Jul 19 15:45:53 kefka i2c_core
Jul 19 15:45:53 kefka psmouse
Jul 19 15:45:53 kefka wmi
Jul 19 15:45:53 kefka serio_raw
Jul 19 15:45:53 kefka evdev
Jul 19 15:45:53 kefka joydev
Jul 19 15:45:53 kefka button
Jul 19 15:45:53 kefka ext4
Jul 19 15:45:53 kefka mbcache
Jul 19 15:45:53 kefka jbd2
Jul 19 15:45:53 kefka crc16
Jul 19 15:45:53 kefka dm_mod
Jul 19 15:45:53 kefka raid10
Jul 19 15:45:53 kefka raid456
Jul 19 15:45:53 kefka async_raid6_recov
Jul 19 15:45:53 kefka async_pq
Jul 19 15:45:53 kefka raid6_pq
Jul 19 15:45:53 kefka async_xor
Jul 19 15:45:53 kefka xor
Jul 19 15:45:53 kefka async_memcpy
Jul 19 15:45:53 kefka async_tx
Jul 19 15:45:53 kefka raid1
Jul 19 15:45:53 kefka raid0
Jul 19 15:45:53 kefka multipath
Jul 19 15:45:53 kefka linear
Jul 19 15:45:53 kefka md_mod
Jul 19 15:45:53 kefka hid_microsoft
Jul 19 15:45:53 kefka usbhid
Jul 19 15:45:53 kefka hid
Jul 19 15:45:53 kefka sg
Jul 19 15:45:53 kefka sr_mod
Jul 19 15:45:53 kefka sd_mod
Jul 19 15:45:53 kefka cdrom
Jul 19 15:45:53 kefka crc_t10dif
Jul 19 15:45:53 kefka ahci
Jul 19 15:45:53 kefka libahci
Jul 19 15:45:53 kefka libata
Jul 19 15:45:53 kefka scsi_mod
Jul 19 15:45:53 kefka ehci_hcd
Jul 19 15:45:53 kefka e1000e
Ju
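
In this capture each module name appears on its own syslog line, which makes the oops hard to read. A minimal sketch of how such fragments could be stitched back together, assuming that every new kernel message starts with a bracketed timestamp and that the syslog prefix always has the "Mon DD HH:MM:SS host" shape seen above (the function name `reassemble` is made up for illustration):

```python
import re

# Syslog prefix as it appears in the capture, e.g. "Jul 19 15:45:53 kefka ".
PREFIX = re.compile(r"^[A-Z][a-z]{2} +\d+ \d\d:\d\d:\d\d \S+ ?")
# A kernel timestamp such as "[ 4289.632792]" is taken to start a new message.
KSTAMP = re.compile(r"^\[\s*\d+\.\d+\]")


def reassemble(lines):
    """Join fragments without a kernel timestamp onto the preceding message."""
    messages = []
    for raw in lines:
        text = PREFIX.sub("", raw.rstrip("\n")).strip()
        if not text:
            continue  # prefix-only line, nothing to keep
        if KSTAMP.match(text) or not messages:
            messages.append(text)
        else:
            messages[-1] += " " + text
    return messages


if __name__ == "__main__":
    sample = [
        "Jul 19 15:45:53 kefka [ 4289.632784] CPU 3 ",
        "Jul 19 15:45:53 kefka ",
        "Jul 19 15:45:53 kefka [ 4289.632792] Modules linked in:",
        "Jul 19 15:45:53 kefka acpi_cpufreq",
        "Jul 19 15:45:53 kefka mperf",
    ]
    for message in reassemble(sample):
        print(message)
```

Lines that lost their syslog prefix to mail wrapping (such as the path after "kernel BUG at") are also appended to the previous message, since they carry no timestamp of their own.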

Bug#682007: linux-image-3.2.0-0.bpo.2-amd64: NULL pointer dereference in __fscache_read_or_alloc_pages

2012-07-19 Thread Ben Hutchings
On Thu, Jul 19, 2012 at 09:03:26AM -0500, Brian Kroth wrote:
> Ben Hutchings  2012-07-19 13:32:
> >On Thu, 2012-07-19 at 13:37 +0200, Bastian Blank wrote:
> >>On Wed, Jul 18, 2012 at 11:16:33AM -0500, Brian Kroth wrote:
> >>> ** Tainted: PO (4097)
> >>>  * Proprietary module has been loaded.
> >>>  * Out-of-tree module has been loaded.
> >>
> >>> 21:04:00 kefka [187206.183487] Pid: 20810, comm: MATLAB Tainted: P
> >>> O 3.2.0-0.bpo.2-amd64 #1
> >>
> >>We don't support proprietary stuff. Please remove and try again.
> >
> >To be clear, Bastian is referring to the proprietary kernel module
> >(nvidia).
> 
> Ok.  The driver is required for some third party engineering
> software we have to run, but I can rig a spare machine to run some
> of these other jobs without it for a bit.  I'll report back if/when
> I have a new panic.
>
> I will note though that the driver comes from the debian provided
> packages (albeit from backports instead of stable).

I realise that, but it's not part of Debian proper and none of us
signed up to debug drivers that don't come with source.

Ben.

-- 
Ben Hutchings
We get into the habit of living before acquiring the habit of thinking.
  - Albert Camus


-- 
To UNSUBSCRIBE, email to debian-kernel-requ...@lists.debian.org
with a subject of "unsubscribe". Trouble? Contact listmas...@lists.debian.org
Archive: http://lists.debian.org/20120719152127.gy1...@decadent.org.uk



Bug#682007: linux-image-3.2.0-0.bpo.2-amd64: NULL pointer dereference in __fscache_read_or_alloc_pages

2012-07-19 Thread Brian Kroth

Ben Hutchings  2012-07-19 13:32:
> On Thu, 2012-07-19 at 13:37 +0200, Bastian Blank wrote:
>> On Wed, Jul 18, 2012 at 11:16:33AM -0500, Brian Kroth wrote:
>>> ** Tainted: PO (4097)
>>>  * Proprietary module has been loaded.
>>>  * Out-of-tree module has been loaded.
>>
>>> 21:04:00 kefka [187206.183487] Pid: 20810, comm: MATLAB Tainted: P
>>> O 3.2.0-0.bpo.2-amd64 #1
>>
>> We don't support proprietary stuff. Please remove and try again.
>
> To be clear, Bastian is referring to the proprietary kernel module
> (nvidia).

Ok.  The driver is required for some third party engineering software we 
have to run, but I can rig a spare machine to run some of these other 
jobs without it for a bit.  I'll report back if/when I have a new panic.

I will note though that the driver comes from the debian provided 
packages (albeit from backports instead of stable).


Thanks,
Brian




Bug#682007: linux-image-3.2.0-0.bpo.2-amd64: NULL pointer dereference in __fscache_read_or_alloc_pages

2012-07-19 Thread Ben Hutchings
On Thu, 2012-07-19 at 13:37 +0200, Bastian Blank wrote:
> On Wed, Jul 18, 2012 at 11:16:33AM -0500, Brian Kroth wrote:
> > ** Tainted: PO (4097)
> >  * Proprietary module has been loaded.
> >  * Out-of-tree module has been loaded.
> 
> > 21:04:00 kefka [187206.183487] Pid: 20810, comm: MATLAB Tainted: P
> > O 3.2.0-0.bpo.2-amd64 #1
> 
> We don't support proprietary stuff. Please remove and try again.

To be clear, Bastian is referring to the proprietary kernel module
(nvidia).

Ben.

-- 
Ben Hutchings
DNRC Motto:  I can please only one person per day.
Today is not your day.  Tomorrow isn't looking good either.




Bug#682007: linux-image-3.2.0-0.bpo.2-amd64: NULL pointer dereference in __fscache_read_or_alloc_pages

2012-07-19 Thread Bastian Blank
On Wed, Jul 18, 2012 at 11:16:33AM -0500, Brian Kroth wrote:
> ** Tainted: PO (4097)
>  * Proprietary module has been loaded.
>  * Out-of-tree module has been loaded.

> 21:04:00 kefka [187206.183487] Pid: 20810, comm: MATLAB Tainted: P
> O 3.2.0-0.bpo.2-amd64 #1

We don't support proprietary stuff. Please remove and try again.





Bug#682007: linux-image-3.2.0-0.bpo.2-amd64: NULL pointer dereference in __fscache_read_or_alloc_pages

2012-07-18 Thread Brian Kroth

Subject: linux-image-3.2.0-0.bpo.2-amd64: NULL pointer dereference in 
__fscache_read_or_alloc_pages
Package: src:linux
Version: 3.2.20-1~bpo60+1
Severity: important

** Please type your report below this line ***

I have a number of machines running linux-image-3.2.0-0.bpo.2-amd64 from 
squeeze-backports that are experiencing a NULL pointer dereference bug 
in __fscache_read_or_alloc_pages fairly consistently.  Out of ~120 
machines at least 10 of them seem to experience a panic once a day.  The 
full details of a typical panic as captured via netconsole are included 
below.


The relevant setup details are as follows:

Third party applications (e.g. MATLAB) are installed on an NFS server.  
Clients mount the exports (one fs/export per application) via NFSv4's 
root export traversal mechanism (I forget offhand what it's really 
called).  The mount options include "ro" and "fsc", and the clients run 
cachefilesd (0.10.4, since 0.9-3 had an excessive debug logging bug, 
#620732) so that the mostly static and read-only application data can 
be cached locally.


The mount point for the cachefilesd looks like this:
/dev/mapper/vg-fscache /var/cache/fscache ext4 rw,relatime,errors=panic,user_xattr,acl,barrier=1,data=ordered 0 0
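
To confirm which NFS mounts on a client actually have caching enabled, the mount table can be scanned for the "fsc" option. This is only a sketch: it assumes the kernel lists "fsc" among the options for such mounts, and the server name and NFS paths in the sample are made up (only the ext4 line comes from this report).

```python
def nfs_mounts_with_fsc(mounts_text):
    """Return mount points of NFS filesystems whose options include 'fsc'."""
    result = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # not a complete mount table entry
        _device, mount_point, fs_type, options = fields[:4]
        if fs_type in ("nfs", "nfs4") and "fsc" in options.split(","):
            result.append(mount_point)
    return result


# Hypothetical sample; on a real client the text would come from /proc/mounts.
SAMPLE = """\
/dev/mapper/vg-fscache /var/cache/fscache ext4 rw,relatime,errors=panic,user_xattr,acl,barrier=1,data=ordered 0 0
nfsserver:/apps/matlab /apps/matlab nfs4 ro,relatime,vers=4,fsc 0 0
nfsserver:/scratch /scratch nfs rw,relatime,vers=3 0 0
"""

if __name__ == "__main__":
    print(nfs_mounts_with_fsc(SAMPLE))
```

On a live machine the same function could be fed the contents of /proc/mounts instead of the sample string.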

The cachefilesd.conf file is also included below in case it matters.

From all of the detailed panic reports I've looked at, the bug seems to 
be triggered in a MATLAB process (comm), but that might just be because 
this is our less busy time of the year, so there are more Condor 
compute jobs running while the machines are otherwise idle.  Since many 
of those jobs need the third party apps, they'll tend to be using the 
fscache more frequently.


What that also means is that I haven't yet seen any of these bugs show 
up referencing data that's on one of our other NFSv3 mounts.  Those 
also all have fsc turned on.  I'm not sure if that's relevant or just a 
red herring, though.


Let me know if you need any more details.

Thanks,
Brian

-- Package-specific info:
** Version:
Linux version 3.2.0-0.bpo.2-amd64 (Debian 3.2.20-1~bpo60+1) (debian-kernel@lists.debian.org) (gcc version 4.4.5 (Debian 4.4.5-8) ) #1 SMP Fri Jun 29 20:42:29 UTC 2012

** Command line:
BOOT_IMAGE=/vmlinuz-3.2.0-0.bpo.2-amd64 root=/dev/mapper/vg-root ro panic=30 rootdelay=10 quiet

** Tainted: PO (4097)
 * Proprietary module has been loaded.
 * Out-of-tree module has been loaded.
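
The number in parentheses is the taint bitmask: 4097 = 1 + 4096, i.e. bit 0 (P) plus bit 12 (O). A small sketch that decodes such a value, assuming the bit positions used by the kernel's taint flags; only a handful of flags are listed, and the descriptions other than the two shown above are paraphrased:

```python
# Subset of kernel taint bits: bit number -> (letter, description).
TAINT_FLAGS = {
    0: ("P", "Proprietary module has been loaded."),
    1: ("F", "Module has been forcibly loaded."),
    7: ("D", "Kernel has oopsed before."),
    9: ("W", "A kernel warning has occurred."),
    12: ("O", "Out-of-tree module has been loaded."),
}


def decode_taint(value):
    """Return (letter, description) pairs for each known bit set in value."""
    return [TAINT_FLAGS[bit] for bit in sorted(TAINT_FLAGS) if value & (1 << bit)]


if __name__ == "__main__":
    for letter, text in decode_taint(4097):
        print(letter, text)
```

On a running system the current value can be read from /proc/sys/kernel/tainted and passed to `decode_taint` as an integer.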

** Kernel log:
[   28.191814] ACPI: Power Button [PWRB]
[   28.191852] input: Power Button as /devices/LNXSYSTM:00/LNXPWRBN:00/input/input4
[   28.191874] ACPI: Power Button [PWRF]
[   28.200722] wmi: Mapper loaded
[   28.658193] i801_smbus :00:1f.3: PCI INT C -> GSI 18 (level, low) -> IRQ 18
[   28.888998] tpm_tis 00:0b: 1.2 TPM (device-id 0xB, rev-id 16)
[   28.949523] input: HP WMI hotkeys as /devices/virtual/input/input5
[   29.264980] nvidia: module license 'NVIDIA' taints kernel.
[   29.264983] Disabling lock debugging due to kernel taint
[   29.346355] nvidia :01:00.0: PCI INT A -> GSI 16 (level, low) -> IRQ 16
[   29.346362] nvidia :01:00.0: setting latency timer to 64
[   29.346366] vgaarb: device changed decodes: PCI::01:00.0,olddecodes=io+mem,decodes=none:owns=io+mem
[   29.346425] NVRM: loading NVIDIA UNIX x86_64 Kernel Module  295.59  Wed Jun  6 21:19:40 PDT 2012
[   29.610240] snd_hda_intel :00:1b.0: PCI INT A -> GSI 22 (level, low) -> IRQ 22
[   29.610287] snd_hda_intel :00:1b.0: irq 48 for MSI/MSI-X
[   29.610308] snd_hda_intel :00:1b.0: setting latency timer to 64
[   29.668854] input: HDA Intel PCH Headphone as /devices/pci:00/:00:1b.0/sound/card0/input6
[   30.548826] EXT4-fs (dm-0): re-mounted. Opts: (null)
[   30.726930] EXT4-fs (dm-0): re-mounted. Opts: errors=panic
[   30.864134] scsi_verify_blk_ioctl: 14 callbacks suppressed
[   30.864136] mdadm: sending ioctl 1261 to a partition!
[   30.864138] mdadm: sending ioctl 1261 to a partition!
[   30.871044] mdadm: sending ioctl 1261 to a partition!
[   30.871055] mdadm: sending ioctl 1261 to a partition!
[   30.879765] mdadm: sending ioctl 1261 to a partition!
[   30.879776] mdadm: sending ioctl 1261 to a partition!
[   30.888659] mdadm: sending ioctl 1261 to a partition!
[   30.888663] mdadm: sending ioctl 1261 to a partition!
[   30.889120] mdadm: sending ioctl 800c0910 to a partition!
[   30.889122] mdadm: sending ioctl 800c0910 to a partition!
[   30.910763] ohci_hcd: USB 1.1 'Open' Host Controller (OHCI) Driver
[   30.912552] uhci_hcd: USB Universal Host Controller Interface driver
[   30.941330] fuse init (API version 7.17)
[   30.957496] ipmi message handler version 39.2
[   30.968837] IPMI System Interface driver.
[   30.968859] ipmi_si: probing via SMBIOS
[   30.968860] ipmi_si: SMBIOS: mem 0x0 regsize 1 spacing 1 irq 0
[   30.968862] ipmi_si: Adding SMBIOS-specified kcs state machine
[   30.968864] ipmi_si: Trying SMBIOS-specified kcs state machine at mem address 0x0, slave address 0x20, irq 0
[   30.968866] ipmi_si: Could not set up I/O space
[   3