Bug#682007: [squeeze-backports] NULL pointer dereference in __fscache_read_or_alloc_pages
Hi!

It seems that a couple of cachefiles changes went into Linux 3.8 [1] and 3.11 [2]. Maybe it is worth upgrading to wheezy with a 3.9 backports kernel, or waiting for a 3.11 release?

Cheers,
Raoul

[1] http://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/log/fs/cachefiles?h=linux-3.8.y
[2] http://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/log/fs/cachefiles
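To decide whether a given host already runs a kernel at or past one of the releases mentioned above, a version comparison along these lines may help. This is only a sketch: the function name version_at_least is made up for illustration, it assumes GNU sort with -V, and a newer version number alone does not prove that the relevant fixes are present.

```shell
#!/bin/sh
# Hypothetical helper (not from the thread): succeeds when version $1 is
# greater than or equal to version $2, using GNU sort's version ordering.
version_at_least() {
    # The smaller of the two versions sorts first; if that is $2, then $1 >= $2.
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}
```

On a live system this could be called as version_at_least "$(uname -r | cut -d- -f1)" 3.8, which strips the Debian ABI suffix before comparing.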
Bug#682007: [squeeze-backports] NULL pointer dereference in __fscache_read_or_alloc_pages
On 2012-10-12 20:52, Brian Kroth wrote:
> Brian Paul Kroth bpkr...@gmail.com 2012-10-11 14:06:
>> Jonathan Nieder jrnie...@gmail.com 2012-10-01 01:25:
>>> snip/
>>
>> Once again very sorry for the delay :(
>>
>> I forgot to disable the DEBUG_INFO and kept filling up my build VM's disk during compile. Then realized I had grabbed the 3.7 rc code, which these patches don't apply against.
>>
>> git checkout remotes/stable/linux-3.2.y
>>
>> (results in head c74a5e1fe4d0672936c8fb63d7484dfeaa30669c and 3.2.28), seemed to fix that.
>>
>> snip/
>>
>> Anyways, I just started running that on a machine, so I'll let you know if I noticed anything there first before I think about pushing it to further places.
>>
>> Thanks,
>> Brian
>
> Got another panic using this kernel/set of patches. The dump is attached.
>
> Let me know if you need anything else.

Hi!

Has there been any progress regarding this issue?

Brian, are you right now using the fsc facility or not? If yes, which patches / configure options are you currently using and how often do you see kernel panics?

Are there any workarounds to this issue besides disabling fsc?

Thanks,
Raoul
--
DI (FH) Raoul Bhatia M.Sc.          email. r.bha...@ipax.at
Technischer Leiter
IPAX - Aloy Bhatia Hava OEG         web. http://www.ipax.at
Barawitzkagasse 10/2/2/11           email. off...@ipax.at
1190 Wien                           tel. +43 1 3670030
FN 277995t HG Wien                  fax. +43 1 3670030 15
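Since the only workaround discussed so far is to stop using fsc, it can be useful to see which NFS mounts on a host actually have the option set. The sketch below assumes that the mount table lists "fsc" among the options of such mounts; the function name fsc_mounts is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper (not from the thread): print the mount points of NFS
# mounts whose option list contains "fsc". Reads a file in /proc/mounts
# format: device mountpoint fstype options dump pass.
fsc_mounts() {
    mounts_file="${1:-/proc/mounts}"
    awk '$3 ~ /^nfs4?$/ {
             n = split($4, opts, ",")
             for (i = 1; i <= n; i++)
                 if (opts[i] == "fsc") { print $2; break }
         }' "$mounts_file"
}
```

Running fsc_mounts | wc -l would then give the number of fsc-enabled mounts on that host.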
Bug#682007: [squeeze-backports] NULL pointer dereference in __fscache_read_or_alloc_pages
Raoul Bhatia [IPAX] r.bha...@ipax.at 2013-01-29 11:01:
> On 2012-10-12 20:52, Brian Kroth wrote:
>> Brian Paul Kroth bpkr...@gmail.com 2012-10-11 14:06:
>>> Jonathan Nieder jrnie...@gmail.com 2012-10-01 01:25:
>>>> snip/
>>>
>>> Once again very sorry for the delay :(
>>>
>>> I forgot to disable the DEBUG_INFO and kept filling up my build VM's disk during compile. Then realized I had grabbed the 3.7 rc code, which these patches don't apply against.
>>>
>>> git checkout remotes/stable/linux-3.2.y
>>>
>>> (results in head c74a5e1fe4d0672936c8fb63d7484dfeaa30669c and 3.2.28), seemed to fix that.
>>>
>>> snip/
>>>
>>> Anyways, I just started running that on a machine, so I'll let you know if I noticed anything there first before I think about pushing it to further places.
>>>
>>> Thanks,
>>> Brian
>>
>> Got another panic using this kernel/set of patches. The dump is attached.
>>
>> Let me know if you need anything else.
>
> Hi!

Hello!

> Has there been any progress regarding this issue?

Not really. At least not that I'm aware of.

> Brian, are you right now using the fsc facility or not?

Yes, with 54 mounts each on about 100 hosts.

> If yes, which patches / configure options are you currently using and how often do you see kernel panics?

Currently we're running this kernel most places:

ii linux-image-3.2.0-0.bpo.2-amd64 3.2.20-1~bpo60+1 Linux 3.2 for 64-bit PCs

With a few hosts gradually moving over to this:

ii linux-image-3.2.0-0.bpo.4-amd64 3.2.35-2~bpo60+1 Linux 3.2 for 64-bit PCs

And one host running 3.2.28 with the set of patches from here:
http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=682007#47

We've seen the panic on all of those kernels. Since it's fairly recent, I've attached another dump from the bpo.4 3.2.35 kernel's panic.

Frequency and cause are a little difficult to tease out precisely. These are lab machines and the workload may vary quite substantially based on what classes and compute jobs (e.g. from condor) happen to be running on them at any given time.

Recently (as of the students returning a week and a half ago) we've seen this on 6 machines.
Past that, I see 37 other events in the last 90 days (our log rotation period). Usually they're clustered together, so probably tied to a particular job's workload. Unfortunately, those jobs are usually gone by the time I see it.

> Are there any workarounds to this issue besides disabling fsc?

Not that I'm aware of.

Let me know if you need anything else.

Thanks,
Brian

Jan 19 02:08:44 tux-116 [120882.927408] BUG: unable to handle kernel NULL pointer dereference at 0040
Jan 19 02:08:44 tux-116 [120882.927421] IP: [a103c5f7] __fscache_read_or_alloc_pages+0x194/0x262 [fscache]
Jan 19 02:08:44 tux-116 [120882.927432] PGD 22120c067 PUD 22157d067 PMD 0
Jan 19 02:08:44 tux-116 [120882.927440] Oops: [#1] SMP
Jan 19 02:08:44 tux-116 [120882.927446] CPU 0
Jan 19 02:08:44 tux-116 [120882.927449] Modules linked in: btrfs zlib_deflate libcrc32c ufs qnx4 hfsplus hfs minix ntfs vfat msdos fat jfs xfs reiserfs ext2 cpufreq_userspace cpufreq_powersave cpufreq_conservative cpufreq_stats autofs4 cachefiles binfmt_misc kvm_intel kvm nfsd nfs lockd fscache auth_rpcgss nfs_acl sunrpc netconsole configfs ext3 jbd dm_crypt sbs power_supply sbshc adt7475 hwmon_vid ipmi_watchdog ipmi_devintf ipmi_si ipmi_msghandler fuse uhci_hcd ohci_hcd snd_hda_codec_realtek snd_hda_intel snd_hda_codec snd_hwdep snd_pcm_oss snd_mixer_oss tpm_infineon
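Counting events like the ones described above across a collected syslog can be sketched as follows. It assumes the usual syslog layout (month, day, time, host, message) as seen in the dump, and it counts every "BUG: unable to handle kernel" line, not only those that end up in fscache. The function name count_oopses is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper (not from the thread): count kernel oops headers per
# host in a syslog-format file. The fourth field is taken to be the host name.
count_oopses() {
    awk '/BUG: unable to handle kernel/ { count[$4]++ }
         END { for (h in count) print h, count[h] }' "$1" | sort
}
```

Because hosts that panic in clusters show up with high counts, the output gives a quick view of which machines to look at first.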
Bug#682007: [squeeze-backports] NULL pointer dereference in __fscache_read_or_alloc_pages
Jonathan Nieder jrnie...@gmail.com 2012-07-20 11:25:
> merge 682116 682007
> quit
>
> Hi,
>
> Ben Hutchings b...@decadent.org.uk 2012-07-19 13:32:
>> On Thu, 2012-07-19 at 13:37 +0200, Bastian Blank wrote:
>>> We don't support proprietary stuff. Please remove and try again.
>>
>> To be clear, Bastian is referring to the proprietary kernel module (nvidia).
>
> I think this stance is too aggressive. Testing without the modules we do not support can certainly help, but in cases like this where the proprietary module is not likely to be related, I'd rather hear about problems earlier than have submitters wait until they have time to reproduce without.
>
> Luckily this has been reproduced without the nvidia module, so merging.
>
> Raoul writes:
>> This is reproducible using
>>
>> grep -r abc *
>>
>> inside a directory with 9541 files (no sym- or hardlinks, no block or character special files) in 1524 directories (PHP MODX installation)

I downloaded a couple of files from that site [1] and unzipped them to hopefully create a similar test setup. I had to make two copies of it to get that many files/dirs.

Right now I'm running this to see what happens:

# for i in {1..100}; do grep -r abc /cae/apps/data/testapp-1/tmp* > /dev/null; done

So far nothing much, but I just started.

Some other points for comparison:
- does the cache need to be fresh? I have a cron job that does this from time to time (about once a month with some random splay between machines) on these machines anyways (basically: stop cachefilesd, rm -rf the_cache_dir_contents, start cachefilesd)
- anything else in particular about the test I should look for?

I've also noticed that I can usually get cachefilesd to spin to 100% cpu if I do something like this:

# grep pattern /home/logs/some_multi_gb_large_readonly_logfile

I recall seeing patches for large file support, but wasn't sure on their status. Anyways, that's digressing, so I'll leave that as a separate item for later.

Thanks,
Brian

[1] http://modx.com/download/
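The grep loop from this message can be wrapped in a small reusable function, roughly as sketched here. The function name repro_grep_loop and its arguments are made up for illustration; whether it triggers the oops at all depends on the workload, as the rest of the thread shows.

```shell
#!/bin/sh
# Hypothetical helper (not from the thread): run a recursive grep over a
# directory a number of times, discarding the output.
repro_grep_loop() {
    dir="$1"
    iterations="${2:-100}"
    i=1
    while [ "$i" -le "$iterations" ]; do
        # grep exits 1 when nothing matches; only the read traffic matters here.
        grep -r abc "$dir" > /dev/null 2>&1 || true
        i=$((i + 1))
    done
    echo "completed $((i - 1)) passes over $dir"
}
```

For example, repro_grep_loop /cae/apps/data/testapp-1 100 would correspond to the loop above, using the path from the message.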
Bug#682007: [squeeze-backports] NULL pointer dereference in __fscache_read_or_alloc_pages
Jonathan Nieder jrnie...@gmail.com 2012-07-21 12:04:
> tags 682007 + upstream patch moreinfo
> quit
>
> Hi,
>
> Brian Kroth wrote:
>> kernel BUG at /build/buildd-linux_3.2.20-1~bpo60+1-amd64-tQMw4f/linux-3.2.20/fs/buffer.c:3088!
>
> This is
>
>     BUG_ON(!PagePrivate(page)); \
>
> in
>
>     static int drop_buffers(struct page *page, struct buffer_head **buffers_to_free)
>     {
>         struct buffer_head *head = page_buffers(page);
>
> I suspect it's the same underlying problem, but maybe not.
>
> Please test the attached patches, for example following the instructions below:
>
> 0. prerequisites:
>
>     apt-get install git build-essential
>
> 1. get the kernel history, if you don't already have it:
>
>     git clone \
>       git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

I take it then that this is a patch against the latest greatest kernel, not against the source for the backports kernel I'm currently running?

> 2. configure, build, test:
>
>     cd linux
>     git fetch origin
>     git checkout origin/master
>     cp /boot/config-$(uname -r) .config; # current configuration
>     scripts/config --disable DEBUG_INFO
>     make localmodconfig; # optional: minimize configuration
>     make deb-pkg; # optionally with -j<num> for parallel build
>     dpkg -i ../<name of package>; # as root
>     reboot
>     ... test test test ...
>
> Hopefully it reproduces the bug. So

Oh, I see you want to compare two nearly identical kernels (both fairly recent) to isolate if just the patches are helpful rather than some mix between the two.

> 3. try the patches:
>
>     cd linux
>     git am -3sc $(ls -1 /path/to/patches/[01]*)
>     make deb-pkg; # maybe with -j4
>     dpkg -i ../<name of package>; # as root
>     reboot
>     ... test test test ...
>
> Thanks and hope that helps,
> Jonathan

I can try and build/install this on one of our machines (I'd prefer not to push it everywhere yet), but without a reliably reproducible (on-demand that is) test case I'm not sure what it will show except that it perhaps doesn't introduce further blatant bugs.
Anyways, I'll wait on the results of my previous test first to see if we have a reliable test case from it before moving forward. At this point the grep -r abc ... test is just hitting the cache over and over again, so it's not showing a whole lot.

One other thing I'd tried before was something like this, run a couple of times in a row (hmm, I suppose I could try them in parallel too):

find /fsc_mounted_nfs -type f -exec cat {} > /dev/null \;

A couple of them panicked, but with inconsistent messages, so I had left them out for now. Perhaps that's worth another shot ...

Thanks,
Brian
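The parallel variant mentioned in passing could look roughly like this. It is a sketch only: the function name read_all_files and the batch size of 16 are arbitrary choices, and it relies on find -print0 and xargs -0 -r -P as provided by GNU findutils.

```shell
#!/bin/sh
# Hypothetical helper (not from the thread): read every regular file under a
# directory with several cat processes in parallel, discarding the data.
read_all_files() {
    dir="$1"
    jobs="${2:-4}"
    # NUL-separated names keep paths with spaces or newlines intact;
    # -r skips running cat at all when no files are found.
    find "$dir" -type f -print0 | xargs -0 -r -P "$jobs" -n 16 cat > /dev/null
}
```

Calling read_all_files /fsc_mounted_nfs 8 would read the whole tree with up to eight concurrent readers, which puts more simultaneous load on the cache than the sequential find above.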
Bug#682007: [squeeze-backports] NULL pointer dereference in __fscache_read_or_alloc_pages
merge 682116 682007
quit

Hi,

Ben Hutchings b...@decadent.org.uk 2012-07-19 13:32:
> On Thu, 2012-07-19 at 13:37 +0200, Bastian Blank wrote:
>> We don't support proprietary stuff. Please remove and try again.
>
> To be clear, Bastian is referring to the proprietary kernel module (nvidia).

I think this stance is too aggressive. Testing without the modules we do not support can certainly help, but in cases like this where the proprietary module is not likely to be related, I'd rather hear about problems earlier than have submitters wait until they have time to reproduce without.

Luckily this has been reproduced without the nvidia module, so merging.

Raoul writes:
> This is reproducible using
>
> grep -r abc *
>
> inside a directory with 9541 files (no sym- or hardlinks, no block or character special files) in 1524 directories (PHP MODX installation)

Brian, is it reproducible for you, too? What is the newest kernel you tried that did not have this problem?

Thanks, all.
Jonathan
Bug#682007: [squeeze-backports] NULL pointer dereference in __fscache_read_or_alloc_pages
Jonathan Nieder jrnie...@gmail.com 2012-07-20 11:25:
> merge 682116 682007
> quit
>
> Hi,
>
> Ben Hutchings b...@decadent.org.uk 2012-07-19 13:32:
>> On Thu, 2012-07-19 at 13:37 +0200, Bastian Blank wrote:
>>> We don't support proprietary stuff. Please remove and try again.
>>
>> To be clear, Bastian is referring to the proprietary kernel module (nvidia).
>
> I think this stance is too aggressive. Testing without the modules we do not support can certainly help, but in cases like this where the proprietary module is not likely to be related, I'd rather hear about problems earlier than have submitters wait until they have time to reproduce without.
>
> Luckily this has been reproduced without the nvidia module, so merging.
>
> Raoul writes:
>> This is reproducible using
>>
>> grep -r abc *
>>
>> inside a directory with 9541 files (no sym- or hardlinks, no block or character special files) in 1524 directories (PHP MODX installation)
>
> Brian, is it reproducible for you, too?

Thanks for the test. Unfortunately, I'm on my way out of town for a couple of days and didn't have time to look at this closely. I'll get back to you with some proper results on this sometime next week.

> What is the newest kernel you tried that did not have this problem?

I don't recall having this level of problems with 3.2.1-2~bpo60+1, but maybe it was just less frequent and I didn't notice it.

Cheers,
Brian