Bug#682007: [squeeze-backports] NULL pointer dereference in __fscache_read_or_alloc_pages

2013-07-16 Thread Raoul Bhatia [IPAX]

Hi!


It seems that a couple of changes went into linux 3.8 [1] and 3.11 [2].

Maybe it is worth upgrading to wheezy with a 3.9 backports kernel,
or waiting for a 3.11 release?

Cheers,
Raoul

[1] 
http://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/log/fs/cachefiles?h=linux-3.8.y
[2] 
http://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/log/fs/cachefiles



--
To UNSUBSCRIBE, email to debian-kernel-requ...@lists.debian.org
with a subject of unsubscribe. Trouble? Contact listmas...@lists.debian.org
Archive: http://lists.debian.org/51e54553.5020...@ipax.at



Bug#682007: [squeeze-backports] NULL pointer dereference in __fscache_read_or_alloc_pages

2013-01-29 Thread Raoul Bhatia [IPAX]

On 2012-10-12 20:52, Brian Kroth wrote:

Brian Paul Kroth bpkr...@gmail.com 2012-10-11 14:06:

Jonathan Nieder jrnie...@gmail.com 2012-10-01 01:25:

<snip/>

Once again very sorry for the delay :(

I forgot to disable DEBUG_INFO and kept filling up my build VM's 
disk during compile.  Then I realized I had grabbed the 3.7 rc code, 
which these patches don't apply against.  git checkout 
remotes/stable/linux-3.2.y (results in head 
c74a5e1fe4d0672936c8fb63d7484dfeaa30669c and 3.2.28) seemed to fix 
that.

<snip/>
Anyways, I just started running that on a machine, so I'll let you 
know if I notice anything there first before I think about pushing it 
to further places.


Thanks,
Brian


Got another panic using this kernel/set of patches.  The dump is 
attached.


Let me know if you need anything else.


Hi!

Has there been any progress regarding this issue?

Brian, are you right now using the fsc facility or not?
If yes, which patches / configure options are you currently using
and how often do you see kernel panics?

Are there any workarounds to this issue besides disabling fsc?

Thanks,
Raoul
--

DI (FH) Raoul Bhatia M.Sc.    email.  r.bha...@ipax.at
Technischer Leiter

IPAX - Aloy Bhatia Hava OEG   web.    http://www.ipax.at
Barawitzkagasse 10/2/2/11     email.  off...@ipax.at
1190 Wien                     tel.    +43 1 3670030
FN 277995t HG Wien            fax.    +43 1 3670030 15



Archive: http://lists.debian.org/b669fb912d6ba31617e96f5faa5ad...@ipax.at



Bug#682007: [squeeze-backports] NULL pointer dereference in __fscache_read_or_alloc_pages

2013-01-29 Thread Brian Kroth

Raoul Bhatia [IPAX] r.bha...@ipax.at 2013-01-29 11:01:

On 2012-10-12 20:52, Brian Kroth wrote:

Brian Paul Kroth bpkr...@gmail.com 2012-10-11 14:06:

Jonathan Nieder jrnie...@gmail.com 2012-10-01 01:25:

<snip/>

Once again very sorry for the delay :(

I forgot to disable DEBUG_INFO and kept filling up my build 
VM's disk during compile.  Then I realized I had grabbed the 3.7 rc 
code, which these patches don't apply against.  git checkout 
remotes/stable/linux-3.2.y (results in head 
c74a5e1fe4d0672936c8fb63d7484dfeaa30669c and 3.2.28) seemed to 
fix that.

<snip/>
Anyways, I just started running that on a machine, so I'll let 
you know if I notice anything there first before I think about 
pushing it to further places.


Thanks,
Brian


Got another panic using this kernel/set of patches.  The dump is 
attached.


Let me know if you need anything else.


Hi!


Hello!


Has there been any progress regarding this issue?


Not really.  At least not that I'm aware of.


Brian, are you right now using the fsc facility or not?


Yes, with 54 mounts each on about 100 hosts.


If yes, which patches / configure options are you currently using
and how often do you see kernel panics?


Currently we're running this kernel most places:
ii linux-image-3.2.0-0.bpo.2-amd64 3.2.20-1~bpo60+1 Linux 3.2 for 64-bit PCs

With a few hosts gradually moving over to this:
ii linux-image-3.2.0-0.bpo.4-amd64 3.2.35-2~bpo60+1 Linux 3.2 for 64-bit PCs

And one host running 3.2.28 with the set of patches from here:
http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=682007#47

We've seen the panic on all of those kernels.  Since it's fairly recent, 
I've attached another dump from the bpo.4 3.2.35 kernel's panic.



Frequency and cause are a little difficult to tease out precisely.  These 
are lab machines and the workload may vary quite substantially based on 
what classes and compute jobs (eg: from condor) happen to be running on 
them at any given time.


Recently (as of the students returning a week and a half ago) we've seen 
this on 6 machines.


Before that, I see 37 other events in the last 90 days (our log rotation 
period).  Usually they're clustered together, so they are probably tied 
to a particular job's workload.  Unfortunately, those jobs are usually 
gone by the time I see it.
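
To see how such events cluster, the oops lines in the collected logs can be counted per host and day. The following is only a minimal sketch: it assumes a syslog-style file in which every oops produces one line containing "BUG: unable to handle kernel" (so it counts all oopses of that kind, not only the fscache ones), and the function name and log path are hypothetical.

```shell
#!/bin/sh
# count_oops_per_host_day LOGFILE
# Prints "host month day count" for every host/day that logged an oops.
# Assumption: fields are "Mon DD HH:MM:SS host ...", as in netconsole/syslog output.
count_oops_per_host_day() {
    grep -F 'BUG: unable to handle kernel' "$1" |
        awk '{ count[$4 " " $1 " " $2]++ }
             END { for (k in count) print k, count[k] }' |
        sort
}

# Example (hypothetical path):
#   count_oops_per_host_day /var/log/netconsole/all.log
```

Several hosts showing counts on the same day would be consistent with a shared workload being the trigger, though it does not prove it.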



Are there any workarounds to this issue besides disabling fsc?


Not that I'm aware of.

Let me know if you need anything else.

Thanks,
Brian

Jan 19 02:08:44 tux-116 [120882.927408] BUG: unable to handle kernel NULL pointer dereference at 0040
Jan 19 02:08:44 tux-116 [120882.927421] IP: [a103c5f7] __fscache_read_or_alloc_pages+0x194/0x262 [fscache]
Jan 19 02:08:44 tux-116 [120882.927432] PGD 22120c067 PUD 22157d067 PMD 0
Jan 19 02:08:44 tux-116 [120882.927440] Oops:  [#1] SMP
Jan 19 02:08:44 tux-116 [120882.927446] CPU 0
Jan 19 02:08:44 tux-116 [120882.927449] Modules linked in: btrfs zlib_deflate libcrc32c ufs qnx4 hfsplus hfs minix ntfs vfat msdos fat jfs xfs reiserfs ext2 cpufreq_userspace cpufreq_powersave cpufreq_conservative cpufreq_stats autofs4 cachefiles binfmt_misc kvm_intel kvm nfsd nfs lockd fscache auth_rpcgss nfs_acl sunrpc netconsole configfs ext3 jbd dm_crypt sbs power_supply sbshc adt7475 hwmon_vid ipmi_watchdog ipmi_devintf ipmi_si ipmi_msghandler fuse uhci_hcd ohci_hcd snd_hda_codec_realtek snd_hda_intel snd_hda_codec snd_hwdep snd_pcm_oss snd_mixer_oss tpm_infineon
Jan 19 
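
When comparing dumps like the one above, it can help to pull out just the faulting symbol with its offset and function size. This is a rough sketch rather than an established tool: the function name is made up, and it assumes GNU grep and sed (for -o and -E).

```shell
#!/bin/sh
# oops_symbols LOGFILE
# Prints "symbol offset size" for every symbol+0xOFF/0xSIZE token in LOGFILE.
oops_symbols() {
    grep -oE '[A-Za-z_][A-Za-z0-9_.]*\+0x[0-9a-f]+/0x[0-9a-f]+' "$1" |
        sed -E 's|^(.*)\+(0x[0-9a-f]+)/(0x[0-9a-f]+)$|\1 \2 \3|'
}

# Example (hypothetical file name):
#   oops_symbols panic.log | sort | uniq -c
```

If the fscache module was built with debug info, gdb's `list *(symbol+offset)` can usually map such a pair to a source line.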

Bug#682007: [squeeze-backports] NULL pointer dereference in __fscache_read_or_alloc_pages

2012-07-26 Thread Brian Kroth

Jonathan Nieder jrnie...@gmail.com 2012-07-20 11:25:

merge 682116 682007
quit

Hi,

Ben Hutchings b...@decadent.org.uk 2012-07-19 13:32:

On Thu, 2012-07-19 at 13:37 +0200, Bastian Blank wrote:



We don't support proprietary stuff. Please remove and try again.


To be clear, Bastian is referring to the proprietary kernel module
(nvidia).


I think this stance is too aggressive.  Testing without the modules we
do not support can certainly help, but in cases like this where the
proprietary module is not likely to be related, I'd rather hear about
problems earlier than have submitters wait until they have time to
reproduce without.

Luckily this has been reproduced without the nvidia module, so
merging.

Raoul writes:

This is reproducible using grep -r abc * inside a directory with
   9541 files (no sym- or hardlinks, no block or character special files) in
   1524 directories
(PHP MODX installation)


I downloaded the couple of files from that site [1] and unzipped them to 
hopefully create a similar test setup.  I had to make two copies of it 
to get that many files/dirs.


Right now I'm running this to see what happens:
# for i in {1..100}; do grep -r abc /cae/apps/data/testapp-1/tmp* > /dev/null; done

So far nothing much, but I just started.  
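
A comparable test tree can also be generated instead of downloaded. The sketch below is only an approximation of the reported setup: it creates a flat set of directories rather than the nested layout of a real MODX installation, and the function names and example path are hypothetical. The tree would have to live on the fsc-mounted NFS export for the reads to go through the cache.

```shell
#!/bin/sh
# make_tree ROOT DIRS FILES_PER_DIR
# Creates DIRS directories under ROOT, each holding FILES_PER_DIR small files.
make_tree() {
    root=$1; dirs=$2; per_dir=$3
    d=1
    while [ "$d" -le "$dirs" ]; do
        mkdir -p "$root/dir$d"
        f=1
        while [ "$f" -le "$per_dir" ]; do
            printf 'file %s in dir %s\n' "$f" "$d" > "$root/dir$d/file$f.txt"
            f=$((f + 1))
        done
        d=$((d + 1))
    done
}

# repeat_grep ROOT PASSES
# Reads every file under ROOT, PASSES times, discarding the output.
repeat_grep() {
    root=$1; passes=$2
    i=1
    while [ "$i" -le "$passes" ]; do
        # grep exits 1 when nothing matches; that is expected here.
        grep -r abc "$root" > /dev/null || true
        i=$((i + 1))
    done
}

# Example (hypothetical path): 1524 directories x 6 files = 9144 files
#   make_tree /cae/apps/data/testapp-1/tmptree 1524 6
#   repeat_grep /cae/apps/data/testapp-1/tmptree 100
```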


Some other points for comparison:
- does the cache need to be fresh?  I have a cron job that does this 
  from time to time (about once a month with some random splay between 
  machines) on these machines anyways (basically stop cachefilesd && rm 
  -rf the_cache_dir_contents && start cachefilesd)

- anything else in particular about the test I should look for?
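
The cache reset mentioned in the first point could look roughly like this. It is a sketch under assumptions, not the actual cron job: the init script path and the cache directory are guesses (the real directory is whatever `dir` is set to in cachefilesd.conf), and the script defaults to a dry run that only prints what it would do.

```shell
#!/bin/sh
# Assumed defaults; override them in the environment as needed.
CACHE_DIR=${CACHE_DIR:-/var/cache/fscache}
INIT_SCRIPT=${INIT_SCRIPT:-/etc/init.d/cachefilesd}
DRY_RUN=${DRY_RUN:-1}

# run CMD...: print the command in dry-run mode, execute it otherwise.
run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

# flush_cache: stop the daemon, empty the cache directory, start the daemon.
flush_cache() {
    [ -d "$CACHE_DIR" ] || { echo "no such directory: $CACHE_DIR" >&2; return 1; }
    run "$INIT_SCRIPT" stop || return 1
    # Remove the contents but keep the directory itself (GNU find).
    run find "$CACHE_DIR" -mindepth 1 -delete || return 1
    run "$INIT_SCRIPT" start
}

# Example, as root:
#   DRY_RUN=0 flush_cache
```

Stopping at the first failed step avoids deleting cache files underneath a daemon that is still running.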


I've also noticed that I can usually get cachefilesd to spin to 100% cpu 
if I do something like this:


# grep pattern /home/logs/some_multi_gb_large_readonly_logfile

I recall seeing patches for large file support, but wasn't sure on their 
status.  Anyways, that's digressing, so I'll leave that as a separate 
item for later.


Thanks,
Brian

[1] http://modx.com/download/


signature.asc
Description: Digital signature


Bug#682007: [squeeze-backports] NULL pointer dereference in __fscache_read_or_alloc_pages

2012-07-26 Thread Brian Kroth

Jonathan Nieder jrnie...@gmail.com 2012-07-21 12:04:

tags 682007 + upstream patch moreinfo
quit

Hi,

Brian Kroth wrote:


kernel BUG at 
/build/buildd-linux_3.2.20-1~bpo60+1-amd64-tQMw4f/linux-3.2.20/fs/buffer.c:3088!


This is

BUG_ON(!PagePrivate(page)); \

in

static int
drop_buffers(struct page *page, struct buffer_head **buffers_to_free)
{
struct buffer_head *head = page_buffers(page);

I suspect it's the same underlying problem, but maybe not.

Please test the attached patches, for example following the instructions
below:

0. prerequisites:

apt-get install git build-essential

1. get the kernel history, if you don't already have it:

git clone \
  git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git


I take it then that this is a patch against the latest greatest kernel, 
not against the source for the backports kernel I'm currently running?



2. configure, build, test:

cd linux
git fetch origin
git checkout origin/master

cp /boot/config-$(uname -r) .config; # current configuration
scripts/config --disable DEBUG_INFO
make localmodconfig; # optional: minimize configuration
make deb-pkg; # optionally with -j<num> for parallel build
dpkg -i ../<name of package>; # as root
reboot
... test test test ...

  Hopefully it reproduces the bug.  So


Oh, I see you want to compare two nearly identical kernels (both fairly 
recent) to isolate if just the patches are helpful rather than some mix 
between the two.



3. try the patches:

cd linux
git am -3sc $(ls -1 /path/to/patches/[01]*)
make deb-pkg; # maybe with -j4
dpkg -i ../<name of package>; # as root
reboot
... test test test ...
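
Because both builds start from the running kernel's configuration, it may be worth confirming that DEBUG_INFO really is off before each `make deb-pkg`; with it enabled the build needs far more disk space. A small sketch, with a hypothetical function name:

```shell
#!/bin/sh
# debug_info_enabled CONFIG
# Succeeds if CONFIG_DEBUG_INFO=y is set in the given kernel config file.
debug_info_enabled() {
    grep -q '^CONFIG_DEBUG_INFO=y$' "$1"
}

# Example, run from the top of the kernel tree:
#   if debug_info_enabled .config; then
#       scripts/config --disable DEBUG_INFO
#   fi
```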

Thanks and hope that helps,
Jonathan


I can try and build/install this on one of our machines (I'd prefer not 
to push it everywhere yet), but without a reliably reproducible 
(on-demand that is) test case I'm not sure what it will show except that 
it perhaps doesn't introduce further blatant bugs.


Anyways, I'll wait on the results of my previous test first to see if we 
have a reliable test case from it before moving forward.


At this point the grep -r abc ... test is just hitting the cache over 
and over again, so it's not showing a whole lot.


One other thing I'd tried before was something like this run a couple of 
times in a row (hmm, I suppose I could try them in parallel too):


find /fsc_mounted_nfs -type f -exec cat {} > /dev/null \;

A couple of them panicked, but with inconsistent messages, so I had left 
them out for now.  Perhaps that's worth another shot ...
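
The parallel variant mentioned above might be sketched as follows. Whether concurrent readers make the panic more likely is untested here, and the function name is hypothetical.

```shell
#!/bin/sh
# read_all_parallel ROOT NJOBS
# Starts NJOBS concurrent passes that each read every regular file under ROOT,
# then waits for all of them to finish.
read_all_parallel() {
    root=$1; njobs=$2
    i=1
    while [ "$i" -le "$njobs" ]; do
        find "$root" -type f -exec cat {} \; > /dev/null &
        i=$((i + 1))
    done
    wait
}

# Example:
#   read_all_parallel /fsc_mounted_nfs 4
```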


Thanks,
Brian




Bug#682007: [squeeze-backports] NULL pointer dereference in __fscache_read_or_alloc_pages

2012-07-20 Thread Jonathan Nieder
merge 682116 682007
quit

Hi,

Ben Hutchings b...@decadent.org.uk 2012-07-19 13:32:
 On Thu, 2012-07-19 at 13:37 +0200, Bastian Blank wrote:

 We don't support proprietary stuff. Please remove and try again.

 To be clear, Bastian is referring to the proprietary kernel module
 (nvidia).

I think this stance is too aggressive.  Testing without the modules we
do not support can certainly help, but in cases like this where the
proprietary module is not likely to be related, I'd rather hear about
problems earlier than have submitters wait until they have time to
reproduce without.

Luckily this has been reproduced without the nvidia module, so
merging.

Raoul writes:

This is reproducible using grep -r abc * inside a directory with
   9541 files (no sym- or hardlinks, no block or character special files) in
   1524 directories
(PHP MODX installation)

Brian, is it reproducible for you, too?  What is the newest kernel you
tried that did not have this problem?

Thanks, all.
Jonathan


Archive: http://lists.debian.org/20120720162557.GB2885@burratino



Bug#682007: [squeeze-backports] NULL pointer dereference in __fscache_read_or_alloc_pages

2012-07-20 Thread Brian Kroth

Jonathan Nieder jrnie...@gmail.com 2012-07-20 11:25:

merge 682116 682007
quit

Hi,

Ben Hutchings b...@decadent.org.uk 2012-07-19 13:32:

On Thu, 2012-07-19 at 13:37 +0200, Bastian Blank wrote:



We don't support proprietary stuff. Please remove and try again.


To be clear, Bastian is referring to the proprietary kernel module
(nvidia).


I think this stance is too aggressive.  Testing without the modules we
do not support can certainly help, but in cases like this where the
proprietary module is not likely to be related, I'd rather hear about
problems earlier than have submitters wait until they have time to
reproduce without.

Luckily this has been reproduced without the nvidia module, so
merging.

Raoul writes:

This is reproducible using grep -r abc * inside a directory with
   9541 files (no sym- or hardlinks, no block or character special files) in
   1524 directories
(PHP MODX installation)

Brian, is it reproducible for you, too?  


Thanks for the test.  Unfortunately, I'm on my way out of town for a 
couple of days and didn't have time to look at this closely.  I'll get 
back to you with some proper results on this sometime next week.



What is the newest kernel you tried that did not have this problem?


I don't recall having this level of problems with 3.2.1-2~bpo60+1, but 
maybe it was just less frequent and I didn't notice it.


Cheers,
Brian

