As I am in a situation similar to Jason's, and the OpenAFS developers want feedback, here are my recent results.

I am attempting to use both the pre-built Fedora Core 3 RPMs of OpenAFS 1.3.76 and my own build to set up new servers here at UCD. I'm running the latest stock FC3 kernel (Linux correct 2.6.9-1.681_FC3 #1 Thu Nov 18 15:10:10 EST 2004 i686 i686 i386 GNU/Linux). The systems are all P3s with big disks and moderate amounts of memory (512MB).

My previous mails to this list focused on bug fixes for various segfaults in AFS binaries. Those issues appear to be fixed in recent releases, but I am still running into other segfaults.

In particular, I just went through the full install process again and got as far as checking my mounts for my cell's root before experiencing my first segfault:

correct:~# fs examine /afs/
File /afs/ (536870912.1.1) contained in volume 536870912
Volume status for vid = 536870912 named root.afs
Current disk quota is 5000
Current blocks used are 4
The partition has 27299690 blocks available out of 27377624

correct:~# fs examine /afs/cs.ucd.ie
Segmentation fault

An strace reveals that the segfault happens during an ioctl call:
...
open("/root/.AFSSERVER", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/.AFSSERVER", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/proc/fs/openafs/afs_ioctl", O_RDWR) = 3
ioctl(3, CAPI_REGISTER or SNDCTL_COPR_LOAD <unfinished ...>
+++ killed by SIGSEGV +++


And /var/log/messages contains the following kernel fault:
Dec 21 15:02:32 correct kernel: <1>Unable to handle kernel NULL pointer dereference at virtual address 00000018
Dec 21 15:02:32 correct kernel: printing eip:
Dec 21 15:02:32 correct kernel: 021c55bc
Dec 21 15:02:32 correct kernel: *pde = 00000000
Dec 21 15:02:32 correct kernel: Oops: 0000 [#6]
Dec 21 15:02:32 correct kernel: Modules linked in: libafs(U) md5 ipv6 iptable_filter ip_tables dm_mod button battery ac uhci_hcd snd_intel8x0 snd_ac97_codec snd_pcm_oss snd_mixer_oss snd_pcm snd_timer snd_page_alloc gameport snd_mpu401_uart snd_rawmidi snd_seq_device snd soundcore 3c59x floppy ext3 jbd
Dec 21 15:02:32 correct kernel: CPU: 0
Dec 21 15:02:32 correct kernel: EIP: 0060:[<021c55bc>] Tainted: P VLI
Dec 21 15:02:32 correct kernel: EFLAGS: 00010246 (2.6.9-1.681_FC3)
Dec 21 15:02:32 correct kernel: EIP is at inode_has_perm+0x38/0x54
Dec 21 15:02:32 correct kernel: eax: 00000000 ebx: 1f840db8 ecx: 00000000 edx: 00000000
Dec 21 15:02:32 correct kernel: esi: 22be9000 edi: 1f840df0 ebp: 04482d60 esp: 1f840db4
Dec 21 15:02:32 correct kernel: ds: 007b es: 007b ss: 0068
Dec 21 15:02:32 correct kernel: Process fs (pid: 2847, threadinfo=1f840000 task=05a3f330)
Dec 21 15:02:32 correct kernel: Stack: 00100000 00000001 00000000 00000000 00000000 22be9000 00000000 00000000
Dec 21 15:02:32 correct kernel: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 05a36090
Dec 21 15:02:32 correct kernel: 1f840dfc 0235f460 00000000 22be9000 00000001 021c71a5 00000000 0235e720
Dec 21 15:02:32 correct kernel: Call Trace:
Dec 21 15:02:32 correct kernel: [<021c71a5>] selinux_inode_permission+0x9d/0xa2
Dec 21 15:02:32 correct kernel: [<021750ce>] permission+0x41/0x46
Dec 21 15:02:32 correct kernel: [<0217595b>] link_path_walk+0x120/0x1009
Dec 21 15:02:32 correct kernel: [<0215ef01>] copy_str_fromuser_size+0x3d/0x56
Dec 21 15:02:32 correct kernel: [<02176abf>] path_lookup+0xff/0x12f
Dec 21 15:02:32 correct kernel: [<02176c03>] __user_walk+0x21/0x51
Dec 21 15:02:32 correct kernel: [<22bb9abd>] osi_lookupname+0x21/0x77 [libafs]
Dec 21 15:02:32 correct kernel: [<22bc28d6>] afs_syscall_pioctl+0x86/0xf8 [libafs]
Dec 21 15:02:32 correct kernel: [<22bbfe07>] afs_syscall+0x16d/0x2b5 [libafs]
Dec 21 15:02:32 correct kernel: [<0215222e>] follow_page_pte+0xec/0xfd
Dec 21 15:02:32 correct kernel: [<22bba47e>] afs_ioctl+0x41/0x4f [libafs]
Dec 21 15:02:32 correct kernel: [<0217a4f6>] file_ioctl+0xf2/0x105
Dec 21 15:02:32 correct kernel: [<0217a77f>] sys_ioctl+0x276/0x337
Dec 21 15:02:32 correct kernel: Code: <3>Debug: sleeping function called from invalid context at include/linux/rwsem.h:43
Dec 21 15:02:32 correct kernel: in_atomic():0[expected: 0], irqs_disabled():1
Dec 21 15:02:32 correct kernel: [<0211cbcb>] __might_sleep+0x7d/0x8a
Dec 21 15:02:32 correct kernel: [<0215e726>] rw_vm+0x20e/0x47a
Dec 21 15:02:32 correct kernel: [<021c5591>] inode_has_perm+0xd/0x54
Dec 21 15:02:32 correct kernel: [<021c5591>] inode_has_perm+0xd/0x54
Dec 21 15:02:32 correct kernel: [<0215ee70>] get_user_size+0x30/0x57
Dec 21 15:02:32 correct kernel: [<021c5591>] inode_has_perm+0xd/0x54
Dec 21 15:02:32 correct kernel: [<0210682b>] show_registers+0x109/0x15e
Dec 21 15:02:32 correct kernel: [<02106a2f>] die+0x14a/0x241
Dec 21 15:02:32 correct kernel: [<0211937e>] do_page_fault+0x0/0x511
Dec 21 15:02:32 correct kernel: [<0211937e>] do_page_fault+0x0/0x511
Dec 21 15:02:32 correct kernel: [<02119733>] do_page_fault+0x3b5/0x511
Dec 21 15:02:32 correct kernel: [<021c55bc>] inode_has_perm+0x38/0x54
Dec 21 15:02:32 correct kernel: [<021c3f3a>] avc_has_perm_noaudit+0x8d/0xda
Dec 21 15:02:32 correct kernel: [<02185f20>] update_atime+0x4d/0x90
Dec 21 15:02:32 correct kernel: [<021c3f3a>] avc_has_perm_noaudit+0x8d/0xda
Dec 21 15:02:32 correct kernel: [<22ba27f5>] afs_GetVolume+0x19/0x51 [libafs]
Dec 21 15:02:32 correct kernel: [<22b9512b>] afs_CopyOutAttrs+0x1df/0x1e5 [libafs]
Dec 21 15:02:32 correct kernel: [<0211937e>] do_page_fault+0x0/0x511
Dec 21 15:02:32 correct kernel: [<021c55bc>] inode_has_perm+0x38/0x54
Dec 21 15:02:32 correct kernel: [<021c71a5>] selinux_inode_permission+0x9d/0xa2
Dec 21 15:02:32 correct kernel: [<021750ce>] permission+0x41/0x46
Dec 21 15:02:32 correct kernel: [<0217595b>] link_path_walk+0x120/0x1009
Dec 21 15:02:32 correct kernel: [<0215ef01>] copy_str_fromuser_size+0x3d/0x56
Dec 21 15:02:32 correct kernel: [<02176abf>] path_lookup+0xff/0x12f
Dec 21 15:02:32 correct kernel: [<02176c03>] __user_walk+0x21/0x51
Dec 21 15:02:32 correct kernel: [<22bb9abd>] osi_lookupname+0x21/0x77 [libafs]
Dec 21 15:02:32 correct kernel: [<22bc28d6>] afs_syscall_pioctl+0x86/0xf8 [libafs]
Dec 21 15:02:32 correct kernel: [<22bbfe07>] afs_syscall+0x16d/0x2b5 [libafs]
Dec 21 15:02:32 correct kernel: [<0215222e>] follow_page_pte+0xec/0xfd
Dec 21 15:02:32 correct kernel: [<22bba47e>] afs_ioctl+0x41/0x4f [libafs]
Dec 21 15:02:32 correct kernel: [<0217a4f6>] file_ioctl+0xf2/0x105
Dec 21 15:02:32 correct kernel: [<0217a77f>] sys_ioctl+0x276/0x337
Dec 21 15:02:32 correct kernel: Bad EIP value.
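
For anyone who wants to compare traces like the one above, here is a minimal Python sketch that pulls the call-trace frames out of syslog lines and counts how many belong to each module. The regular expression and the helper names (parse_frames, frames_per_module) are my own illustration and not part of OpenAFS or any kernel tool; the sample lines are copied from the log above.

```python
import re
from collections import Counter

# Matches frames such as "[<22bb9abd>] osi_lookupname+0x21/0x77 [libafs]".
# The trailing "[module]" tag is optional; untagged frames are core kernel code.
FRAME_RE = re.compile(
    r"\[<(?P<addr>[0-9a-f]+)>\]\s+(?P<sym>[\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+"
    r"(?:\s+\[(?P<mod>\w+)\])?"
)


def parse_frames(lines):
    """Return (address, symbol, module) tuples for every call-trace frame found."""
    frames = []
    for line in lines:
        m = FRAME_RE.search(line)
        if m:
            frames.append((m.group("addr"), m.group("sym"), m.group("mod") or "kernel"))
    return frames


def frames_per_module(frames):
    """Count how many frames belong to each module."""
    return Counter(mod for _, _, mod in frames)


# Sample lines taken from the oops above; the last one is not a frame.
sample = [
    "Dec 21 15:02:32 correct kernel: [<02176c03>] __user_walk+0x21/0x51",
    "Dec 21 15:02:32 correct kernel: [<22bb9abd>] osi_lookupname+0x21/0x77 [libafs]",
    "Dec 21 15:02:32 correct kernel: [<22bc28d6>] afs_syscall_pioctl+0x86/0xf8 [libafs]",
    "Dec 21 15:02:32 correct kernel: Bad EIP value.",
]

frames = parse_frames(sample)
for addr, sym, mod in frames:
    print(f"{addr}  {mod:<8} {sym}")
print(dict(frames_per_module(frames)))
```

Feeding it the relevant part of /var/log/messages instead of the sample list should give a quick view of where libafs hands control to the core kernel (here, osi_lookupname calling __user_walk).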


On 10 Dec, 2004, at 21:37, Jason McCormick wrote:

--On Monday, December 06, 2004 07:48:52 PM +0100 Jeffrey Altman
<[EMAIL PROTECTED]> wrote:

The thing which is preventing the release of 1.3.7x as a stable 1.4
tree is lack of deployment and testing by users.  There has been very
little feedback, either positive or negative, on the existing releases.
Without this feedback it is very difficult for us to know whether or
not it is ready.

I'd been holding back our feedback because 1.3.75 was imminent and some
of the fixes listed we thought might fix our problems. We've done testing
with 1.3.74 and 1.3.75. The clients are all Fedora Core 3 w/ patched
kernels to provide sys_call_table[]. We are experiencing the following
problems:



* Inability to unmount /usr/vice/cache (or / if it's not a separate
partition). This is 100% repeatable on all FC3 machines. The following
steps will always create this problem:


      - Stop all processes and logout all users of AFS
      - Stop all AFS processes and unload libafs kernel module
      - lsof | grep -i afs reports nothing open
      - umount /usr/vice/cache

This will always result in an error that /usr/vice/cache is busy:

      # umount /usr/vice/cache
      umount: /usr/vice/cache: device is busy
      umount: /usr/vice/cache: device is busy

* Accessing an AFS volume over our VPN results in an immediate kernel
panic. The panic message reports many "Unable to handle kernel NULL
pointer dereference at virtual address" errors followed by "Recursive die()
failure, output suppressed" and "<0>Kernel panic - not syncing: Fatal
exception in interrupt". This is present only on 1 of 2 laptops running
FC3, but is 100% repeatable on the failing laptop.


* Copying large files (~450 MB) into AFS from non-AFS partitions results
in a kernel oops. The error reported is:


rxi_Start: xmit list overflowed<1>Unable to handle kernel paging request
at virtual address ffffffff


This problem is also 100% repeatable. 'fs getcache' does not report that
the cache is full. I've attached a file gti-largefile-copy-oops.txt that
is the "soft" kernel oops.


* Random cache consistency problems. A file will be present in the
filesystem and viewable on other machines but not on the FC3 host. fs
flush does not always solve this problem; however, another client operating
on the same directory (e.g. touch hi) seems to unstick the client. We do
have one test case that seems to always generate this problem, but it's not
very portable for others to test as it requires our internal package
management software. Rudy Maceyko is going to test this with 1.3.75
shortly.
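
As a rough aid for the unmount problem in the first item above, the following Python sketch checks the two preconditions from the reproduction steps: that libafs is no longer listed as a loaded module, and that the cache partition is still mounted. The function names and the sample text are hypothetical and only illustrate the /proc/modules and /proc/mounts line layouts; nothing here comes from the OpenAFS tree.

```python
def module_loaded(proc_modules_text, name):
    """True if `name` is the first field of any line of /proc/modules-style text."""
    return any(
        line.split()[0] == name
        for line in proc_modules_text.splitlines()
        if line.strip()
    )


def is_mounted(proc_mounts_text, mount_point):
    """True if `mount_point` is the second field of any /proc/mounts-style line."""
    for line in proc_mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[1] == mount_point:
            return True
    return False


# Illustrative sample text only (made-up device name, sizes and addresses).
sample_modules = "ext3 116809 2 - Live 0x2289a000\njbd 74969 1 ext3, Live 0x22886000\n"
sample_mounts = "/dev/hda2 / ext3 rw 0 0\n/dev/hda5 /usr/vice/cache ext3 rw 0 0\n"

if not module_loaded(sample_modules, "libafs") and is_mounted(sample_mounts, "/usr/vice/cache"):
    print("libafs is unloaded but /usr/vice/cache is still mounted; umount should be safe to try")
```

On a real client the two strings would be read from /proc/modules and /proc/mounts. If this reports libafs as unloaded and umount still says the device is busy, that matches the behaviour described above.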


These are our current problems with the 1.3.7x series. We have not
tested 1.3.7x on any other Linux release because we're focusing on moving
forward with Fedora 3 and RHEL 4 preparations. So I can't speak to these
problems existing on, for example, FC1.


We are building the RPMs with a modified specfile. We're working to
merge our changes back into the mainline spec file and provide that to the
community. I've attached all of the patches we're applying to the source
tree since they're all small. Their descriptions are:


  openafs-1.2.11-no_old_gid_t.patch - Support for AMD 64

  openafs-1.2.11-res_search.patch - resolver patch

  openafs-1.3.75-afskvers-autoconf-fix.patch - Fix --with-afs-system

  26syscall.patch - Hard-sets the build process to use sys_call_table

  afs.initd.patch - Removes modload logic in favor of symlinks
                    to /lib/modules

  openafs-krb5-2.0-afsconf.patch - Fixes call to afsconf_AddKey()
                                   for afs-krb5

I've held off reporting this for a little bit because I've not had time to
properly test or debug any of these. Let me know what we can do to further
debug these problems.


-- Jason McCormick
CERT Infrastructure Team
[EMAIL PROTECTED] ** 412-268-7961
<gti-largefile-copy-oops.txt><26syscall.patch><afs.initd.patch><openafs-1.2.11-no_old_gid_t.patch><openafs-1.2.11-res_search.patch><openafs-1.3.74-admin_tools.klog.patch><openafs-krb5-2.0-afsconf.patch><openafs-1.3.75-afskvers-autoconf-fix.patch>

_______________________________________________ OpenAFS-info mailing list [EMAIL PROTECTED] https://lists.openafs.org/mailman/listinfo/openafs-info
