Hi Terry,

Unfortunately you did not say whether you have omfs enabled or not. I have a cluster of about 40 machines (user nodes) with openafs and openmosix, but with omfs disabled on all of them. The openmosix options I use cause programs that do I/O to become locked to the originating node; this fixed most of the perl problems for me. I would suggest trying to disable omfs, if it is enabled.

Thanks
Clement Onime

>
> 1. Re: OpenAFS on 2.4.26 ? OpenMosix ? (Terry Gliedt)
> 2. Re: OpenAFS on 2.4.26 ? OpenMosix ? (Jeffrey Hutzelman)
> 3. Re: 1.3.75 on FC3 (Matthew N. Andrews)
>
> --__--__--
>
> Message: 1
> Date: Wed, 15 Dec 2004 14:02:26 -0500
> From: Terry Gliedt <[EMAIL PROTECTED]>
> Organization: Biostatistics
> To: [EMAIL PROTECTED]
> Subject: Re: [OpenAFS-devel] OpenAFS on 2.4.26 ? OpenMosix ?
>
> The previous post has not been the last of the story. We tried one more
> time, this time moving to a 2.4.27 kernel and OpenMosix
> patch-2.4.27-om-20041102.bz2. The OpenAFS code remained unchanged.
>
> We discovered that pinning a task to a particular processor allowed the
> tasks to run to completion. At the same time we discovered that using
> migrate to move a task from one processor to another (even for identical
> machines hardware-wise), resulted in a segment fault.
>
> Eventually we found three examples of code for testing. Two failed every
> time, sometimes quickly, sometimes not. One thing they had in common was
> the use of Perl. We speculated the problem was related to Perl threads,
> even though the Perl code was very simple and used no threads. I created
> my own version of Perl, with no threads and all instances of failures
> stopped.
>
> Some of you may not be surprised by this, but I sure was. Obviously
> there is something in using a thread-enabled Perl which just does not
> work in OpenMosix. In our experience migrating a task using a
> thread-enabled Perl will fail 100% of the time.
>
> We've replaced FC1 Perl and have a more stable environment. We enabled
> OpenAFS for this environment and have had pretty good success, but not
> complete success. Obtaining tokens at login behaves just as we wanted -
> we're out of the password business.
>
> Reading AFS data seems to be solid. We've not noticed any failures in
> the cache or in copying data (this is hardly a completely solid
> endorsement, but so far, so good).
>
> Writing into AFS volumes, however, is not always successful. Sometimes
> the program (e.g. cp) doing the writing will segment fault. I've seen
> various other write failures that I think had to do with locking, but
> exactly what was going on was unclear.
>
> In one case I got a segment fault in cp and retried the command. The
> kernel got seriously 'sick'. In /var/log/messages I found the messages
> below. The machine was very unresponsive, to the point I rebooted. Nasty!!
>
> The problem could possibly be in OpenMosix (whose mailing list I will
> also post to), but I thought I should tell you folks of my experience in
> case it rings a bell. If anyone is interested in pursuing this further I
> can probably arrange some testing. These problems all seem pretty common
> and can often be reproduced.
>
>
> ####### from /var/log/messages Watch for line wraps
>
> Unable to handle kernel NULL pointer dereference at virtual address 00000004
> printing eip:
> f8b73af8
> *pde = 2bcc0001
> *pte = 00000000
> Oops: 0000
> CPU: 2
> EIP: 0010:[<f8b73af8>] Tainted: PF
> EFLAGS: 00010282
> eax: 20003312 ebx: f8c4be14 ecx: ec6b5dfc edx: 00000000
> esi: f8c4c038 edi: ec6b5da0 ebp: ec6b5da0 esp: ecbbfe40
> ds: 0018 es: 0018 ss: 0018
> Process cp (pid: 3288, stackpage=ecbbf000)
> Stack: f9417000 ecbbe000 00000000 f8c4be14 f8c4c038 ecbbfe90 ec6b5da0 f8b776b2
> ec6b5da0 ec6b5dfc 00000002 ecbbfe90 c0360a00 ec71ad20 00000001 f9417000
> ec6b5dfc f8c4c038 ec6b5dfc 0000ffff 0001e194 00000040 f8ba22c0 f8b78a00
> Call Trace: [<f8b776b2>] [<f8ba22c0>] [<f8b78a00>] [<c01611ed>] [<c0161a22>]
> [<c01620c9>] [<c0162429>] [<c0153443>] [<c016c8d1>] [<c0155f88>] [<c01befd5>]
> [<c01bf0df>] [<c010b8bc>]
>
> Code: 39 42 04 0f 84 c7 00 00 00 e8 3a e7 ff ff 89 c5 50 8d 44 24
>
>
> Terry Gliedt wrote:
>> This is a followup on my experience with OpenAFS and OpenMosix. I moved
>> user's HOME to a local disk, rather than in AFS and got everything
>> configured as I wanted.
>> Then I opened the machines (one gateway + one
>> dedicated node in the cluster) to one user.
>>
>> She started a simulation which consisted of a Perl program driving a C
>> program, running it several tens of thousands of times. The program was
>> running on the remote cluster. This is a computationally heavy task with
>> very little in or out I/O (typical for our world). The program was not
>> running in an AFS directory, but in a directory on a local disk.
>>
>> After ten minutes or so her task segment faulted. This same software has
>> been running on several dozen other machines for the past several weeks,
>> so it's not her problem. I disabled AFS in the rc.d scripts and
>> rebooted. The same tasks have been running for three days.
>>
>> I'm afraid there is some fairly basic interaction between OpenAFS and
>> OpenMosix. I have a small window of opportunity to get some debug
>> information if someone wants to pursue this - just give me the details
>> of what you need (and how to get them).
>>
>> Details:
>>
>> Fedora Core 1
>> 2.4.26 kernel
>> patch-2.4.26-om-20041102.bz2 for OpenMosix
>> OpenAFS 1.3.73
>>
>>
>> Terry Gliedt wrote:
>>
>>> Miles Davis wrote:
>>>
>>>> On Tue, Nov 09, 2004 at 09:15:44AM -0500, Terry Gliedt wrote:
>>>>
>>>>> I can now confirm the combination of a 2.4.26 kernel + 1.3.73
>>>>> OpenAFS works just fine. Adding OpenMosix immediately results
>>>>> in this symptom:
>>>>>
>>>>> SSH with X11 forwarding to OpenMosix+OpenAFS machine
>>>>> Observe messages about a fail in locking .Xauthority file
>>>>>
>>>>> What apparently is happening is that as X11 attempts to add a new
>>>>> entry to .Xauthority, it creates .Xauthority-n and presumably does a
>>>>> move which fails. This results in the user's .Xauthority
>>>>> "disappearing". A simple 'mv .Xauthority-n .Xauthority' allows X11
>>>>> to work properly again.
>>>>>
>>>>> I presume this has something to do with locking, but that's just my
>>>>> guess.
>>>>> I've seen other strangeness in AFS behavior also which may be
>>>>> related (or not), however the ssh scenario I mention above has been
>>>>> my litmus test.
>>>>
>>>> I've had that happen several times on 1.3.73 clients, so it probably
>>>> has nothing to do with openMosix. I haven't tried 1.3.74 yet, but you
>>>> should probably give that a try.
>>>
>>> Well, I did, but that did not help. I really believe this is an
>>> interaction between OpenAFS and OpenMosix. If I apply OpenAFS 1.3.73
>>> to a pure linux 2.4.26 kernel, AFS behaves as expected. Adding
>>> OpenMosix definitely causes the problem. Thanks for the thought.
>>>
>>
>
> --
> =============================================================
> Terry Gliedt [EMAIL PROTECTED] http://www.hps.com/~tpg/
> Biostatistics, Univ of Michigan Personal Email: [EMAIL PROTECTED]
>
> --__--__--
>
> Message: 2
> Date: Wed, 15 Dec 2004 14:48:32 -0500
> From: Jeffrey Hutzelman <[EMAIL PROTECTED]>
> To: Terry Gliedt <[EMAIL PROTECTED]>, [EMAIL PROTECTED]
> Subject: Re: [OpenAFS-devel] OpenAFS on 2.4.26 ? OpenMosix ?
>
> On Wednesday, December 15, 2004 14:02:26 -0500 Terry Gliedt <[EMAIL PROTECTED]> wrote:
>
>>####### from /var/log/messages Watch for line wraps
>>
>> Unable to handle kernel NULL pointer dereference at virtual address
>> 00000004 printing eip:
>> f8b73af8
>> *pde = 2bcc0001
>> *pte = 00000000
>> Oops: 0000
>> CPU: 2
>> EIP: 0010:[<f8b73af8>] Tainted: PF
>> EFLAGS: 00010282
>> eax: 20003312 ebx: f8c4be14 ecx: ec6b5dfc edx: 00000000
>> esi: f8c4c038 edi: ec6b5da0 ebp: ec6b5da0 esp: ecbbfe40
>> ds: 0018 es: 0018 ss: 0018
>> Process cp (pid: 3288, stackpage=ecbbf000)
>> Stack: f9417000 ecbbe000 00000000 f8c4be14 f8c4c038 ecbbfe90 ec6b5da0
>> f8b776b2 ec6b5da0 ec6b5dfc 00000002 ecbbfe90 c0360a00 ec71ad20
>> 00000001 f9417000 ec6b5dfc f8c4c038 ec6b5dfc 0000ffff 0001e194
>> 00000040 f8ba22c0 f8b78a00 Call Trace: [<f8b776b2>] [<f8ba22c0>]
>> [<f8b78a00>] [<c01611ed>] [<c0161a22>] [<c01620c9>] [<c0162429>]
>> [<c0153443>] [<c016c8d1>] [<c0155f88>] [<c01befd5>] [<c01bf0df>]
>> [<c010b8bc>]
>>
>> Code: 39 42 04 0f 84 c7 00 00 00 e8 3a e7 ff ff 89 c5 50 8d 44 24
>
> That's not surprising. In all of the cases you described where a process
> randomly seg faults, you should see output like that in /var/log/messages
> or in dmesg output. There are a wide variety of bad things that, if user
> code does them, cause the program to exit on a signal like SIGSEGV or
> SIGBUS, and drop a core file. In Linux, if one of these things happens in
> kernel code, the process exits on SIGSEGV (no core), and you get an "oops"
> message which contains information about the state of the kernel at the
> time of the failure. That's what the message you quoted is.
>
> Unfortunately, the oops message is not useful in its raw form. All of the
> numbers you see in [<>] are actually addresses inside the kernel. In order
> for the backtrace to be useful, these need to be converted to symbolic form.
> This is usually done automatically by the logging software, if it
> can find the kernel symbol table, which is usually available in a file
> called "System.map". Since the conversion did not happen automatically,
> you will need to either find and use ksymoops, or reconfigure the kernel
> logging software to do the translation, and then reproduce the problem
> again.
>
> The simplest thing to do is to make sure that klogd is able to find the
> System.map file, and that it is not invoked with -x. You will probably get
> the best results by running klogd with -p, so it will reload symbol table
> information when it sees an error (otherwise it may not have a complete set
> of symbols for openafs).
>
>
> FWIW, I have not heard of anyone getting OpenAFS and OpenMosix to work
> together, even to the extent that you've reported so far. We have had
> several reports of failures in the past, though...
>
> -- Jeffrey T. Hutzelman (N3NHS) <[EMAIL PROTECTED]>
> Sr. Research Systems Programmer
> School of Computer Science - Research Computing Facility
> Carnegie Mellon University - Pittsburgh, PA
>
>
> --__--__--
>
> Message: 3
> Date: Wed, 15 Dec 2004 17:36:46 -0800
> From: "Matthew N. Andrews" <[EMAIL PROTECTED]>
> To: [EMAIL PROTECTED]
> Subject: Re: [OpenAFS-devel] 1.3.75 on FC3
>
> d'oh,
>
> here's some more info on my problems with openafs on FC3 x86_64
>
> after looking at dmesg and slapping my forehead I see:
>
> libafs: Unknown symbol ia32_sys_call_table
>
> at this point I looked at acinclude.m4, and tried this patch to force the
> test for ia32_sys_call_table to fail:
>
> ---- cut here ----
> --- acinclude.m4 2004-12-13 11:40:42.000000000 -0800
> +++ acinclude.m4.no_ia32_sys_call_table 2004-12-15 16:31:22.093260576 -0800
> @@ -579,9 +579,7 @@
> if test "x$ac_cv_linux_config_modversions" = "xno" -o $AFS_SYSKVERS -ge 26; then
> AC_MSG_WARN([Cannot determine sys_call_table status.
> assuming it isn't exported])
> ac_cv_linux_exports_sys_call_table=no
> - if test -f "$LINUX_KERNEL_PATH/include/asm/ia32_unistd.h"; then
> - ac_cv_linux_exports_ia32_sys_call_table=yes
> - fi
> + ac_cv_linux_exports_ia32_sys_call_table=no
> else
> LINUX_EXPORTS_INIT_MM
> LINUX_EXPORTS_KALLSYMS_ADDRESS
> ---- cut here ----
>
> this then causes the make to fail when compiling the libafs module with
> these errors:
>
> CC [M] /usr/home/ma3d/rpmbuild/BUILD/openafs-1.3.76/src/libafs/MODLOAD-2.6.9-1.678_FC3smp-MP/AFS_component_version_number.o
> CC [M] /usr/home/ma3d/rpmbuild/BUILD/openafs-1.3.76/src/libafs/MODLOAD-2.6.9-1.678_FC3smp-MP/osi_module.o
> /usr/home/ma3d/rpmbuild/BUILD/openafs-1.3.76/src/libafs/MODLOAD-2.6.9-1.678_FC3smp-MP/osi_module.c: In function `afs_init':
> /usr/home/ma3d/rpmbuild/BUILD/openafs-1.3.76/src/libafs/MODLOAD-2.6.9-1.678_FC3smp-MP/osi_module.c:453: warning: `interruptible_sleep_on' is deprecated (declared at include/linux/wait.h:290)
> /usr/home/ma3d/rpmbuild/BUILD/openafs-1.3.76/src/libafs/MODLOAD-2.6.9-1.678_FC3smp-MP/osi_module.c:462: error: `sys_exit' undeclared (first use in this function)
> /usr/home/ma3d/rpmbuild/BUILD/openafs-1.3.76/src/libafs/MODLOAD-2.6.9-1.678_FC3smp-MP/osi_module.c:462: error: (Each undeclared identifier is reported only once
> /usr/home/ma3d/rpmbuild/BUILD/openafs-1.3.76/src/libafs/MODLOAD-2.6.9-1.678_FC3smp-MP/osi_module.c:462: error: for each function it appears in.)
> /usr/home/ma3d/rpmbuild/BUILD/openafs-1.3.76/src/libafs/MODLOAD-2.6.9-1.678_FC3smp-MP/osi_module.c:463: error: `sys_open' undeclared (first use in this function)
> /usr/home/ma3d/rpmbuild/BUILD/openafs-1.3.76/src/libafs/MODLOAD-2.6.9-1.678_FC3smp-MP/osi_module.c:464: warning: assignment from incompatible pointer type
> make[6]: *** [/usr/home/ma3d/rpmbuild/BUILD/openafs-1.3.76/src/libafs/MODLOAD-2.6.9-1.678_FC3smp-MP/osi_module.o] Error 1
> make[5]: *** [_module_/usr/home/ma3d/rpmbuild/BUILD/openafs-1.3.76/src/libafs/MODLOAD-2.6.9-1.678_FC3smp-MP] Error 2
> make[5]: Leaving directory `/lib/modules/2.6.9-1.678_FC3smp/build'
> make[4]: *** [libafs.ko] Error 2
> make[4]: Leaving directory `/home/ma3d/rpmbuild/BUILD/openafs-1.3.76/src/libafs/MODLOAD-2.6.9-1.678_FC3smp-MP'
> make[3]: *** [linux_compdirs] Error 2
> make[3]: Leaving directory `/home/ma3d/rpmbuild/BUILD/openafs-1.3.76/src/libafs'
> make[2]: *** [libafs] Error 2
> make[2]: Leaving directory `/home/ma3d/rpmbuild/BUILD/openafs-1.3.76'
> make[1]: *** [build] Error 2
> make[1]: Leaving directory `/home/ma3d/rpmbuild/BUILD/openafs-1.3.76'
> make: *** [all] Error 2
>
> Is there a way to get the current openafs code to work on a machine which
> has neither ia32_sys_call_table, nor sys_call_table?
>
> -Matt
>
>
> Matthew N. Andrews wrote:
>> hello,
>>
>> after getting 1.3.75 to compile on a dual processor x86_64 FC3 machine,
>> I am now stuck with a module that fails to load with the following
>> error:
>>
>> # insmod /usr/vice/etc/modload/libafs-2.6.9-1.678_FC3smp-amd64.ko
>> insmod: error inserting '/usr/vice/etc/modload/libafs-2.6.9-1.678_FC3smp-amd64.ko': -1 Unknown symbol in module
>>
>>
>> I remember others seeing this same error earlier on the list, but
>> couldn't find a reference to what the problem was then. anyone have any
>> ideas?
>>
>> thanks for any help.
>>
>> -Matthew Andrews
>> _______________________________________________
>> OpenAFS-devel mailing list
>> [EMAIL PROTECTED]
>> https://lists.openafs.org/mailman/listinfo/openafs-devel
>
> --__--__--
>
> End of OpenAFS-devel Digest
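
P.S. For anyone who wants to see what the address-to-symbol step Jeffrey describes in Message 2 amounts to, here is a rough Python sketch. It is not ksymoops or klogd, only an illustration of the lookup: pull the bracketed addresses out of the oops text and find the nearest System.map symbol at or below each one. The function names, the `max_offset` cutoff and the sample symbol names are all made up for this example; a real System.map would come from your own kernel build.

```python
import bisect
import re

# Hypothetical System.map excerpt: the symbol names are invented for
# illustration and do not come from any real kernel build.
SAMPLE_SYSTEM_MAP = """\
c0161000 T hypothetical_write_helper
c0161a00 T hypothetical_generic_write
c0162400 T hypothetical_sys_write
"""


def load_system_map(text):
    """Parse 'address type name' lines into a sorted list of (address, name)."""
    symbols = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) < 3:
            continue  # skip blank or malformed lines
        try:
            addr = int(parts[0], 16)
        except ValueError:
            continue  # first field was not a hex address
        symbols.append((addr, parts[2]))
    symbols.sort()
    return symbols


def trace_addresses(oops_text):
    """Return the addresses written as [<hex>] in an oops message."""
    return [int(h, 16) for h in re.findall(r"\[<([0-9a-fA-F]+)>\]", oops_text)]


def resolve(symbols, addr, max_offset=0x10000):
    """Return 'name+0xoffset' for the closest symbol at or below addr.

    max_offset is an arbitrary cutoff chosen for this sketch: an address
    far beyond the nearest symbol is reported as unresolved (None)
    instead of being attributed to an unrelated function.
    """
    addrs = [a for a, _ in symbols]
    i = bisect.bisect_right(addrs, addr) - 1
    if i < 0:
        return None  # address lies below the first known symbol
    base, name = symbols[i]
    offset = addr - base
    if offset > max_offset:
        return None
    return f"{name}+{offset:#x}"


def symbolize(oops_text, symbols):
    """Map each traced address to a symbolic form, or None if unresolved."""
    return [(addr, resolve(symbols, addr)) for addr in trace_addresses(oops_text)]
```

Called as `symbolize(oops_text, load_system_map(open("System.map").read()))`, this returns one pair per address in the trace. As far as I understand, System.map only covers the kernel image itself, so the f8... addresses in Terry's trace (which look like module addresses) would come back unresolved here. That would fit Jeffrey's remark that the logger may not have a complete set of symbols for openafs unless it reloads them.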
