Okay, this is my first time messing with Lustre, and I've hit my first
real problem that doesn't seem to have been discussed yet (at least I
haven't found anything searching Bugzilla and the discuss/devel mailing
lists). So I'd like to post my results.

I'm using the latest Xen kernel on ia64 (2.6.16.33) and attempting to
patch Lustre into it.

I've been using Lustre 1.4.7.3 for a while, with the sles10 series of
patches.  The sles10 series is close, but the vfs_intent patch doesn't
apply quite right: it misses a define in fs.h.  With that define added,
everything works fine.

====
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 872042f..95487ec 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -265,6 +265,7 @@ typedef void (dio_iodone_t)(struct kiocb
#define ATTR_KILL_SUID 2048
#define ATTR_KILL_SGID 4096
#define ATTR_FILE      8192
+#define ATTR_NO_BLOCK  32768   /* Return EAGAIN and don't block on long truncates */

/*
 * This is the Inode Attributes structure, used for notify_change().  It
====

Also, there were tons of #if <insert define here> checks on configure
script defines that should be done as #ifdef, not #if. I know a lot of
these were already fixed, but there are more...
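
To illustrate the #if vs. #ifdef point, here is a minimal sketch. The
macro and function names (HAVE_FEATURE_A, HAVE_FEATURE_B,
feature_a_enabled, feature_b_enabled) are made up for the example and
are not taken from the Lustre tree:

```c
/* Hypothetical configure-style macros, for illustration only. */
#define HAVE_FEATURE_A 1      /* pretend configure found feature A */
/* HAVE_FEATURE_B is intentionally left undefined. */

/* #ifdef only asks "is this macro defined?", which is well defined
 * whether or not configure emitted the macro. */
#ifdef HAVE_FEATURE_A
int feature_a_enabled(void) { return 1; }
#else
int feature_a_enabled(void) { return 0; }
#endif

#ifdef HAVE_FEATURE_B
int feature_b_enabled(void) { return 1; }
#else
int feature_b_enabled(void) { return 0; }
#endif

/* By contrast, "#if HAVE_FEATURE_B" silently treats the undefined
 * macro as 0 (gcc warns about it with -Wundef), and it does not
 * compile at all if the macro is defined with an empty body, e.g.
 * "#define HAVE_FEATURE_B". */
```

With #ifdef the result is the same whether configure defines the macro
with a value, with an empty body, or not at all.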

Also, on __WORDSIZE vs. BITS_PER_LONG: it'd be really nice if
everything used one or the other, not both. I got lots of "__WORDSIZE
not defined" warnings from lustre_user.h, which happens to be included
by tons of other files.  So every compiled source comes up with this
warning, and that makes me nervous.
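
One possible way to settle on a single name is sketched below. This is
only an assumption about how it could be done, not what lustre_user.h
actually does; EXAMPLE_WORDSIZE and example_wordsize are hypothetical
names:

```c
#include <limits.h>

/* Pick one word-size macro and fall back gracefully, so that no
 * "#if" ever evaluates an undefined name. */
#if defined(__WORDSIZE)
# define EXAMPLE_WORDSIZE __WORDSIZE        /* glibc userspace */
#elif defined(BITS_PER_LONG)
# define EXAMPLE_WORDSIZE BITS_PER_LONG     /* kernel builds */
#elif ULONG_MAX == 0xffffffffUL
# define EXAMPLE_WORDSIZE 32                /* derived from <limits.h> */
#else
# define EXAMPLE_WORDSIZE 64
#endif

/* Small accessor so the chosen value can be inspected at run time. */
int example_wordsize(void)
{
    return EXAMPLE_WORDSIZE;
}
```

The rest of the code would then test only EXAMPLE_WORDSIZE, so the
undefined-macro warning could not come up in every file that includes
the header.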

And finally, the d_instantiate_unique check in configure came up wrong
for me. 2.6.16.33's d_instantiate_unique is the same as 2.6.15.7's;
2.6.17.14 is the one that has the fix. So lustre/llite/namei.c really
should check that the version is strictly less than 2.6.17, not
2.6.16, and the configure check should also be fixed to make sure it
works. I'd give you a patch to do that, but I haven't looked into how
the check is done.
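
For illustration, the version comparison I mean looks roughly like the
sketch below. The helper name lacks_d_instantiate_unique_fix is
hypothetical and this is not the actual code from namei.c; the
KERNEL_VERSION macro is redefined here with the same encoding
<linux/version.h> uses for 2.6 kernels so the example stands alone:

```c
/* Same encoding as the kernel's KERNEL_VERSION() for 2.6 kernels. */
#define KERNEL_VERSION(a, b, c) (((a) << 16) + ((b) << 8) + (c))

/* Hypothetical helper: returns 1 for kernels older than 2.6.17,
 * i.e. those that, per the observation above, still have the old
 * d_instantiate_unique. Note the bound is 2.6.17, not 2.6.16. */
int lacks_d_instantiate_unique_fix(unsigned int version_code)
{
    return version_code < KERNEL_VERSION(2, 6, 17);
}
```

As far as I know the fourth component of a stable release (the .33 in
2.6.16.33) is not part of LINUX_VERSION_CODE, so 2.6.16.33 compares as
2.6.16 and only a bound of 2.6.17 catches it.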

But after I fixed all the warnings and all the patches applied
successfully, I was able to start Lustre and it worked fine. A
distributed filesystem seems rather ridiculous on one box, but it
worked okay until I tried to put data into it. I downloaded a 250M ISO
off the web, pulled down its md5sum, and verified that I had the whole
file. Then I put the file into the Lustre filesystem. When I md5sum'ed
the ISO in the Lustre filesystem it came up wrong; I did it a second
time and it came up different from the first and still wrong. I didn't
notice anything from Lustre in dmesg while this was happening.

Then, when I tried to bring down the filesystem on the client, the MDS
crashed and this error came out in dmesg:

[57940.076201] LustreError:
3891:0:(socklnd.c:1287:ksocknal_close_conn_locked())
ASSERTION(peer->ksnp_error == 0) failed
[57940.076235] LustreError:
3891:0:(tracefile.c:419:libcfs_assertion_failed()) LBUG
[57940.076246] Lustre:
3891:0:(linux-debug.c:156:libcfs_debug_dumpstack()) showing stack for
process 3891
[57940.076255] socknal_sd00  R  running task       0  3891      1
3892  3767 (L-TLB)
[57940.076284]
[57940.076286] Call Trace:
[57940.076350] scheduling while atomic: socknal_sd00/0x00000100/3891
[57940.076384]
[57940.076386] Call Trace:
[57940.076410]  [<a00000010001b490>] show_stack+0x50/0xa0
[57940.076413]                                 sp=e00000000afb7b60
bsp=e00000000afb1290
[57940.076433]  [<a00000010001b510>] dump_stack+0x30/0x60
[57940.076436]                                 sp=e00000000afb7d30
bsp=e00000000afb1278
[57940.076464]  [<a0000001004e0160>] schedule+0xa0/0x1400
[57940.076466]                                 sp=e00000000afb7d30
bsp=e00000000afb11a8
[57940.076542]  [<a0000002002868b0>] libcfs_debug_dumplog+0x2b0/0x320
[libcfs]
[57940.076545]                                 sp=e00000000afb7d30
bsp=e00000000afb1190
[57940.076583]  [<a00000020027b670>] lbug_with_loc+0x70/0xe0 [libcfs]
[57940.076586]                                 sp=e00000000afb7d60
bsp=e00000000afb1160
[57940.076622]  [<a00000020028c160>] libcfs_assertion_failed+0xa0/0xc0
[libcfs]
[57940.076625]                                 sp=e00000000afb7d60
bsp=e00000000afb1128
[57940.076672]  [<a0000002003b51a0>]
ksocknal_close_conn_locked+0x80/0x5a0 [ksocklnd]
[57940.076674]                                 sp=e00000000afb7d60
bsp=e00000000afb10c0
[57940.076704]  [<a0000002003bc6d0>]
ksocknal_close_peer_conns_locked+0x90/0xe0 [ksocklnd]
[57940.076707]                                 sp=e00000000afb7d60
bsp=e00000000afb1080
[57940.076736]  [<a0000002003bc790>]
ksocknal_close_conn_and_siblings+0x70/0xc0 [ksocklnd]
[57940.076739]                                 sp=e00000000afb7d60
bsp=e00000000afb1040
[57940.076771]  [<a0000002003c9e60>]
ksocknal_process_receive+0x640/0xae0 [ksocklnd]
[57940.076774]                                 sp=e00000000afb7d60
bsp=e00000000afb1000
[57940.076802]  [<a0000002003cacc0>] ksocknal_scheduler+0x4a0/0xe80
[ksocklnd]
[57940.076805]                                 sp=e00000000afb7da0
bsp=e00000000afb0f88
[57940.076824]  [<a00000010001d790>] kernel_thread_helper+0x30/0x60
[57940.076826]                                 sp=e00000000afb7e30
bsp=e00000000afb0f60
[57940.076846]  [<a0000001000110c0>] start_kernel_thread+0x20/0x40
[57940.076849]                                 sp=e00000000afb7e30
bsp=e00000000afb0f60
[57949.451304] BUG: soft lockup detected on CPU#0!
[57949.451340] Modules linked in: fsfilt_ldiskfs ldiskfs mds lov osc mdc
ptlrpc obdclass lvfs ksocklnd lnet libcfs ipv6 autofs4 sunrpc af_packet
dm_snapshot dm_zero dm_mirror ext3 mbcache jbd dm_mod ide_disk cmd64x
ide_core mptsas mptspi mptfc scsi_transport_fc mptscsih sym53c8xx
mptbase sd_mod
[57949.451466]
[57949.451468] Pid: 3896, CPU 0, comm:       socknal_reaper
[57949.451480] psr : 00000210081a6010 ifs : 8000000000000004 ip  :
[<a0000001004e57c1>]    Not tainted
[57949.451497] ip is at _write_lock_bh+0x41/0xa0
[57949.451505] unat: 0000000000000000 pfs : 800000000000050e rsc :
000000000000000b
[57949.451513] rnat: 0000000000000000 bsps: 0000000000000000 pr  :
0000000000006681
[57949.451521] ldrs: 0000000000000000 ccv : 0000000000000000 fpsr:
0009804c8a70433f
[57949.451529] csd : 0000000000000000 ssd : 0000000000000000
[57949.451539] b0  : a0000002003bb980 b6  : a000000100348ea0 b7  :
a000000100068e30
[57949.451549] f6  : 1003e0000000000000028 f7  : 1003e28f5c28f5c28f5c3
[57949.451557] f8  : 1003e00000000000000fa f9  : 1003e0000000032000000
[57949.451567] f10 : 1003e000000003b9aca00 f11 : 1003ed6bf94d5e57a42bd
[57949.451577] r1  : a00000010092b5a0 r2  : 0000000080000000 r3  :
a00000010073cbf0
[57949.451587] r8  : 0000000000000000 r9  : fffffffffff00001 r10 :
fffffffffff04c18
[57949.451598] r11 : fffffffffff00000 r12 : e0000000093efdf0 r13 :
e0000000093e8000
[57949.451608] r14 : e0000000093e8f10 r15 : 0000000000000100 r16 :
e0000000093e8f10
[57949.451617] r17 : 0000000000000000 r18 : 0000000000000001 r19 :
000000003fffff00
[57949.451626] r20 : 0000000000000100 r21 : 0000000000000000 r22 :
ffffffffffff0048
[57949.451634] r23 : e0000000093e8f10 r24 : 0000000000000000 r25 :
0000000000000001
[57949.451644] r26 : a0000002003e0ec8 r27 : 0000000000000000 r28 :
a0000002003e0e90
[57949.451654] r29 : 0000000080000000 r30 : 0000000000000000 r31 :
e00000000d433700
[57949.451668]
[57949.451670] Call Trace:
[57949.451688]  [<a00000010001b490>] show_stack+0x50/0xa0
[57949.451691]                                 sp=e0000000093efa40
bsp=e0000000093e9288
[57949.451722]  [<a00000010001bd60>] show_regs+0x820/0x840
[57949.451724]                                 sp=e0000000093efc10
bsp=e0000000093e9240
[57949.451751]  [<a0000001000d5e90>] softlockup_tick+0x150/0x180
[57949.451753]                                 sp=e0000000093efc10
bsp=e0000000093e9210
[57949.451776]  [<a0000001000a2f30>] do_timer+0x990/0x9c0
[57949.451778]                                 sp=e0000000093efc20
bsp=e0000000093e91c0
[57949.451802]  [<a0000001000411e0>] timer_interrupt+0x200/0x3c0
[57949.451804]                                 sp=e0000000093efc20
bsp=e0000000093e9180
[57949.451824]  [<a0000001000d65f0>] handle_IRQ_event+0x170/0x240
[57949.451827]                                 sp=e0000000093efc20
bsp=e0000000093e9140
[57949.451846]  [<a0000001000d6980>] __do_IRQ+0x2c0/0x3e0
[57949.451849]                                 sp=e0000000093efc20
bsp=e0000000093e90f0
[57949.451875]  [<a000000100341de0>] evtchn_do_upcall+0x160/0x240
[57949.451877]                                 sp=e0000000093efc20
bsp=e0000000093e9068
[57949.451899]  [<a000000100068ce0>] xen_event_callback+0x3a0/0x3e0
[57949.451902]                                 sp=e0000000093efc20
bsp=e0000000093e9068

Any help on this would be appreciated. I'm trying to build 1.4.8 now
to see if any of the old 1.4.7.3 patches I had still need to be
applied.

Thanks,
David Brown
