What an adventure this turned into. It turns out that when I had to relabel the physical volumes, I got two of them backwards (I realized this when I checked /proc/fs/lustre/devices), and somehow this was tripping things up. I swapped them back using pvremove and pvcreate and remounted; after a few minutes the clients reconnected and the system is happy again.
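For anyone hitting the same thing, a swapped pair of PV labels can be spotted by comparing the UUIDs recorded in an LVM metadata backup (typically under /etc/lvm/archive) with the UUIDs the devices report now. The following is only a rough sketch of that comparison, not what was actually run here: the function names, the sample metadata and the UUIDs are all made up for illustration, and the "device" field in LVM metadata is only a hint that can change between boots. The observed mapping is assumed to have been collected separately, for example from `pvs -o pv_name,pv_uuid`.

```python
import re

# Matches one entry of the physical_volumes section in an LVM metadata
# backup: the PV name, its UUID ("id") and the device hint.
PV_ENTRY = re.compile(
    r'(pv\d+)\s*\{\s*id\s*=\s*"([^"]+)"\s*device\s*=\s*"([^"]+)"'
)


def archived_pvs(metadata_text):
    """Return {device_hint: uuid} for every PV entry found in the text."""
    return {dev: uuid for _name, uuid, dev in PV_ENTRY.findall(metadata_text)}


def mismatched(archived, observed):
    """Return sorted devices whose observed UUID differs from the archived one."""
    return sorted(dev for dev, uuid in archived.items()
                  if observed.get(dev) != uuid)


# Hypothetical sample data, not taken from the thread.
SAMPLE = '''
physical_volumes {
    pv0 {
        id = "AAAA-1111"
        device = "/dev/sdb"
    }
    pv1 {
        id = "BBBB-2222"
        device = "/dev/sdc"
    }
}
'''

# Hypothetical "current" state in which the two labels were swapped.
OBSERVED = {"/dev/sdb": "BBBB-2222", "/dev/sdc": "AAAA-1111"}

if __name__ == "__main__":
    print(mismatched(archived_pvs(SAMPLE), OBSERVED))
```

Any device listed by `mismatched` is a candidate for having been relabeled with the wrong UUID; an empty list means the backup and the current labels agree.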

On Mon, Jul 2, 2012 at 12:42 AM, David Noriega <[email protected]> wrote:
> Sorry for the rushed email. For some reason the LVM metadata got
> screwed up, managed to restore it, though now running into another
> issue. I've mounted the OSTs yet it seems they are not all
> cooperating. One of the OSTs will stay listed as Resource Unavailable
> and this seems to be the main message on the OSS node:
>
> LustreError: 137-5: UUID 'lustre-OST0002_UUID' is not available for
> connect (no target)
> LustreError: Skipped 470 previous similar messages
> LustreError: 5214:0:(ldlm_lib.c:1914:target_send_reply_msg()) @@@
> processing error (-19) req@ffff8103ffc73400 x1404513746630678/t0
> o8-><?>@<?>:0/0 lens 368/0 e 0 to 0 dl 1341207057 ref 1 fl
> Interpret:/0/0 rc -19/0
> LustreError: 5214:0:(ldlm_lib.c:1914:target_send_reply_msg()) Skipped
> 470 previous similar messages
>
> I've tried remounting this OST on the other data node but it still won't
> connect from the client side. I've even rebooted the MDS and still no
> go. I've run e2fsck to check the OSTs and found no issues, and the disk
> arrays report no problems on their end, the fibre connections are good,
> and the multipath driver doesn't report anything (these are Sun disk
> arrays, so using the rdac driver instead of the basic multipath daemon).
>
> On the client side I'll see this:
> Lustre: 3289:0:(client.c:1476:ptlrpc_expire_one_request()) @@@ Request
> x1404591888147958 sent from lustre-OST0002-osc-ffff8104104ad800 to NID
> 192.168.5.101@tcp 0s ago has failed due to network error (30s prior to
> deadline).
> req@ffff81015113b400 x1404591888147958/t0
> o8->[email protected]@tcp:28/4 lens 368/584 e 0 to 1
> dl 1341187631 ref 1 fl Rpc:N/0/0 rc 0/0
>
> Lustre: 3290:0:(import.c:517:import_select_connection())
> lustre-OST0002-osc-ffff8104104ad800: tried all connections, increasing
> latency to 22s
> Lustre: 3290:0:(import.c:517:import_select_connection()) Skipped 39
> previous similar messages
>
>
> On Sun, Jul 1, 2012 at 8:10 PM, Mark Day <[email protected]> wrote:
>> Does the device show up in /dev ?
>> Have you physically checked for Fibre/SAS connectivity, RAID controller
>> errors etc?
>>
>> You may need to supply more information about your setup. It sounds more
>> like a RAID/disk issue than a Lustre issue.
>>
>> ________________________________
>> From: "David Noriega" <[email protected]>
>> To: [email protected]
>> Sent: Monday, 2 July, 2012 8:51:18 AM
>> Subject: [Lustre-discuss] Lustre missing physical volume
>>
>>
>> Just recently used heartbeat to fail over resources so that I could
>> power down a lustre node to add more RAM and failed back to do the
>> same to our second lustre node. Only then do I find that now our
>> lustre install is missing a physical volume out of LVM. pvscan only
>> shows three out of four partitions.
>>
>> Any hints? I've tried some recovery steps in LVM with pvcreate using
>> the archived config for the missing PV but no luck, says no device
>> with such uuid. I'm lost on what to do now. This is lustre 1.8.4
>> _______________________________________________
>> Lustre-discuss mailing list
>> [email protected]
>> http://lists.lustre.org/mailman/listinfo/lustre-discuss
>>
>
>
>
> --
> David Noriega
> CSBC/CBI System Administrator
> University of Texas at San Antonio
> One UTSA Circle
> San Antonio, TX 78249
> Office: BSE 3.112
> Phone: 210-458-7100
> http://www.cbi.utsa.edu

--
David Noriega
CSBC/CBI System Administrator
University of Texas at San Antonio
One UTSA Circle
San Antonio, TX 78249
Office: BSE 3.112
Phone: 210-458-7100
http://www.cbi.utsa.edu
_______________________________________________
Lustre-discuss mailing list
[email protected]
http://lists.lustre.org/mailman/listinfo/lustre-discuss
