Nathan,
I started out with IP addresses of 10.2.9.1 (MDS), 10.2.9.2 (standby MDS), 10.2.9.3 (OSS), and 10.2.9.4 (peer OSS). I created a single MDT and a single OST, using the following commands:

MDS# mkfs.lustre --reformat --fsname hss2 --device-size=10000 --mgs --mdt --mkfsoptions=' -O extents,dir_index,uninit_groups' --mgsnode=10.2....@o2ib0 /dev/mapper/map0

OSS# mkfs.lustre --reformat --ost --index=0 --mkfsoptions=' -O extents,dir_index,uninit_groups ' --fsname hss2 --device-size=100000 --mgsnode=10.2....@o2ib0 /dev/mapper/map0

I mounted the servers, mounted a client, created a few files, then unmounted the client, unmounted the servers, and rebooted the clients and servers. Once the servers were back up, I ran the following on the MDS and OSS, respectively:

MDS# tunefs.lustre --erase-param --mgsnode=10.2.9....@o2ib0 --failnode=10.2.9....@o2ib0 /dev/mapper/map0

OSS# tunefs.lustre --erase-param --failnode=10.2.9....@o2ib0 --mgsnode=10.2.9....@o2ib0 --mgsnode=10.2.9....@o2ib0 /dev/mapper/map0

Then, I removed last_rcvd from the MDT and OST. Then, I changed the IP addresses to 10.2.9.201 (MDS), 10.2.9.202 (standby MDS), 10.2.9.203 (OSS), and 10.2.9.204 (peer OSS). I mounted the MDT and OST. After a short while, I got the following errors on the MDS:

Lustre: 4567:0:(client.c:1464:ptlrpc_expire_one_request()) @@@ Request x1343087831941136 sent from hss2-OST0000-osc to NID 10.2.9....@o2ib 0s ago has failed due to network error (5s prior to deadline).
  r...@ffff810213b5e400 x1343087831941136/t0 o8->[email protected]@o2ib:28/4 lens 368/584 e 0 to 1 dl 1280868405 ref 1 fl Rpc:N/0/0 rc 0/0
Lustre: 4568:0:(import.c:517:import_select_connection()) hss2-OST0000-osc: tried all connections, increasing latency to 1s
Lustre: 4567:0:(client.c:1464:ptlrpc_expire_one_request()) @@@ Request x1343087831941137 sent from hss2-OST0000-osc to NID 10.2....@o2ib 6s ago has timed out (6s prior to deadline).
  r...@ffff810213b5e400 x1343087831941137/t0 o8->[email protected]@o2ib:28/4 lens 368/584 e 0 to 1 dl 1280868412 ref 2 fl Rpc:N/0/0 rc 0/0
LustreError: 4567:0:(lib-move.c:2441:LNetPut()) Error sending PUT to 12345-10.2.9....@o2ib: -113

Note that the old IP address of the old OST (10.2.9.203) is still listed. How can I change that?

The client is also seeing old IP addresses, this time the MDS's 10.2.9.1:

Lustre: Request x55 sent from hss2-MDT0000-mdc-ffff81007981d800 to NID 10.2....@o2ib 5s ago has timed out (limit 5s).
Lustre: Skipped 9 previous similar messages
Lustre: 6433:0:(import.c:507:import_select_connection()) hss2-MDT0000-mdc-ffff81007981d800: tried all connections, increasing latency to 50s
Lustre: 6433:0:(import.c:507:import_select_connection()) Skipped 4 previous similar messages

Any help is appreciated.

Thanks.

-Roger

________________________________
From: Roger Spellman
Sent: Tuesday, August 03, 2010 4:22 PM
To: 'Nathan Rutman'
Cc: [email protected]
Subject: RE: [Lustre-discuss] Problem with write_conf

Nathan,

Thanks. That works great.

Are there any tricks involved in also making a non-redundant system redundant at the same time? E.g., can I just do:

MDS# tunefs.lustre --erase-param --mgsnode=10.2.9....@o2ib0 --failnode=10.2.9....@o2ib0 /dev/mapper/map0

OSS# tunefs.lustre --erase-param --failnode=10.2.9....@o2ib0 --mgsnode=10.2.9....@o2ib0 --mgsnode=10.2.9....@o2ib0 /dev/mapper/map0

Is the OSS's NID stored anywhere on the OST?
-Roger

________________________________
From: Nathan Rutman [mailto:[email protected]]
Sent: Tuesday, August 03, 2010 4:05 PM
To: Roger Spellman
Cc: [email protected]
Subject: Re: [Lustre-discuss] Problem with write_conf

On Aug 3, 2010, at 12:49 PM, Roger Spellman wrote:

If I change the NIDs, and if I don't remove /mnt/mdt/CONFIGS/*-client, then I get the following when I try mounting a client (note that 10.2.9.1 is the OLD address):

mount.lustre: mount 10.2....@o2ib:/hss2 at /mnt/lustre-hss2 failed: Cannot send after transport endpoint shutdown

Don't mount with the old address :) This is not contained in the config log; this is the MGS address the client needs to talk to in order to GET the config log. It needs to point to the current IP of the MGS. Maybe you've stuck this in /etc/fstab, or perhaps your DNS name resolution of the MGS's common name hasn't been updated.
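
As a rough illustration of the /etc/fstab check suggested above, the sketch below lists fstab entries of type "lustre" whose device field still names a given old address. The function name stale_lustre_mounts is made up for this example, and it assumes the usual fstab layout (device, mount point, type, ...) with the MGS NID(s) written as address@network and separated by colons ahead of the file system name. It only reads the file; it does not touch any Lustre configuration.

```shell
#!/bin/sh
# Hypothetical helper (not a Lustre tool): print the device and mount point
# of every "lustre" fstab entry whose MGS NID list contains the old address.
# Usage: stale_lustre_mounts OLD_IP [FSTAB_PATH]
stale_lustre_mounts() {
    old_ip=$1
    fstab=${2:-/etc/fstab}
    awk -v ip="$old_ip" '
        /^[[:space:]]*#/ { next }                 # skip comment lines
        $3 == "lustre" {
            # Device field looks like nid[:nid...]:/fsname
            n = split($1, parts, ":")
            for (i = 1; i <= n; i++) {
                split(parts[i], nid, "@")         # nid[1] is the address part
                if (nid[1] == ip) { print $1, $2; break }
            }
        }
    ' "$fstab"
}
```

Running, for example, stale_lustre_mounts 10.2.9.1 on a client prints nothing if no Lustre entry in /etc/fstab still refers to the old MGS address. A stale DNS entry for the MGS would need to be checked separately.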
_______________________________________________ Lustre-discuss mailing list [email protected] http://lists.lustre.org/mailman/listinfo/lustre-discuss
