A bit more information. /dev/sde on lmd01 is an iSCSI volume. I made a snapshot of it, mounted it with "-t ldiskfs", and found the following information in place at the mount point:
r...@lmd01 /mnt/tmp# ll
total 128
-rwx------  1 root root  1792 Apr 20  2010 CATALOGS
drwxr-xr-x  2 root root  4096 Dec 13 15:55 CONFIGS
-rw-r--r--  1 root root  4096 Apr 20  2010 health_check
-rw-r--r--  1 root root 56064 Dec  3 18:26 last_rcvd
drwxrwxrwx  2 root root  4096 Apr 20  2010 LOGS
drwx------  2 root root 16384 Apr 20  2010 lost+found
-rw-r--r--  1 root root   448 Sep 10 12:17 lov_objid
drwxrwxrwx  2 root root 20480 Dec 13 23:57 OBJECTS
drwxrwxrwx  2 root root 12288 Dec 15 11:31 PENDING
drwxr-xr-x 19 root root  4096 Dec  6 06:28 ROOT

r...@lmd01 /mnt/tmp# ll CONFIGS
total 76
-rw-r--r-- 1 root root 12288 May 21  2010 mountdata
-rw-r--r-- 1 root root 61944 Dec 13 15:55 umt3-MDT0000

bob

On 12/15/2010 1:33 PM, Bob Ball wrote:
> And, the hole gets deeper. I was digging in the list archives, and in
> the manual, and decided to look at what was stored in the file systems
> using "tunefs.lustre --print".
>
> The mgs machine is fine:
> [mgs:~]# tunefs.lustre --print /dev/sdb
> checking for existing Lustre data: found CONFIGS/mountdata
> Reading CONFIGS/mountdata
>
> Read previous values:
> Target: MGS
> ...
>
> Individual OSS and their OST are fine:
> [r...@umfs05 ~]# tunefs.lustre --print /dev/sdf
> checking for existing Lustre data: found CONFIGS/mountdata
> Reading CONFIGS/mountdata
>
> Read previous values:
> Target: umt3-OST0000
> ...
>
> But, on the MDT, not so fine:
> r...@lmd01 ~# tunefs.lustre --print /dev/sde
> checking for existing Lustre data: not found
>
> tunefs.lustre FATAL: Device /dev/sde has not been formatted with mkfs.lustre
> tunefs.lustre: exiting with 19 (No such device)
>
> This is, of course, not true. The partition was once upon a time
> formatted this way, but somehow, over time and untrackable operations,
> this history was lost. So, before I can begin to deal with the issue
> below, it seems this little issue needs to be addressed. However, I
> have no idea where to begin with this. As we have 2 MDT machines,
> originally set up as HA fail-overs, I guess it is possible this would
> work fine if the MDT were mounted on its twin at 10.10.1.49 instead of
> on this machine at 10.10.1.48?
>
> Can someone suggest a workable path to resolve this? I have not (yet)
> taken the MDT offline to remount as ldiskfs and look at details.
>
> bob
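One low-impact way to look at those details is against a snapshot of the MDT device rather than the live target, either by mounting the snapshot read-only as ldiskfs (which is what the listing at the top of this message shows) or by dumping CONFIGS/mountdata with debugfs. A minimal sketch, assuming the snapshot appears as /dev/sdX (the device name here is illustrative) and that the ldiskfs module is loaded:

  # mount a snapshot of the MDT device read-only as ldiskfs and check
  # that CONFIGS/mountdata is present and non-empty
  mount -t ldiskfs -o ro /dev/sdX /mnt/tmp
  ls -l /mnt/tmp/CONFIGS

  # or pull the file out without mounting at all, using debugfs from
  # e2fsprogs in read-only (catastrophic) mode
  debugfs -c -R "dump CONFIGS/mountdata /tmp/mountdata" /dev/sdX
  strings /tmp/mountdata | head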
> On 12/14/2010 5:12 PM, Kevin Van Maren wrote:
>> The clients (and servers) get the list of NIDs for each mdt/ost device
>> from the MGS at mount time.
>>
>> Having the clients fail to connect to 10.10.1.49 is _expected_ when
>> the service is failed over to 10.10.1.48. However, they should succeed
>> in connecting to 10.10.1.48 and then you should no longer get
>> complaints about 10.10.1.49.
>>
>> If the clients are not failing over to 10.10.1.48, then you might not
>> have the failover NID properly specified to allow failover. Are you
>> sure you properly specified the failover parameters during mkfs on the
>> MDT and did the first mount from the correct machine?
>>
>> If the NIDs are wrong, it is possible to correct it using --writeconf.
>> See the manual (or search the list archives).
>>
>> Kevin
>>
>>
>> Bob Ball wrote:
>>> OK, so, we rebooted 10.10.1.49 into a different, non-lustre kernel.
>>> Then, to be as certain as I could be that the client did not know
>>> about 10.10.1.49, I rebooted it as well. After it was fully up (with
>>> the lustre file system mount in /etc/fstab) I umounted it, then
>>> mounted again as below. And, the message still came back that it was
>>> trying to contact 10.10.1.49 instead of 10.10.1.48 as it should. To
>>> repeat, the dmesg is logging:
>>>
>>> Lustre: mgc10.10.1....@tcp: Reactivating import
>>> Lustre: 10523:0:(obd_mount.c:1786:lustre_check_exclusion()) Excluding
>>> umt3-OST0019 (on exclusion list)
>>> Lustre: 5936:0:(client.c:1476:ptlrpc_expire_one_request()) @@@
>>> Request x1355139761832543 sent from umt3-MDT0000-mdc-ffff81062c82c400
>>> to NID 10.10.1...@tcp 0s ago has failed due to network error (5s
>>> prior to deadline).
>>> r...@ffff81060e4ebc00 x1355139761832543/t0
>>> o38->[email protected]@tcp:12/10 lens 368/584 e 0 to 1 dl
>>> 1292362202 ref 1 fl Rpc:N/0/0 rc 0/0
>>> Lustre: Client umt3-client has started
>>>
>>> I guess I need to know why, in the world, is this client still trying
>>> to access 10.10.1.49? Is there something, perhaps, on the MGC
>>> machine that is causing this mis-direct? What? And, most
>>> importantly, how do I fix this?
>>>
>>> bob
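Since (as Kevin notes above) the client only learns these NIDs from the MGS at mount time, it may help to look at what the MDC import on the client actually contains, and at what failover parameters are recorded on the targets that tunefs.lustre can still read. If the recorded NIDs are wrong, the fix Kevin points to is regenerating the configuration logs with --writeconf. A rough sketch, assuming Lustre 1.8 proc paths; the device names are illustrative, and the --writeconf step presumes tunefs.lustre can read the target, which for this MDT is the separate problem above:

  # on the client: which NIDs does the MDC import actually reference?
  cat /proc/fs/lustre/mdc/umt3-MDT0000-mdc-*/import

  # on a server: check the Parameters: line for failover/failnode entries
  tunefs.lustre --print /dev/sdf

  # the manual's procedure for wrong NIDs: unmount all clients and targets,
  # regenerate the config logs on every server target (a missing failover
  # NID can be added at the same time with --failnode=nid), then remount
  # MGS first, then the MDT, then the OSTs
  tunefs.lustre --writeconf /dev/sdX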
>>> On 12/14/2010 3:05 PM, Bob Ball wrote:
>>>> Well, you are absolutely right, it is a timeout talking to what it
>>>> THINKS is the MDT. The thing is, it is NOT!
>>>>
>>>> We were set up for HA for the MDT, with 10.10.1.48 and 10.10.1.49
>>>> watching and talking to one another. The RedHat service was
>>>> problematic, so right now 10.10.1.48 is the MDT, and has /mnt/mdt
>>>> mounted, and 10.10.1.49 is being used to do backups, and has
>>>> /mnt/mdt_snapshot mounted. The actual volume is an iSCSI location.
>>>>
>>>> So, somehow, the client node has found and is talking to the wrong
>>>> host! Not good. Scary. Got to do something about this.....
>>>>
>>>> Suggestions appreciated....
>>>>
>>>> bob
>>>>
>>>> On 12/14/2010 11:57 AM, Andreas Dilger wrote:
>>>>> The error message shows a timeout connecting to umt3-MDT0000 and
>>>>> not the OST. The operation 38 is MDS_CONNECT, AFAIK.
>>>>>
>>>>> Cheers, Andreas
>>>>>
>>>>> On 2010-12-14, at 9:19, Bob Ball<[email protected]> wrote:
>>>>>
>>>>>> I am trying to get a lustre client to mount the service, but with
>>>>>> one or more OST disabled. This does not appear to be working.
>>>>>> Lustre version is 1.8.4.
>>>>>>
>>>>>> mount -o localflock,exclude=umt3-OST0019 -t lustre
>>>>>> 10.10.1....@tcp0:/umt3 /lustre/umt3
>>>>>>
>>>>>> dmesg on this client shows the following during the umount/mount
>>>>>> sequence:
>>>>>>
>>>>>> Lustre: client ffff810c25c03800 umount complete
>>>>>> Lustre: Skipped 1 previous similar message
>>>>>> Lustre: mgc10.10.1....@tcp: Reactivating import
>>>>>> Lustre: 450250:0:(obd_mount.c:1786:lustre_check_exclusion()) Excluding
>>>>>> umt3-OST0019 (on exclusion list)
>>>>>> Lustre: 450250:0:(obd_mount.c:1786:lustre_check_exclusion()) Skipped 1
>>>>>> previous similar message
>>>>>> Lustre: 5942:0:(client.c:1476:ptlrpc_expire_one_request()) @@@ Request
>>>>>> x1354682302740498 sent from umt3-MDT0000-mdc-ffff810628209000 to NID
>>>>>> 10.10.1...@tcp 0s ago has failed due to network error (5s prior to
>>>>>> deadline).
>>>>>> r...@ffff810620e66400 x1354682302740498/t0
>>>>>> o38->[email protected]@tcp:12/10 lens 368/584 e 0 to 1 dl
>>>>>> 1292342239 ref 1 fl Rpc:N/0/0 rc 0/0
>>>>>> Lustre: 5942:0:(client.c:1476:ptlrpc_expire_one_request()) Skipped 1
>>>>>> previous similar message
>>>>>> Lustre: Client umt3-client has started
>>>>>>
>>>>>> When I check following the mount, using "lctl dl", I see the
>>>>>> following, and it is clear that the OST is active as far as this
>>>>>> client is concerned.
>>>>>>
>>>>>>  19 UP osc umt3-OST0018-osc-ffff810628209000
>>>>>>     05b29472-d125-c36e-c023-e0eb76aaf353 5
>>>>>>  20 UP osc umt3-OST0019-osc-ffff810628209000
>>>>>>     05b29472-d125-c36e-c023-e0eb76aaf353 5
>>>>>>  21 UP osc umt3-OST001a-osc-ffff810628209000
>>>>>>     05b29472-d125-c36e-c023-e0eb76aaf353 5
>>>>>>
>>>>>> Two questions here. The first, obviously, is what is wrong with this
>>>>>> picture? Why can't I exclude this OST from activity on this client?
>>>>>> Is it because the OSS serving that OST still has the OST active? If
>>>>>> the OST were deactivated or otherwise unavailable on the OSS, would
>>>>>> the client mount then succeed in excluding this OST? (OK, more than
>>>>>> one question in the group....)
>>>>>>
>>>>>> Second group, what is the correct syntax for excluding more than one
>>>>>> OST? Is it a comma-separated list of exclusions, or are separate
>>>>>> excludes required?
>>>>>>
>>>>>> mount -o localflock,exclude=umt3-OST0019,umt3-OST0020 -t lustre
>>>>>> 10.10.1....@tcp0:/umt3 /lustre/umt3
>>>>>> or
>>>>>> mount -o localflock,exclude=umt3-OST0019,exclude=umt3-OST0020 -t
>>>>>> lustre 10.10.1....@tcp0:/umt3 /lustre/umt3
>>>>>>
>>>>>> Thanks,
>>>>>> bob
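For reference, the mount.lustre(8) man page from this era describes exclude= as taking a colon-separated list of OST names (worth verifying against the installed man page), and an individual OSC can also be switched off by hand on a client that is already mounted. A rough sketch against Lustre 1.8, reusing the OST names from this thread and device number 20 from the "lctl dl" listing above; treat the parameter names as something to confirm on your version:

  # exclude more than one OST at mount time (colon-separated list)
  mount -o localflock,exclude=umt3-OST0019:umt3-OST0020 -t lustre \
        10.10.1....@tcp0:/umt3 /lustre/umt3

  # on an already-mounted client: check whether the OSC is really active,
  # and deactivate it by its "lctl dl" device number
  lctl get_param osc.umt3-OST0019-osc-*.active
  lctl --device 20 deactivate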
