Re: [Lustre-discuss] 1.6.6 to 1.8.3 upgrade, OSS with wrong "Target" value

Andreas Dilger Fri, 16 Jul 2010 08:48:05 -0700

The use of ext3 or ext4 and the filesystem feature flags has nothing to do with 
the setting of the incorrect target. I don't know how you got to that state, 
but there are a number of places where the OST index is stored that need to be 
verified and fixed.


There is the mountdata file, which you have already found. There is the 
filesystem label, which you can view and change with the e2label command. 
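
As a rough way to spot the kind of mismatch seen in this thread (a Target of "lustre1-OST0001" alongside "Index: 0"), the sketch below parses one section of tunefs.lustre output and compares the index encoded in the target name with the Index line. It assumes the "Target:" and "Index:" lines look like the ones quoted in this thread and that the target name ends in four hex digits; the function names are made up for illustration and are not part of any Lustre tool.

```python
import re


def parse_tunefs_section(text):
    """Pull the Target and Index values out of one block of tunefs.lustre output.

    Assumes lines shaped like 'Target:     lustre1-OST0001' and 'Index:      0',
    as quoted in this thread.
    """
    target = re.search(r"^\s*Target:\s+(\S+)", text, re.MULTILINE)
    index = re.search(r"^\s*Index:\s+(\d+)", text, re.MULTILINE)
    if target is None or index is None:
        raise ValueError("no Target/Index lines found")
    return {"target": target.group(1), "index": int(index.group(1))}


def target_index(target):
    """Return the index encoded in a name such as 'lustre1-OST0001'.

    Assumption: the name ends in OST or MDT followed by four hex digits.
    """
    match = re.fullmatch(r".+-(?:OST|MDT)([0-9a-fA-F]{4})", target)
    if match is None:
        raise ValueError("unexpected target name: %r" % target)
    return int(match.group(1), 16)


def is_consistent(section):
    """True when the index in the target name equals the Index field."""
    return target_index(section["target"]) == section["index"]


# The "Read previous values" block reported for puppy7 after the manual edit.
sample = """\
Target:     lustre1-OST0001

Index:      0
Lustre FS:  lustre1
"""

section = parse_tunefs_section(sample)
print(section["target"], section["index"], is_consistent(section))
```

A check like this only compares what tunefs.lustre prints; it says nothing about the filesystem label or the last_rcvd file, which have to be inspected separately.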

There is also the last_rcvd file that has the OST UUID at the start and the OST 
index as one of its fields. Normally I would just say to delete this file, 
since it can be recreated at mount time, but since the OST already has an 
identity crisis I'm not sure it would get it right. 

You should fire up a binary editor to change the UUID in last_rcvd and look at 
the lsd_index field in struct lustre_server_data (which is what is stored at 
the beginning of the last_rcvd file). 
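
Before reaching for a binary editor, it may help to see where the old name actually occurs in a copy of the file. The sketch below is only an illustration: it searches a byte buffer for a string and does a same-length replacement, so no later field shifts. It deliberately makes no assumption about the layout of struct lustre_server_data, and it does not touch lsd_index, which is a binary field and has to be located from the struct definition in the Lustre source. The helper names and the file paths passed to patch_copy are hypothetical, and the demo buffer is made up, not a real last_rcvd image.

```python
def find_all(data, needle):
    """Return every byte offset at which needle occurs in data."""
    offsets = []
    pos = data.find(needle)
    while pos != -1:
        offsets.append(pos)
        pos = data.find(needle, pos + 1)
    return offsets


def replace_same_length(data, old, new):
    """Replace old with new, refusing any change that alters the length."""
    if len(old) != len(new):
        raise ValueError("replacement must keep the byte length")
    return data.replace(old, new)


def patch_copy(src_path, dst_path, old, new):
    """Write a patched copy of src_path to dst_path; the original is not modified.

    Both paths are hypothetical; work on a copy taken from the unmounted
    device, never on the live file.
    """
    with open(src_path, "rb") as src:
        data = src.read()
    with open(dst_path, "wb") as dst:
        dst.write(replace_same_length(data, old, new))
    return find_all(data, old)


# Made-up buffer for demonstration only: a name, NUL padding, 4 arbitrary bytes.
fake = b"lustre1-OST0000" + b"\0" * 25 + b"\x01\x00\x00\x00"

print(find_all(fake, b"lustre1-OST0000"))
patched = replace_same_length(fake, b"lustre1-OST0000", b"lustre1-OST0001")
print(len(patched) == len(fake))
```

Comparing an "od -c" dump of the copy before and after, as Roger did for mountdata, is a reasonable way to confirm that only the intended bytes changed.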

Cheers, Andreas

On 2010-07-16, at 7:43, Roger Sersted <[email protected]> wrote:

> 
> I didn't find the hack anywhere.  I looked at what those files contained and 
> decided to "hack and slash".  Apparently, those files are generated from data 
> within the filesystem itself.  A second running of writeconf displayed 
> the target value to be "lustre1-OST0000", which is what I didn't want. :-(
> 
> 
> Roger S.
> 
> Wojciech Turek wrote:
>> Hi Roger
>> 
>> Where did you find this CONFIG hack?
>> Did you make a copy of the CONFIG dir before you followed these steps?
>> 
>> 
>> 
>> On 15 July 2010 20:02, Roger Sersted <[email protected]> wrote:
>> 
>> 
>>    I am using the ext4 RPMs.  I ran the following commands on the MDS
>>    and OSS nodes (lustre was not running at the time):
>> 
>> 
>>           tune2fs -O extents,uninit_bg,dir_index /dev/XXX
>>           fsck -pf /dev/XXX
>> 
>>    I then started Lustre "mount -t lustre /dev/XXX /lustre" on the
>>    OSSes and then the MDS.  The problem still persisted. I then
>>    shut down Lustre by unmounting the Lustre filesystems on the MDS/OSS
>>    nodes.
>> 
>>    My last and most desperate step was to "hack" the CONFIG files.  On
>>    puppy7, I did the following:
>> 
>>           1. mount -t ldiskfs /dev/sdc /mnt
>>           2. cd /mnt/CONFIG
>>           3. mv lustre1-OST0000 lustre1-OST0001
>>           4. vim -nb lustre1-OST0001 mountdata
>>           5. I changed OST0000 to OST0001.
>>           6. I verified my changes by comparing an "od -c" of before and after.
>>           7. umount /mnt
>>           8. tunefs.lustre -writeconf /dev/sdc
>> 
>>    The output of step 8 is:
>> 
>>     tunefs.lustre -writeconf /dev/sdc
>> 
>>    checking for existing Lustre data: found CONFIGS/mountdata
>>    Reading CONFIGS/mountdata
>> 
>>      Read previous values:
>>    Target:     lustre1-OST0001
>> 
>>    Index:      0
>>    Lustre FS:  lustre1
>>    Mount type: ldiskfs
>>    Flags:      0x102
>>                 (OST writeconf )
>> 
>>    Persistent mount opts: errors=remount-ro,extents,mballoc
>>    Parameters: mgsnode=172.17....@o2ib
>> 
>> 
>>      Permanent disk data:
>>    Target:     lustre1-OST0000
>>    Index:      0
>>    Lustre FS:  lustre1
>>    Mount type: ldiskfs
>>    Flags:      0x102
>>                 (OST writeconf )
>> 
>>    Persistent mount opts: errors=remount-ro,extents,mballoc
>>    Parameters: mgsnode=172.17....@o2ib
>> 
>>    Writing CONFIGS/mountdata
>> 
>>    Now part of the system seems to have the correct Target value.
>> 
>>    Thanks for your time on this.
>> 
>>    Roger S.
>> 
>>    Wojciech Turek wrote:
>> 
>>        Hi Roger,
>> 
>>        Lustre 1.8.3 for RHEL5 has two sets of RPMs: one set for the old
>>        style ext3 based ldiskfs and one set for the ext4 based ldiskfs.
>>        When upgrading from 1.6.6 to 1.8.3 I think you should not try to
>>        use the ext4 based packages. Can you let us know which RPMs you
>>        have used?
>> 
>> 
>> 
>>        On 15 July 2010 16:14, Roger Sersted <[email protected]> wrote:
>> 
>> 
>> 
>>           Wojciech Turek wrote:
>> 
>>               can you also please post output of 'rpm -qa | grep lustre'
>>               run on puppy5-7?
>> 
>> 
>> 
>>           [r...@puppy5 log]# rpm -qa |grep -i lustre
>>           kernel-2.6.18-164.11.1.el5_lustre.1.8.3
>>           lustre-1.8.3-2.6.18_164.11.1.el5_lustre.1.8.3
>>           lustre-ldiskfs-3.0.9-2.6.18_164.11.1.el5_lustre.1.8.3
>>           mft-2.6.0-2.6.18_164.11.1.el5_lustre.1.8.3
>>           lustre-modules-1.8.3-2.6.18_164.11.1.el5_lustre.1.8.3
>> 
>>           [r...@puppy6 log]# rpm -qa | grep -i lustre
>>           kernel-2.6.18-164.11.1.el5_lustre.1.8.3
>>           lustre-1.8.3-2.6.18_164.11.1.el5_lustre.1.8.3
>>           lustre-ldiskfs-3.0.9-2.6.18_164.11.1.el5_lustre.1.8.3
>>           mft-2.6.0-2.6.18_164.11.1.el5_lustre.1.8.3
>>           lustre-modules-1.8.3-2.6.18_164.11.1.el5_lustre.1.8.3
>> 
>>           [r...@puppy7 CONFIGS]# rpm -qa | grep -i lustre
>>           kernel-2.6.18-164.11.1.el5_lustre.1.8.3
>>           lustre-1.8.3-2.6.18_164.11.1.el5_lustre.1.8.3
>>           lustre-ldiskfs-3.0.9-2.6.18_164.11.1.el5_lustre.1.8.3
>>           mft-2.6.0-2.6.18_164.11.1.el5_lustre.1.8.3
>>           lustre-modules-1.8.3-2.6.18_164.11.1.el5_lustre.1.8.3
>> 
>>           Thanks,
>> 
>>           Roger S.
>> 
>> 
>>               On 15 July 2010 15:55, Roger Sersted <[email protected]> wrote:
>> 
>> 
>>                  OK.  This looks bad.  It appears that I should have
>>                  upgraded ext3 to ext4, I found instructions for that,
>> 
>>                         tune2fs -O extents,uninit_bg,dir_index /dev/XXX
>>                         fsck -pf /dev/XXX
>> 
>>                  Is the above correct?  I'd like to move our systems to
>>                  ext4. I didn't know those steps were necessary.
>> 
>>                  Other answers listed below.
>> 
>> 
>>                  Wojciech Turek wrote:
>> 
>>                      Hi Roger,
>> 
>>                      Sorry for the delay. From the ldiskfs messages it
>>                      seems to me that you are using ext4 ldiskfs
>>                      (Jun 26 17:54:30 puppy7 kernel: ldiskfs created from
>>                      ext4-2.6-rhel5).
>>                      If you are upgrading from 1.6.6, your ldiskfs is ext3
>>                      based, so I think that in lustre-1.8.3 you should use
>>                      the ext3 based ldiskfs rpm.
>> 
>>                      Can you also tell us a bit more about your setup?
>>                      From what you wrote so far I understand you have 2 OSS
>>                      servers and each server has one OST device. In addition
>>                      to that you have a third server which acts as a
>>                      MGS/MDS, is that right?
>> 
>>                      The logs you provided seem to be only from one server
>>                      called puppy7 so it does not give a whole picture of
>>                      the situation. The timeout messages may indicate a
>>                      problem with communication between the servers but it
>>                      is really difficult to say without seeing the whole
>>                      picture or at least more elements of it.
>> 
>>                      To check if you have correct rpms installed can you
>>                      please run 'rpm -qa | grep lustre' on both OSS servers
>>                      and the MDS?
>> 
>>                      Also please provide output from command 'lctl
>>                      list_nids' run on both OSS servers, MDS and a client?
>> 
>> 
>>                  puppy5 (MDS/MGS)
>> 
>>                  172.17....@o2ib
>>                  172.16....@tcp
>> 
>>                  puppy6 (OSS)
>>                  172.17....@o2ib
>>                  172.16....@tcp
>> 
>>                  puppy7 (OSS)
>>                  172.17....@o2ib
>>                  172.16....@tcp
>> 
>> 
>> 
>> 
>>                      In addition to the above, please run the following
>>                      command on all lustre targets (OSTs and MDT) to
>>                      display your current lustre configuration
>> 
>>                       tunefs.lustre --dryrun --print /dev/<ost_device>
>> 
>> 
>>                  puppy5 (MDS/MGS)
>>                    Read previous values:
>>                  Target:     lustre1-MDT0000
>>                  Index:      0
>>                  Lustre FS:  lustre1
>>                  Mount type: ldiskfs
>>                  Flags:      0x405
>>                               (MDT MGS )
>>                  Persistent mount opts: errors=remount-ro,iopen_nopriv,user_xattr
>>                  Parameters: lov.stripesize=125K lov.stripecount=2
>>                  mdt.group_upcall=/usr/sbin/l_getgroups mdt.group_upcall=NONE
>>                  mdt.group_upcall=NONE
>> 
>> 
>>                    Permanent disk data:
>>                  Target:     lustre1-MDT0000
>>                  Index:      0
>>                  Lustre FS:  lustre1
>>                  Mount type: ldiskfs
>>                  Flags:      0x405
>>                               (MDT MGS )
>>                  Persistent mount opts: errors=remount-ro,iopen_nopriv,user_xattr
>>                  Parameters: lov.stripesize=125K lov.stripecount=2
>>                  mdt.group_upcall=/usr/sbin/l_getgroups mdt.group_upcall=NONE
>>                  mdt.group_upcall=NONE
>> 
>>                  exiting before disk write.
>>                  ----------------------------------------------------
>>                  puppy6
>>                  checking for existing Lustre data: found CONFIGS/mountdata
>>                  Reading CONFIGS/mountdata
>> 
>>                    Read previous values:
>>                  Target:     lustre1-OST0000
>>                  Index:      0
>>                  Lustre FS:  lustre1
>>                  Mount type: ldiskfs
>>                  Flags:      0x2
>>                               (OST )
>>                  Persistent mount opts: errors=remount-ro,extents,mballoc
>>                  Parameters: mgsnode=172.17....@o2ib
>> 
>> 
>>                    Permanent disk data:
>>                  Target:     lustre1-OST0000
>>                  Index:      0
>>                  Lustre FS:  lustre1
>>                  Mount type: ldiskfs
>>                  Flags:      0x2
>>                               (OST )
>>                  Persistent mount opts: errors=remount-ro,extents,mballoc
>>                  Parameters: mgsnode=172.17....@o2ib
>>                  --------------------------------------------------
>>                  puppy7 (this is the broken OSS. The "Target" should be
>>                  "lustre1-OST0001")
>>                  checking for existing Lustre data: found CONFIGS/mountdata
>>                  Reading CONFIGS/mountdata
>> 
>>                    Read previous values:
>>                  Target:     lustre1-OST0000
>>                  Index:      0
>>                  Lustre FS:  lustre1
>>                  Mount type: ldiskfs
>>                  Flags:      0x2
>>                               (OST )
>>                  Persistent mount opts: errors=remount-ro,extents,mballoc
>>                  Parameters: mgsnode=172.17....@o2ib
>> 
>> 
>>                    Permanent disk data:
>>                  Target:     lustre1-OST0000
>>                  Index:      0
>>                  Lustre FS:  lustre1
>>                  Mount type: ldiskfs
>>                  Flags:      0x2
>>                               (OST )
>>                  Persistent mount opts: errors=remount-ro,extents,mballoc
>>                  Parameters: mgsnode=172.17....@o2ib
>> 
>>                  exiting before disk write.
>> 
>> 
>> 
>>                      If possible please attach syslog from each machine
>>                      from the time you mounted lustre targets (OST and MDT).
>> 
>>                      Best regards,
>> 
>>                      Wojciech
>> 
>>                      On 14 July 2010 20:46, Roger Sersted <[email protected]> wrote:
>> 
>> 
>>                         Any additional info?
>> 
>>                         Thanks,
>> 
>>                         Roger S.
>> 
>> 
>> 
>> 
>>                      --         --
>>                      Wojciech Turek
>> 
>> 
>> 
>> 
>> 
>>               --         --
>>               Wojciech Turek
>> 
>>               Assistant System Manager
>>               517
>> 
>> 
>> 
>> 
>> -- 
> _______________________________________________
> Lustre-discuss mailing list
> [email protected]
> http://lists.lustre.org/mailman/listinfo/lustre-discuss