Hi Lustre Community,

I'm asking for help with a severe Lustre OST issue after a backend disk array 
failure.



The storage array failure took 4 of our 6 OSTs offline.
After e2fsck repair, 5 OSTs have recovered and work normally.
Only dybfs2-OST0003 (device /dev/mapper/mpathd) fails to mount, with error -17 
(File exists):

May 11 14:54:40 dybfs16 kernel: LustreError: 
3663:0:(osd_oi.c:762:osd_oi_insert()) dm-2: the FID [0x200000003:0x3:0x0] is 
used by two objects: 3080193/2625393380 35842/449209919

May 11 14:54:40 dybfs16 kernel: LustreError: 
3663:0:(obd_config.c:559:class_setup()) setup dybfs2-OST0003 failed (-17)

May 11 14:54:40 dybfs16 kernel: LustreError: 
3663:0:(obd_config.c:1835:class_config_llog_handler()) MGC192.168.50.23@tcp: 
cfg command failed: rc = -17

May 11 14:54:40 dybfs16 kernel: Lustre:    cmd=cf003 0:dybfs2-OST0003  1:dev  
2:0  3:f  

May 11 14:54:40 dybfs16 kernel: LustreError: 15c-8: MGC192.168.50.23@tcp: The 
configuration from log 'dybfs2-OST0003' failed (-17). This may be the result of 
communication errors between this node and the MGS, a bad configuration, or 
other errors. See the syslog for more information.

May 11 14:54:40 dybfs16 kernel: LustreError: 
3526:0:(obd_mount_server.c:1397:server_start_targets()) failed to start server 
dybfs2-OST0003: -17

May 11 14:54:40 dybfs16 kernel: LustreError: 
3526:0:(obd_mount_server.c:1992:server_fill_super()) Unable to start targets: 
-17

May 11 14:54:40 dybfs16 kernel: LustreError: 
3526:0:(obd_config.c:610:class_cleanup()) Device 3 not setup

May 11 14:54:40 dybfs16 kernel: Lustre: server umount dybfs2-OST0003 complete

May 11 14:54:40 dybfs16 kernel: LustreError: 
3526:0:(obd_mount.c:1608:lustre_fill_super()) Unable to mount 
/dev/mapper/mpathd (-17)



Sample kernel errors:

May 11 12:23:16 dybfs16 kernel: LustreError: 
4127:0:(osd_oi.c:762:osd_oi_insert()) dm-3: the FID [0x200000003:0x3:0x0] is 
used by two objects: 3080193/2625393380 86/3624058548
May 11 12:36:30 dybfs16 kernel: LustreError: 
3847:0:(osd_oi.c:762:osd_oi_insert()) dm-3: the FID [0x200000003:0x3:0x0] is 
used by two objects: 3080193/2625393380 87/358952283
May 11 12:54:56 dybfs16 kernel: LustreError: 
3479:0:(osd_oi.c:762:osd_oi_insert()) dm-2: the FID [0x200000003:0x3:0x0] is 
used by two objects: 3080193/2625393380 94/3669235547
May 11 13:09:27 dybfs16 kernel: LustreError: 
3846:0:(osd_oi.c:762:osd_oi_insert()) dm-3: the FID [0x200000003:0x3:0x0] is 
used by two objects: 3080193/2625393380 33537/2387867276
May 11 13:22:03 dybfs16 kernel: LustreError: 
4178:0:(osd_oi.c:762:osd_oi_insert()) dm-3: the FID [0x200000003:0x3:0x0] is 
used by two objects: 3080193/2625393380 33538/541222096
May 11 13:29:56 dybfs16 kernel: LustreError: 
3479:0:(osd_oi.c:762:osd_oi_insert()) dm-3: the FID [0x200000003:0x3:0x0] is 
used by two objects: 3080193/2625393380 33666/126694764
May 11 13:45:06 dybfs16 kernel: LustreError: 
4172:0:(osd_oi.c:762:osd_oi_insert()) dm-3: the FID [0x200000003:0x3:0x0] is 
used by two objects: 3080193/2625393380 33667/133493056
May 11 14:03:50 dybfs16 kernel: LustreError: 
3556:0:(osd_oi.c:762:osd_oi_insert()) dm-3: the FID [0x200000003:0x3:0x0] is 
used by two objects: 3080193/2625393380 33668/417086394
May 11 14:14:43 dybfs16 kernel: LustreError: 
3562:0:(osd_oi.c:762:osd_oi_insert()) dm-4: the FID [0x200000003:0x3:0x0] is 
used by two objects: 3080193/2625393380 33669/3676336670
May 11 14:36:10 dybfs16 kernel: LustreError: 
4572:0:(osd_oi.c:762:osd_oi_insert()) dm-4: the FID [0x200000003:0x3:0x0] is 
used by two objects: 3080193/2625393380 35841/3333636550


May 11 14:54:40 dybfs16 kernel: LustreError: 
3663:0:(osd_oi.c:762:osd_oi_insert()) dm-2: the FID [0x200000003:0x3:0x0] is 
used by two objects: 3080193/2625393380 35842/449209919
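
To summarize messages like these, here is a minimal Python sketch that pulls the FID and the inode/generation pairs out of the osd_oi_insert() lines. It assumes the message format exactly as quoted above; the function name parse_oi_conflicts and the regular expression are illustrative and not part of any Lustre tool.

```python
import re
from collections import defaultdict

# Matches the osd_oi_insert() message once whitespace has been collapsed.
OI_CONFLICT = re.compile(
    r"osd_oi_insert\(\)\)\s*(?P<dev>\S+): the FID (?P<fid>\[[^\]]+\]) is "
    r"used by two objects: (?P<ino1>\d+)/(?P<gen1>\d+) (?P<ino2>\d+)/(?P<gen2>\d+)"
)


def parse_oi_conflicts(log_text):
    """Return {fid: set of (inode, generation)} found in osd_oi_insert errors."""
    # Undo line wrapping so each message sits on one logical line.
    flat = " ".join(log_text.split())
    conflicts = defaultdict(set)
    for m in OI_CONFLICT.finditer(flat):
        conflicts[m["fid"]].add((int(m["ino1"]), int(m["gen1"])))
        conflicts[m["fid"]].add((int(m["ino2"]), int(m["gen2"])))
    return dict(conflicts)


if __name__ == "__main__":
    sample = """May 11 12:23:16 dybfs16 kernel: LustreError:
4127:0:(osd_oi.c:762:osd_oi_insert()) dm-3: the FID [0x200000003:0x3:0x0] is
used by two objects: 3080193/2625393380 86/3624058548
May 11 12:36:30 dybfs16 kernel: LustreError:
3847:0:(osd_oi.c:762:osd_oi_insert()) dm-3: the FID [0x200000003:0x3:0x0] is
used by two objects: 3080193/2625393380 87/358952283"""
    for fid, objects in parse_oi_conflicts(sample).items():
        print(fid, sorted(objects))
```

In the samples above, every message names the same FID and the same first object (3080193/2625393380), while the second object differs from one attempt to the next.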

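What I have tried so far:

Used debugfs to locate the conflicting inodes and unlinked the duplicate 
objects (inodes 86, 87, ..., 35842).
Ran e2fsck -fy /dev/mapper/mpathd multiple times; it moved the affected 
objects to lost+found.
Rebooted the node and restarted the Lustre stack.

The mount still fails, and each attempt reports a different second inode for 
the same FID (see the log samples above).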
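
For anyone who wants to inspect the conflicting inodes without modifying the device, the following is a minimal Python sketch that generates a debugfs command file. It is an illustration only, not a record of the exact commands that were run; the function name debugfs_inspect_script is hypothetical. It relies only on the standard debugfs requests stat and ncheck.

```python
def debugfs_inspect_script(inodes):
    """Build read-only debugfs requests for the given inode numbers."""
    unique = sorted(set(inodes))
    if not unique:
        return ""
    # One "stat <N>" request per inode to dump its metadata.
    lines = [f"stat <{ino}>" for ino in unique]
    # A single ncheck request maps all inode numbers back to path names.
    lines.append("ncheck " + " ".join(str(ino) for ino in unique))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Inode numbers taken from the kernel messages in this post.
    print(debugfs_inspect_script([3080193, 35842]), end="")
```

The output could be saved to a file (say cmds.txt, a hypothetical name) and passed to debugfs with -c, which opens the filesystem read-only in catastrophic mode, and -f, which reads requests from a file: debugfs -c -f cmds.txt /dev/mapper/mpathd.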


Current Lustre device list (lctl dl):
  0 UP osd-ldiskfs dybfs2-OST0003-osd dybfs2-OST0003-osd_UUID 3
  1 UP osd-ldiskfs dybfs2-OST0000-osd dybfs2-OST0000-osd_UUID 4
  2 UP ost OSS OSS_uuid 2
  3 UP mgc MGC192.168.50.23@tcp 530c1b9f-3835-9edb-6b97-f9b3b7edca49 4
  4 UP obdfilter dybfs2-OST0000 dybfs2-OST0000_UUID 140
  5 UP lwp dybfs2-MDT0000-lwp-OST0000 dybfs2-MDT0000-lwp-OST0000_UUID 4
  6 UP osd-ldiskfs dybfs2-OST0001-osd dybfs2-OST0001-osd_UUID 4
  7 UP obdfilter dybfs2-OST0001 dybfs2-OST0001_UUID 214
  8 UP lwp dybfs2-MDT0000-lwp-OST0001 dybfs2-MDT0000-lwp-OST0001_UUID 4
  9 UP osd-ldiskfs dybfs2-OST0002-osd dybfs2-OST0002-osd_UUID 4
 10 UP obdfilter dybfs2-OST0002 dybfs2-OST0002_UUID 216
 11 UP lwp dybfs2-MDT0000-lwp-OST0002 dybfs2-MDT0000-lwp-OST0002_UUID 4
 12 UP osd-ldiskfs dybfs2-OST0004-osd dybfs2-OST0004-osd_UUID 4
 13 UP obdfilter dybfs2-OST0004 dybfs2-OST0004_UUID 384
 14 UP lwp dybfs2-MDT0000-lwp-OST0004 dybfs2-MDT0000-lwp-OST0004_UUID 4
 15 UP osd-ldiskfs dybfs2-OST0005-osd dybfs2-OST0005-osd_UUID 4
 16 UP obdfilter dybfs2-OST0005 dybfs2-OST0005_UUID 232
 17 UP lwp dybfs2-MDT0000-lwp-OST0005 dybfs2-MDT0000-lwp-OST0005_UUID 4

All other OSTs are loaded and running normally; only OST0003 cannot complete 
setup and mount.
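
The gap in the device list can also be checked mechanically. The following Python sketch assumes the column layout of the lctl dl output shown above (index, state, type, name, UUID, reference count) and that each OST's osd-ldiskfs device is named <target>-osd; the function name osts_missing_obdfilter is illustrative.

```python
def osts_missing_obdfilter(lctl_dl_output):
    """Return targets that have an osd-ldiskfs device but no obdfilter device."""
    osd_targets, filter_targets = set(), set()
    for line in lctl_dl_output.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        dev_type, name = fields[2], fields[3]
        if dev_type == "osd-ldiskfs" and name.endswith("-osd"):
            # Strip the "-osd" suffix to get the target name.
            osd_targets.add(name[: -len("-osd")])
        elif dev_type == "obdfilter":
            filter_targets.add(name)
    return sorted(osd_targets - filter_targets)


if __name__ == "__main__":
    sample = """  0 UP osd-ldiskfs dybfs2-OST0003-osd dybfs2-OST0003-osd_UUID 3
  1 UP osd-ldiskfs dybfs2-OST0000-osd dybfs2-OST0000-osd_UUID 4
  4 UP obdfilter dybfs2-OST0000 dybfs2-OST0000_UUID 140"""
    print(osts_missing_obdfilter(sample))
```

Applied to the full listing above, the only target with an osd-ldiskfs device and no obdfilter device is dybfs2-OST0003.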

Questions

1. The OSD OI index of OST0003 seems severely corrupted: the same FID maps to 
   hundreds of inodes. Is there a safe way to rebuild or reset the OI index in 
   Lustre 2.12.5 without reformatting the whole OST?
2. Is there an official tool or procedure for cleaning up a large number of 
   duplicate FID conflicts in one batch?
3. If in-place repair is impossible, what is the safest step-by-step procedure 
   to recreate this OST and re-add it to the existing filesystem with minimal 
   impact on current data?

Any guidance or shared experience would be greatly appreciated.


Environment
Lustre version: 2.12.5
OS & kernel: 3.10.0-1127.8.2.el7_lustre.x86_64
Backend filesystem: ldiskfs


Best regards,
Qiuling YAO
_______________________________________________
lustre-discuss mailing list
[email protected]
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
