Hi Lustre Community,
I'm asking for help with a severe Lustre OST issue following a backend disk
array failure.
The storage array failure took 4 of our 6 OSTs offline.
After e2fsck repair, 5 of the 6 OSTs are working normally again.
Only dybfs2-OST0003 (device /dev/mapper/mpathd) still fails to mount, with
error -17 (File exists):
May 11 14:54:40 dybfs16 kernel: LustreError:
3663:0:(osd_oi.c:762:osd_oi_insert()) dm-2: the FID [0x200000003:0x3:0x0] is
used by two objects: 3080193/2625393380 35842/449209919
May 11 14:54:40 dybfs16 kernel: LustreError:
3663:0:(obd_config.c:559:class_setup()) setup dybfs2-OST0003 failed (-17)
May 11 14:54:40 dybfs16 kernel: LustreError:
3663:0:(obd_config.c:1835:class_config_llog_handler()) MGC192.168.50.23@tcp:
cfg command failed: rc = -17
May 11 14:54:40 dybfs16 kernel: Lustre: cmd=cf003 0:dybfs2-OST0003 1:dev
2:0 3:f
May 11 14:54:40 dybfs16 kernel: LustreError: 15c-8: MGC192.168.50.23@tcp: The
configuration from log 'dybfs2-OST0003' failed (-17). This may be the result of
communication errors between
this node and the MGS, a bad configuration, or other errors. See the syslog for
more information.
May 11 14:54:40 dybfs16 kernel: LustreError:
3526:0:(obd_mount_server.c:1397:server_start_targets()) failed to start server
dybfs2-OST0003: -17
May 11 14:54:40 dybfs16 kernel: LustreError:
3526:0:(obd_mount_server.c:1992:server_fill_super()) Unable to start targets:
-17
May 11 14:54:40 dybfs16 kernel: LustreError:
3526:0:(obd_config.c:610:class_cleanup()) Device 3 not setup
May 11 14:54:40 dybfs16 kernel: Lustre: server umount dybfs2-OST0003 complete
May 11 14:54:40 dybfs16 kernel: LustreError:
3526:0:(obd_mount.c:1608:lustre_fill_super()) Unable to mount
/dev/mapper/mpathd (-17)
Sample kernel errors:
May 11 12:23:16 dybfs16 kernel: LustreError:
4127:0:(osd_oi.c:762:osd_oi_insert()) dm-3: the FID [0x200000003:0x3:0x0] is
used by two objects: 3080193/2625393380 86/3624058548
May 11 12:36:30 dybfs16 kernel: LustreError:
3847:0:(osd_oi.c:762:osd_oi_insert()) dm-3: the FID [0x200000003:0x3:0x0] is
used by two objects: 3080193/2625393380 87/358952283
May 11 12:54:56 dybfs16 kernel: LustreError:
3479:0:(osd_oi.c:762:osd_oi_insert()) dm-2: the FID [0x200000003:0x3:0x0] is
used by two objects: 3080193/2625393380 94/3669235547
May 11 13:09:27 dybfs16 kernel: LustreError:
3846:0:(osd_oi.c:762:osd_oi_insert()) dm-3: the FID [0x200000003:0x3:0x0] is
used by two objects: 3080193/2625393380 33537/2387867276
May 11 13:22:03 dybfs16 kernel: LustreError:
4178:0:(osd_oi.c:762:osd_oi_insert()) dm-3: the FID [0x200000003:0x3:0x0] is
used by two objects: 3080193/2625393380 33538/541222096
May 11 13:29:56 dybfs16 kernel: LustreError:
3479:0:(osd_oi.c:762:osd_oi_insert()) dm-3: the FID [0x200000003:0x3:0x0] is
used by two objects: 3080193/2625393380 33666/126694764
May 11 13:45:06 dybfs16 kernel: LustreError:
4172:0:(osd_oi.c:762:osd_oi_insert()) dm-3: the FID [0x200000003:0x3:0x0] is
used by two objects: 3080193/2625393380 33667/133493056
May 11 14:03:50 dybfs16 kernel: LustreError:
3556:0:(osd_oi.c:762:osd_oi_insert()) dm-3: the FID [0x200000003:0x3:0x0] is
used by two objects: 3080193/2625393380 33668/417086394
May 11 14:14:43 dybfs16 kernel: LustreError:
3562:0:(osd_oi.c:762:osd_oi_insert()) dm-4: the FID [0x200000003:0x3:0x0] is
used by two objects: 3080193/2625393380 33669/3676336670
May 11 14:36:10 dybfs16 kernel: LustreError:
4572:0:(osd_oi.c:762:osd_oi_insert()) dm-4: the FID [0x200000003:0x3:0x0] is
used by two objects: 3080193/2625393380 35841/3333636550
May 11 14:54:40 dybfs16 kernel: LustreError:
3663:0:(osd_oi.c:762:osd_oi_insert()) dm-2: the FID [0x200000003:0x3:0x0] is
used by two objects: 3080193/2625393380 35842/449209919
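To tally the objects named in these messages, a minimal Python sketch like the following could be used. It is only an illustration: the function name find_fid_conflicts is hypothetical, and I am assuming that each pair printed by osd_oi_insert() is inode/generation. Whitespace is collapsed first, so it also works on log text that has been wrapped by a mail client.

```python
import re
from collections import defaultdict

# Pattern for the osd_oi_insert() message quoted above.
# Assumption: each "A/B" pair in the message is inode/generation.
CONFLICT_RE = re.compile(
    r"osd_oi_insert\(\)\)\s*(?P<dev>[\w-]+): the FID (?P<fid>\[[^\]]+\]) "
    r"is used by two objects: "
    r"(?P<ino1>\d+)/(?P<gen1>\d+) (?P<ino2>\d+)/(?P<gen2>\d+)"
)


def find_fid_conflicts(log_text):
    """Return {fid: set of (inode, generation)} for every conflict message."""
    # Collapse line breaks and runs of spaces so wrapped messages still match.
    flat = " ".join(log_text.split())
    conflicts = defaultdict(set)
    for m in CONFLICT_RE.finditer(flat):
        conflicts[m["fid"]].add((int(m["ino1"]), int(m["gen1"])))
        conflicts[m["fid"]].add((int(m["ino2"]), int(m["gen2"])))
    return dict(conflicts)


# Two of the messages from the log above, pasted verbatim.
sample = """
May 11 12:23:16 dybfs16 kernel: LustreError:
4127:0:(osd_oi.c:762:osd_oi_insert()) dm-3: the FID [0x200000003:0x3:0x0] is
used by two objects: 3080193/2625393380 86/3624058548
May 11 12:36:30 dybfs16 kernel: LustreError:
3847:0:(osd_oi.c:762:osd_oi_insert()) dm-3: the FID [0x200000003:0x3:0x0] is
used by two objects: 3080193/2625393380 87/358952283
"""

for fid, objects in find_fid_conflicts(sample).items():
    print(fid, sorted(objects))
```

Run against the full syslog, this gives one entry per FID with every inode that has claimed it, which makes it easier to see whether the list of conflicting objects is still growing between mount attempts.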
What I have tried so far:
Used debugfs to locate the conflicting inodes and unlinked the duplicate
objects (86, 87 through 35842)
Ran e2fsck -fy /dev/mapper/mpathd multiple times; the affected objects were
moved to lost+found
Rebooted the node and restarted the Lustre stack
Current Lustre device status (lctl dl):
0 UP osd-ldiskfs dybfs2-OST0003-osd dybfs2-OST0003-osd_UUID 3
1 UP osd-ldiskfs dybfs2-OST0000-osd dybfs2-OST0000-osd_UUID 4
2 UP ost OSS OSS_uuid 2
3 UP mgc MGC192.168.50.23@tcp 530c1b9f-3835-9edb-6b97-f9b3b7edca49 4
4 UP obdfilter dybfs2-OST0000 dybfs2-OST0000_UUID 140
5 UP lwp dybfs2-MDT0000-lwp-OST0000 dybfs2-MDT0000-lwp-OST0000_UUID 4
6 UP osd-ldiskfs dybfs2-OST0001-osd dybfs2-OST0001-osd_UUID 4
7 UP obdfilter dybfs2-OST0001 dybfs2-OST0001_UUID 214
8 UP lwp dybfs2-MDT0000-lwp-OST0001 dybfs2-MDT0000-lwp-OST0001_UUID 4
9 UP osd-ldiskfs dybfs2-OST0002-osd dybfs2-OST0002-osd_UUID 4
10 UP obdfilter dybfs2-OST0002 dybfs2-OST0002_UUID 216
11 UP lwp dybfs2-MDT0000-lwp-OST0002 dybfs2-MDT0000-lwp-OST0002_UUID 4
12 UP osd-ldiskfs dybfs2-OST0004-osd dybfs2-OST0004-osd_UUID 4
13 UP obdfilter dybfs2-OST0004 dybfs2-OST0004_UUID 384
14 UP lwp dybfs2-MDT0000-lwp-OST0004 dybfs2-MDT0000-lwp-OST0004_UUID 4
15 UP osd-ldiskfs dybfs2-OST0005-osd dybfs2-OST0005-osd_UUID 4
16 UP obdfilter dybfs2-OST0005 dybfs2-OST0005_UUID 232
17 UP lwp dybfs2-MDT0000-lwp-OST0005 dybfs2-MDT0000-lwp-OST0005_UUID 4
All other OSTs are loaded and running normally. Only OST0003 cannot complete
setup and mount: its osd-ldiskfs device is present, but there is no obdfilter
or lwp device for it.
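For reference, the same check can be made mechanically. This is a rough Python sketch, not an official tool: the function name targets_missing_obdfilter is hypothetical, and it assumes the lctl dl columns are index, state, type, name, UUID and refcount, as in the listing above.

```python
from collections import defaultdict


def targets_missing_obdfilter(lctl_dl_text, fsname):
    """Return OST names that have an osd-ldiskfs device but no obdfilter."""
    types_by_target = defaultdict(set)
    for line in lctl_dl_text.splitlines():
        fields = line.split()
        # Assumed columns: index, state, type, name, uuid, refcount.
        if len(fields) < 4:
            continue
        dev_type, name = fields[2], fields[3]
        if dev_type == "osd-ldiskfs" and name.endswith("-osd"):
            # "dybfs2-OST0003-osd" belongs to target "dybfs2-OST0003".
            types_by_target[name[: -len("-osd")]].add(dev_type)
        elif dev_type == "obdfilter":
            types_by_target[name].add(dev_type)
    return sorted(
        target
        for target, types in types_by_target.items()
        if target.startswith(fsname + "-OST") and "obdfilter" not in types
    )


# First six lines of the listing above, pasted verbatim.
listing = """
0 UP osd-ldiskfs dybfs2-OST0003-osd dybfs2-OST0003-osd_UUID 3
1 UP osd-ldiskfs dybfs2-OST0000-osd dybfs2-OST0000-osd_UUID 4
2 UP ost OSS OSS_uuid 2
3 UP mgc MGC192.168.50.23@tcp 530c1b9f-3835-9edb-6b97-f9b3b7edca49 4
4 UP obdfilter dybfs2-OST0000 dybfs2-OST0000_UUID 140
5 UP lwp dybfs2-MDT0000-lwp-OST0000 dybfs2-MDT0000-lwp-OST0000_UUID 4
"""

print(targets_missing_obdfilter(listing, "dybfs2"))
```

On this excerpt it reports only dybfs2-OST0003, which matches what I see by eye in the full listing.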
Questions
1. The OI (object index) of OST0003 appears to be severely corrupted: the same
   FID maps to hundreds of inodes. Is there a safe way to rebuild or reset the
   OI in Lustre 2.12.5 without reformatting the whole OST?
2. Is there an official tool or procedure for cleaning up a large number of
   duplicate FID conflicts in bulk?
3. If in-place repair is impossible, what is the safest step-by-step procedure
   to recreate this OST and re-add it to the existing filesystem with minimal
   impact on current data?
Any guidance or experience sharing will be greatly appreciated.
Environment
Lustre version: 2.12.5
OS & kernel: 3.10.0-1127.8.2.el7_lustre.x86_64
Backend filesystem: ldiskfs
Best regards,
Qiuling YAO