Greetings,

We're looking for suggestions on how to interpret the status output of the various stages of the 'lctl lfsck_start' command, in particular the oi_scrub failed counts.

The manual states the following in the 'LFSCK status of OI Scrub' section:

    'Failed - total number of objects that failed to be repaired.'

A recent 'lctl lfsck_start -M Name-MDT0000' to verify OI, layout and namespace reported high failed counts on the oi_scrub for all of our OSTs in the FS. This was unexpected.

We were running an online lfsck because we had a single OST go read-only whilst the underlying RAID6 hardware was rebuilding a disk and had a long period of not responding to I/O (I'll spare this tale of woe). The resulting e2fsck'd OST had 6 zero-sized "Unattached inode" entries, each containing a trusted.lma extended attribute, that were routed manually to /lost+found. These 6 inodes showed up in the /proc/fs/lustre/osd-ldiskfs/<OST>/oi_scrub file:

    lf_scanned: 6
    lf_repaired: 6

However, this and the 49 other OSTs also showed 'failed:' counts in oi_scrub, ranging between ~45000 and ~50200. A snippet from the OST having the above lf_* counts:

    first_failure_position: 87
    checked: 1784231
    updated: 327
    failed: 47725
    prior_updated: 0
    noscrub: 225
    igif: 1
    success_count: 1

All of the OST oi_scrub status files had the following:

    first_failure_position: 87

All the OSSs have the following default debug settings:

    lctl get_param debug
    debug=ioctl neterror warning error emerg ha config console lfsck

After performing a 'lctl debug_kernel dk.txt' on the OSSs and looking for LFSCK subsystem/debug_mask lines, the number of lines that appeared to be involved with scrub activities was _much_ smaller than the failed counts. The scrub LFSCK debug lines looked similar to the following:

    00100000:10000000:17.0:1545248778.311791:0:13132:0:(osd_scrub.c:454:osd_scrub_convert_ff()) Name-OST002a-osd: fail to convert ff [0x100000000:0xb0:0x0]: rc = -17

and I assume -17 is -EEXIST.

Should we be concerned about these failed counts? If so, how do we match failed counts in LFSCK status output to Lustre debug lines so we can find the cause and try to resolve the problem?

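For reference, here is a minimal Python sketch of how the comparison between the status counters and the debug log could be scripted. It assumes the oi_scrub status file is plain 'key: value' lines as in the snippet above, and that the relevant debug lines all have the shape of the one sample quoted; other message formats from osd_scrub.c are not covered. The names parse_status and tally_failures are made up for illustration and are not part of any Lustre tooling.

```python
import errno
import re
from collections import Counter

# Regex built from the single sample debug line quoted above (assumption:
# other scrub messages may be shaped differently and will not match).
SCRUB_LINE = re.compile(
    r"\(osd_scrub\.c:\d+:(?P<func>\w+)\(\)\)\s+(?P<dev>\S+):.*rc = (?P<rc>-?\d+)\s*$"
)

def parse_status(text):
    """Collect the integer 'key: value' counters from an oi_scrub status dump."""
    counters = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        value = value.strip()
        if sep and re.fullmatch(r"-?\d+", value):
            counters[key.strip()] = int(value)
    return counters

def tally_failures(lines):
    """Count debug lines with a non-zero rc, keyed by (device, function, rc)."""
    tally = Counter()
    for line in lines:
        match = SCRUB_LINE.search(line)
        if match and int(match.group("rc")) != 0:
            key = (match.group("dev"), match.group("func"), int(match.group("rc")))
            tally[key] += 1
    return tally

# Values copied from the post; a real run would read the oi_scrub file and dk.txt.
SAMPLE_STATUS = """
first_failure_position: 87
checked: 1784231
updated: 327
failed: 47725
prior_updated: 0
noscrub: 225
igif: 1
success_count: 1
"""
SAMPLE_LOG = [
    "00100000:10000000:17.0:1545248778.311791:0:13132:0:"
    "(osd_scrub.c:454:osd_scrub_convert_ff()) Name-OST002a-osd: "
    "fail to convert ff [0x100000000:0xb0:0x0]: rc = -17",
]

status = parse_status(SAMPLE_STATUS)
tally = tally_failures(SAMPLE_LOG)

print("failed counter from status:", status["failed"])
print("failure lines found in log:", sum(tally.values()))
for (dev, func, rc), count in tally.items():
    # errno.errorcode maps the positive errno number to its symbolic name.
    print(dev, func, rc, errno.errorcode.get(-rc, "unknown"), count)
```

A large gap between the two totals would only show that the logged lines do not account for the counter; it would not by itself say where the remaining failures are counted.
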
We're running Lustre 2.8.0 on the servers and clients, in case that matters at all.

Thank you in advance for any wisdom you can share,
-Josh

_______________________________________________
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org