There seem to have been a few instances of this reported here on the list in the last few months. I don't recall which Lustre versions the earlier reports involved, but we have also seen this in the wild on customer systems, so it is very likely a bug that results in corruption of llog files.
-cf

On Thu, Jun 1, 2017 at 11:36 AM, Dilger, Andreas <[email protected]> wrote:

> On Jun 1, 2017, at 10:55, Faccini, Bruno <[email protected]> wrote:
> >
> > Hello,
> > According to the error msgs, looks like there is a corrupted
> > plain-LLOG file for the ChangeLogs of MDT0. And unfortunately,
> > neither e2fsck nor lfsck can help to recover in this case.
>
> Bruno,
> is this bug fixed in newer Lustre releases, or can something be done
> in the ChangeLog handling so that the ChangeLog can still be cleared
> in this case? I don't think we care if the record is invalid when it
> is being deleted... Could you please file a ticket in Jira about
> this, if it isn't already fixed.
>
> Cheers, Andreas
>
> > I think that to clear this situation you need to stop/umount this
> > MDT and re-mount it as ldiskfs to move both changelog_users and
> > changelog_catalog files to some alternate place/name (do not remove
> > them!), umount ldiskfs, re-start/mount your MDT, re-run a RBH
> > full-scan, re-register a ChangeLog user.
> >
> > Only side-effect doing so, can be the volume of orphan plain-LLOGs
> > that will be kept consuming space on MDT. You should be able to
> > identify them by running llog_reader tool over the saved/renamed old
> > catalog file that will list you the references to all these
> > remaining plain-LLOGs, allowing you to find+remove them during a new
> > ldiskfs-mount session.
> >
> > Bruno.
> >
> >> On Jun 1, 2017, at 4:09 PM, Gibbins, Faye <[email protected]> wrote:
> >>
> >> Hi,
> >>
> >> We have 4 file systems on our lustre cluster. All have changelog
> >> users registered for robinhood to use.
> >>
> >> We have discovered that a changelog user for one of the file
> >> systems is not catching up to its index. Manual runs of Robinhood
> >> fail to read any more records even though according to
> >> mdd/tools-MDT0000/changelog_users there are record to read!
> >>
> >> Over time the change log had filled and the file system had become
> >> sluggish. Wiping the robinhood mysql and reinitializing robin hood
> >> with a full scan didn’t fix the issue and like I said above three
> >> other change logs from different file systems (on the same MSG) are
> >> ok when used from the same robinhood instance.
> >>
> >> What makes me think this is a lustre (and we are using 2.8 on ext4)
> >> problem is this (repeated) error we are getting in syslog:
> >>
> >> [Wed May 31 14:06:59 2017] Lustre: 46400:0:(llog.c:530:llog_process_thread()) invalid length -420090294 in llog record for index 372672342/61708
> >> [Wed May 31 14:06:59 2017] LustreError: 46400:0:(mdd_device.c:261:llog_changelog_cancel()) tools-MDD0000: cancel idx 645 of catalog 0x7:10 rc=-22
> >>
> >> Deregistering the user from the change log and starting with a new
> >> one has not changed the behaviour and we still can’t use this new
> >> user to track changes to the file system.
> >>
> >> Can anyone offer any advice on how to resolve this issue in the
> >> changelog?
> >> If not can anyone confirm if taking the file system down for a
> >> e2fsck/lfsck will fix issues with the changelog? I’d settle for
> >> being able to clear the whole log and starting afresh if that’s
> >> possible?
> >>
> >> Yours
> >> Faye Gibbins
> >> Snr SysAdmin, Unix Lead Architect
> >> Software Systems and Cloud Services
> >> Cirrus Logic | cirrus.com | +44 (0) 131 272 7398
>
> Cheers, Andreas
> --
> Andreas Dilger
> Lustre Principal Architect
> Intel Corporation
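
For reference, the recovery sequence Bruno describes in the quoted message could be sketched as a dry-run command plan. This is only a minimal sketch under stated assumptions, not a tested procedure: the device path, both mount points, and the `.saved` suffix are hypothetical placeholders, and it assumes the changelog_catalog and changelog_users files sit at the top level of the ldiskfs-mounted MDT. The function name `plan_changelog_reset` is made up for illustration. Nothing is executed; the script only prints the commands so an administrator can review them first.

```python
import shlex


def plan_changelog_reset(mdt_dev, lustre_mnt, ldiskfs_mnt, mdt_name, suffix=".saved"):
    """Return, in order, the shell commands for the recovery steps in the thread.

    All arguments are site-specific. The commands are returned as strings and
    never run, so the plan can be reviewed before anything touches the MDT.
    """
    # Assumption: both files live at the root of the ldiskfs-mounted MDT.
    files = ["changelog_catalog", "changelog_users"]

    cmds = [
        ["umount", lustre_mnt],                            # stop the MDT
        ["mount", "-t", "ldiskfs", mdt_dev, ldiskfs_mnt],  # re-mount it as ldiskfs
    ]
    for name in files:
        path = f"{ldiskfs_mnt}/{name}"
        # Rename rather than delete ("do not remove them!"), so that the old
        # catalog can still be inspected with llog_reader afterwards.
        cmds.append(["mv", path, path + suffix])
    cmds += [
        ["umount", ldiskfs_mnt],
        ["mount", "-t", "lustre", mdt_dev, lustre_mnt],    # restart the MDT
        ["lctl", "--device", mdt_name, "changelog_register"],
    ]
    return [shlex.join(c) for c in cmds]


if __name__ == "__main__":
    # Hypothetical device and mount points; the MDT name is taken from the thread.
    for cmd in plan_changelog_reset("/dev/sdX", "/mnt/mdt", "/mnt/mdt_ldiskfs",
                                    "tools-MDT0000"):
        print(cmd)
```

The Robinhood full scan is left out of the plan because it runs from the Robinhood host rather than on the MDS. The renamed catalog is kept so that llog_reader can later list the orphan plain-LLOGs it references, as suggested in the quoted message.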
_______________________________________________ lustre-discuss mailing list [email protected] http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
