Hi Rick,

I did attempt that, and while subsequent access didn't cause an MDS panic, the client threw errors like "cannot get group lock: Invalid argument (22)".

I'm going to attempt the patch and workaround from LU-16194 suggested by Andreas a couple hours ago on the LU-16152 bug report. My guess is that normal people set the PFL components directly as arguments to lfs setstripe, or reference an existing file's PFL with --copy. Both of those methods work fine, but I took the fancy yaml route.

Thanks,
Nate

On Tue, Oct 25, 2022 at 2:51 PM Mohr, Rick <[email protected]> wrote:
> Nate,
>
> For the example layout you attached, it looks like the file does not have any data in the components with the messed up extent_end value. Have you tried using "lfs setstripe --component-del" to delete just those messed up components and see if you can then access the data?
>
> --Rick
>
>
> On 10/25/22, 4:43 PM, "lustre-discuss on behalf of Nathan Crawford" <[email protected] on behalf of [email protected]> wrote:
>
> Hi All,
>
> I'm looking for possible work-arounds to recover data from some mis-migrated files (as seen in LU-16152). Basically, there's a bug in "lfs setstripe --yaml" where extent start/end values in the yaml file >= 2GiB overflow to 16 EiB - 2 GiB.
>
> Using lfs_migrate, I re-striped many files in directories with a default striping pattern containing these values. I'm pretty sure that the data exists (was trying to purge an older OST, and disk usage on the other OSTs increased as the purged OST decreased), and an lfsck procedure happily returns after a day or so. Unfortunately, attempts to access or re-migrate the files trigger a kernel panic on the MDS with:
>
> LustreError: 12576:0:(osd_io.c:311:kmem_to_page()) ASSERTION( !((unsigned long)addr & ~(~(((1UL) << 12)-1))) ) failed:
> LustreError: 12576:0:(osd_io.c:311:kmem_to_page()) LBUG
>
> Kernel panic - not syncing: LBUG
>
> The servers are Lustre 2.12.8 on OpenZFS 0.8.5 on CentOS 7.9. The output from "lfs getstripe -v badfile" is attached.
>
> I can use lfs find to search for files with these bad extent endpoint values, then move them to a quarantine area on the same FS. This will allow the rest of the system to stay up (hopefully), but recovering the data is still needed.
>
> Thanks!
> Nate
>
> --
> Dr. Nathan Crawford [email protected]
> Director of Scientific Computing
> School of Physical Sciences
> 164 Rowland Hall Office: 152 Rowland Hall
> University of California, Irvine Phone: 949-824-1380
> Irvine, CA 92697-2025, USA

--
Dr. Nathan Crawford [email protected]
Director of Scientific Computing
School of Physical Sciences
164 Rowland Hall Office: 152 Rowland Hall
University of California, Irvine Phone: 949-824-1380
Irvine, CA 92697-2025, USA
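
For anyone puzzling over where "16 EiB - 2 GiB" comes from: one plausible mechanism, and this is an assumption rather than a diagnosis of the actual lfs yaml code, is that an extent offset passes through a signed 32-bit integer and is then sign-extended into the unsigned 64-bit extent field. The short Python sketch below models that round trip; `to_int32`, `to_uint64` and `widened` are hypothetical helper names, not Lustre functions.

```python
# Hypothetical illustration (assumed mechanism, not taken from the lfs source):
# an extent offset is stored in a signed 32-bit integer and later widened,
# with sign extension, into an unsigned 64-bit extent field.

GiB = 1 << 30
EiB = 1 << 60


def to_int32(value: int) -> int:
    """Wrap an integer into the signed 32-bit range, like a C int32_t."""
    value &= 0xFFFFFFFF
    return value - (1 << 32) if value >= (1 << 31) else value


def to_uint64(value: int) -> int:
    """Reinterpret a possibly negative integer as an unsigned 64-bit value."""
    return value & 0xFFFFFFFFFFFFFFFF


def widened(extent: int) -> int:
    """Apply the assumed int32 -> uint64 round trip to an extent offset."""
    return to_uint64(to_int32(extent))


for extent in (1 * GiB, 2 * GiB):
    print(f"{extent} -> {widened(extent)}")

# 2 GiB does not fit in a signed 32-bit integer, so it wraps to -2 GiB and is
# then read back as 2**64 - 2**31, i.e. 16 EiB - 2 GiB.
assert widened(2 * GiB) == 16 * EiB - 2 * GiB
```

Under this model, values below 2 GiB pass through unchanged and a 2 GiB boundary lands exactly on 16 EiB - 2 GiB, matching the report. Larger boundaries such as 3 GiB would map to different values (3 GiB becomes 16 EiB - 1 GiB), so if lfs always produced exactly 16 EiB - 2 GiB for every value >= 2 GiB, the real cause is something other than plain sign extension.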
_______________________________________________
lustre-discuss mailing list
[email protected]
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
