I opened this Bugzilla issue for tracking purposes: https://bugzilla.redhat.com/show_bug.cgi?id=1481085
On Sun, Aug 13, 2017 at 8:05 AM, Mark Mielke <[email protected]> wrote:

> I searched around for this a bit, and although other users may have hit
> this, I didn't find a good explanation offered. I suspect the users clean
> it up manually and then it disappears for another 2 years. I hope this
> message will get captured by Google, and help somebody else out. Also, I
> hope to have some discussion about this as it seems like an easily
> preventable problem.
>
> The archive file names are generated like:
>
>     if (dm_snprintf(archive_name, sizeof(archive_name),
>                     "%s/%s_%05u-%d.vg",
>                     dir, vg->name, ix, rnum) < 0) {
>
> The directory scanning code that loads the archive file names into memory
> recognizes a problem, although it isn't explicit about what the problem is:
>
>     /* Sort fails beyond 5-digit indexes */
>     if ((count = scandir(dir, &dirent, NULL, alphasort)) < 0) {
>             log_error("Couldn't scan the archive directory (%s).",
>                       dir);
>             return 0;
>     }
>
> The file names encode the index like "00000". The sorting code uses
> "alphasort", which will only work properly as long as the index stays
> within 5 digits. As soon as it exceeds 5 digits, it begins to sort the
> "100000" to the beginning, and "99999" to the end. Then, new archives seem
> to *all* be "100000". We had some 40,000 indexes with "100000" before we
> noticed. And, because the index is followed by a random number, it would
> only expire a few of the "100000" before it would hit one that was younger
> than the 30 days retention period set by default. When I reduced the
> retention period to 7 days, it expired only about 12 archive files of
> 40,000 archive files. This behaviour is probably due to random number
> distribution ensuring that there are always some recent records near 0?
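
To make the ordering problem concrete, here is a minimal sketch that builds two names with the same "%s_%05u-%d.vg" pattern and compares them byte by byte. The helpers make_archive_name() and sorts_before(), the VG name "vg0" and the random suffix 1234 are made up for illustration and are not LVM code. strcmp() stands in for alphasort(), which compares d_name with strcoll(); the two agree in the "C" locale, and other locales may differ in detail.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Build an archive file name with the same format string as the quoted
 * dm_snprintf() call, minus the directory prefix.
 * Hypothetical helper, not an LVM function.
 */
static void make_archive_name(char *buf, size_t len, const char *vg,
                              unsigned ix, int rnum)
{
        int n = snprintf(buf, len, "%s_%05u-%d.vg", vg, ix, rnum);

        assert(n >= 0 && (size_t) n < len);     /* name must fit in buf */
}

/*
 * Return non-zero if a byte-wise comparison places the archive with index
 * ix_a before the archive with index ix_b.
 * Assumption: strcmp() approximates alphasort() (strcoll() in the "C" locale).
 */
static int sorts_before(unsigned ix_a, unsigned ix_b)
{
        char a[64], b[64];

        make_archive_name(a, sizeof(a), "vg0", ix_a, 1234);
        make_archive_name(b, sizeof(b), "vg0", ix_b, 1234);
        return strcmp(a, b) < 0;
}
```

With these helpers, sorts_before(99998, 99999) is true as expected, but sorts_before(100000, 99999) is also true: the sixth digit makes the name start with '1', which compares lower than '9', so the newest archive lands ahead of older ones.
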
>
> This issue eventually affects everyone, although obviously the people that
> use features like snapshots more frequently (we use it every 15 minutes,
> across multiple volumes) will hit it sooner.
>
> There are a few fixes possible... Probably, "alphasort" should not be used
> at all, but a context aware sort should be used, that can filter and sort
> as it goes, decoding the index correctly as a number, and comparing it as a
> number. Then, if performance and scalability matter, it would be ideal if
> it did this in a single pass, buffering only the minimum needed to expire
> the correct archive files.
>
> We hit this on RHEL 7.2. I wasn't surprised to find it in RHEL 7.2, but I
> was surprised that it still exists on "master". "git blame" says this has
> been an issue since 2002:
>
> 5be981bab5 (Alasdair Kergon 2002-05-07 12:47:11 +0000 139)     /* Sort fails beyond 5-digit indexes */
> 59d6420b9a (Joe Thornber    2002-02-08 11:58:18 +0000 140)     if ((count = scandir(dir, &dirent, NULL, alphasort)) < 0) {
> b8f47d5f69 (Alasdair Kergon 2009-07-15 20:02:46 +0000 141)             log_error("Couldn't scan the archive directory (%s).", dir);
> 952d12a5f5 (Alasdair Kergon 2002-01-09 19:16:48 +0000 142)             return 0;
> 952d12a5f5 (Alasdair Kergon 2002-01-09 19:16:48 +0000 143)     }
>
> Ouch... :-)
>
> For anybody that does hit this.... Pruning the archive files with index <
> 100000 is effective. It starts counting from 100000, and you now have 9X
> more life before it will happen again... :-)
>
> --
> Mark Mielke <[email protected]>

--
Mark Mielke <[email protected]>
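
As a rough illustration of the "decode the index as a number" idea above, here is a minimal sketch of a scandir() comparator that orders archive names by their numeric index instead of alphabetically. This is not the LVM implementation or a proposed patch. The names archive_index(), archive_name_cmp() and archive_numsort() are hypothetical, and the sketch assumes the index is the digit run between the last '_' and the following '-', as in the "%s_%05u-%d.vg" pattern.

```c
#include <assert.h>
#include <ctype.h>
#include <dirent.h>
#include <errno.h>
#include <limits.h>
#include <stdlib.h>
#include <string.h>

/*
 * Decode the numeric index from an archive file name of the form
 * "<vg>_<index>-<rnum>.vg". Returns -1 if the name does not match.
 * Searching for the last '_' keeps VG names containing '_' working.
 */
static long long archive_index(const char *name)
{
        const char *us;
        char *end;
        unsigned long long ix;

        assert(name != NULL);

        us = strrchr(name, '_');
        if (!us || !isdigit((unsigned char) us[1]))
                return -1;

        errno = 0;
        ix = strtoull(us + 1, &end, 10);
        if (errno || *end != '-' || ix > UINT_MAX)
                return -1;

        return (long long) ix;
}

/*
 * Order two names by decoded index; names that do not parse (index -1)
 * sort first. Ties fall back to strcmp() so the order stays total.
 */
static int archive_name_cmp(const char *a, const char *b)
{
        long long ia = archive_index(a);
        long long ib = archive_index(b);

        if (ia != ib)
                return ia < ib ? -1 : 1;
        return strcmp(a, b);
}

/* Comparator with the signature scandir() expects. */
int archive_numsort(const struct dirent **a, const struct dirent **b)
{
        return archive_name_cmp((*a)->d_name, (*b)->d_name);
}
```

It could be passed in place of alphasort, as in scandir(dir, &dirent, NULL, archive_numsort). This only addresses the ordering; filtering and expiring in a single pass would need more than a comparator.
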
_______________________________________________
linux-lvm mailing list
[email protected]
https://www.redhat.com/mailman/listinfo/linux-lvm
read the LVM HOW-TO at http://tldp.org/HOWTO/LVM-HOWTO/
