Re: [PATCH v2] repack: Add option to preserve and prune old pack files
On Sunday, March 12, 2017 11:03:44 AM Junio C Hamano wrote:
> Jeff King writes:
> > I can think of one downside of a time-based solution, though: if you
> > run multiple gc's during the time period, you may end up using a lot
> > of disk space (one repo's worth per gc). But that's a fundamental
> > tension in the problem space; the whole point is to waste disk to
> > keep helping old processes.
>
> Yes. If you want to help a process that mmap's a packfile and wants to
> keep using it for N seconds, no matter how many times somebody else
> ran "git repack" while you are doing your work within that timeframe,
> you somehow need to make sure the NFS server knows the file is still
> in use for that N seconds.
>
> > But you may want a knob that lets you slide between those two
> > things. For instance, if you kept a sliding window of N sets of
> > preserved packs, and ejected from one end of the window (regardless
> > of time), while inserting into the other end. James' existing patch
> > is that same strategy with a hardcoded window of "1".
>
> Again, yes. But then the user does not get any guarantee of how
> long-living a process the user can have without getting broken by the
> NFS server losing track of a packfile that is still in use. My
> suggestion for the "expiry" based approach is essentially that I do
> not see a useful tradeoff afforded by having such a knob.
>
> > The other variable you can manipulate is whether to gc in the first
> > place. E.g., don't gc if there are N preserved sets (or sets
> > consuming more than N bytes, or whatever). You could do that check
> > outside of git entirely (or in an auto-gc hook, if you're using it).
>
> Yes, "don't gc/repack more than once within N seconds" may also be an
> alternative and may generally be more useful by covering general
> source of wastage coming from doing gc too frequently, not necessarily
> limited to preserved pack accumulation.
As someone who helps manage a Gerrit server for several thousand repos,
all on the same NFS disks, a time-based expiry seems impractical, and
not something that I am very interested in having. I favor the simpler
(single, for now) repacking-cycle approach; it is what we have been
using successfully for almost 6 months now, without suffering any more
stale file handle exceptions.

While time is indeed the factor that determines whether someone will
see the stale file handles or not, on a server (which is what this
feature is aimed at) this is secondary in my mind to predictability of
space utilization. I have no specific minimum time that I can reason
about, i.e. I cannot reasonably say "I want all operations that last
less than 1 hour, 1 minute, or 1 second to succeed". I don't really
want ANY failures, and I am willing to sacrifice some disk space to
prevent as many as possible. So the question to me is "How much disk
space am I willing to sacrifice?", not "How long do I want operations
to be able to last?". The only way that time enters my equation is in
comparison to how long repacking takes, i.e. I want the preserved files
cleaned up on the next repack.

So effectively I am choosing a repacking-cycle based approach so that I
can reasonably predict the extra disk space that I need to reserve for
my collection of repos. With a single cycle, I am effectively doubling
the "static" size of the repos. Achieving this predictability with a
time-based approach requires coordinating the expiry setting with the
repacking cycle. That coordination is extra effort for me, with no
apparent gain. It is also an additional risk that I don't want to take:
if I decide to bump up how often I run repacking, and I forget to
reduce the expiry time, my disk utilization will grow and potentially
cause serious issues for all my repositories (since they share the same
volume).
This problem is even more difficult if I decide to use a usage-based
(instead of time-based) algorithm to determine when I repack.

Admittedly, a repacking-cycle based approach happens to be very easy
and practical when it is a "single" cycle. If I eventually determine
empirically that a single cycle is not long enough for my server, I
don't know what I will do. Perhaps I would then want a switch that
preserves the packs for another cycle? Maybe it could work the way that
log rotation works: add a number to the end of each file name for each
preserved cycle? This option seems preferable to me over a time-based
approach because it makes the impact on disk utilization more obvious.
However, so far in practice, this has not seemed necessary.

I don't really see a good use case for a time-based expiry (other than
"this is how it was done for other things in git"). Of course, that
doesn't mean such a use case doesn't exist, but I don't support adding
a feature unless I really understand why and how someone would want to
use it. I think tha
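[Editorial sketch] The logrotate-style numbering floated above could look roughly like the following shell sketch. The `*.old-pack.N` naming and the `rotate_preserved` helper are hypothetical illustrations, not part of any patch in this thread:

```shell
# Hypothetical sketch of logrotate-style preservation: keep up to
# $keep generations of preserved packs, named pack-X.old-pack.1
# (newest cycle) through pack-X.old-pack.$keep (oldest cycle).
keep=2

rotate_preserved () {
	dir=$1
	# drop the oldest preserved generation entirely
	rm -f "$dir"/*.old-pack."$keep"
	# shift each remaining generation up by one
	n=$keep
	while [ "$n" -gt 1 ]; do
		prev=$((n - 1))
		for f in "$dir"/*.old-pack."$prev"; do
			[ -e "$f" ] || continue
			mv "$f" "${f%.$prev}.$n"
		done
		n=$prev
	done
}
```

A repack run would then call this before renaming the newly redundant packs to `*.old-pack.1`, keeping the disk overhead bounded at a predictable $keep extra copies.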
Re: [PATCH v2] repack: Add option to preserve and prune old pack files
Jeff King writes:

> I can think of one downside of a time-based solution, though: if you run
> multiple gc's during the time period, you may end up using a lot of disk
> space (one repo's worth per gc). But that's a fundamental tension in the
> problem space; the whole point is to waste disk to keep helping old
> processes.

Yes. If you want to help a process that mmap's a packfile and wants to
keep using it for N seconds, no matter how many times somebody else ran
"git repack" while you are doing your work within that timeframe, you
somehow need to make sure the NFS server knows the file is still in use
for that N seconds.

> But you may want a knob that lets you slide between those two
> things. For instance, if you kept a sliding window of N sets of
> preserved packs, and ejected from one end of the window (regardless of
> time), while inserting into the other end. James' existing patch is that
> same strategy with a hardcoded window of "1".

Again, yes. But then the user does not get any guarantee of how
long-living a process the user can have without getting broken by the
NFS server losing track of a packfile that is still in use. My
suggestion for the "expiry" based approach is essentially that I do not
see a useful tradeoff afforded by having such a knob.

> The other variable you can manipulate is whether to gc in the first
> place. E.g., don't gc if there are N preserved sets (or sets consuming
> more than N bytes, or whatever). You could do that check outside of git
> entirely (or in an auto-gc hook, if you're using it).

Yes, "don't gc/repack more than once within N seconds" may also be an
alternative and may generally be more useful by covering general source
of wastage coming from doing gc too frequently, not necessarily limited
to preserved pack accumulation.
Re: [PATCH v2] repack: Add option to preserve and prune old pack files
On Fri, Mar 10, 2017 at 03:43:43PM -0800, Junio C Hamano wrote:
> James Melvin writes:
> > The new --preserve-and-prune option renames old pack files
> > instead of deleting them after repacking and prunes previously
> > preserved pack files.
> >
> > This option is designed to prevent stale file handle exceptions
> > during git operations which can happen on users of NFS repos when
> > repacking is done on them. The strategy is to preserve old pack files
> > around until the next repack with the hopes that they will become
> > unreferenced by then and not cause any exceptions to running processes
> > when they are finally deleted (pruned).
>
> This certainly is simpler than the previous one, but if you run
>
>     git repack --preserve-and-prune
>     sleep for N minutes
>     git repack --preserve-and-prune
>
> the second "repack" will drop the obsoleted packs that were
> preserved by the first one no matter how short the value of N is,
> no?
>
> I suspect that users would be more comfortable with something based
> on expiration timestamp that gives them a guaranteed grace period,
> e.g. "--preserve-expire=8.hours", like how we expire reflog entries
> and loose objects.
>
> Perhaps builtin/prune.c can be a source of inspiration?

I have been a bit slow to read this series, but FWIW I had the exact
same thought while reading up to this point. There should be an
expiration, and that expiration time corresponds roughly to "how long
do you expect a long-running git operation to last".

You'd probably want "--preserve-expire" as an option to repack, and
then a "gc.preservePackExpire" option so that "git gc" triggers it
reliably.

I can think of one downside of a time-based solution, though: if you
run multiple gc's during the time period, you may end up using a lot of
disk space (one repo's worth per gc). But that's a fundamental tension
in the problem space; the whole point is to waste disk to keep helping
old processes.
But you may want a knob that lets you slide between those two things.
For instance, if you kept a sliding window of N sets of preserved
packs, and ejected from one end of the window (regardless of time),
while inserting into the other end. James' existing patch is that same
strategy with a hardcoded window of "1".

The other variable you can manipulate is whether to gc in the first
place. E.g., don't gc if there are N preserved sets (or sets consuming
more than N bytes, or whatever). You could do that check outside of git
entirely (or in an auto-gc hook, if you're using it).

Note that something like a sliding window gets complicated pretty
quickly. I'm not really proposing it as a direction; I'm just trying to
think of ways the time-based system could go wrong. IMHO it would
probably be fine to just ignore the problem and assume that repacking
doesn't happen often enough for it to come up in practice.

-Peff
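[Editorial sketch] The last alternative above, gating gc on how much disk the preserved sets already consume, can be done entirely outside of git. This is a hedged shell sketch: the `.old-pack` naming matches this patch's renaming scheme, but the byte budget and the `should_repack` helper name are made up for illustration:

```shell
# Sketch of an out-of-git gating check: allow a repack only while the
# preserved packs stay under a byte budget.  The .old-pack suffix
# follows this patch's naming; the budget and helper are hypothetical.
should_repack () {
	dir=$1 budget=$2
	used=0
	for f in "$dir"/*.old-pack; do
		[ -e "$f" ] || continue
		used=$((used + $(stat -c %s "$f")))
	done
	[ "$used" -lt "$budget" ]
}

# e.g. (hypothetical usage, from cron or an auto-gc hook):
#   should_repack .git/objects/pack $((1 << 30)) && git repack -a -d
```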
Re: [PATCH v2] repack: Add option to preserve and prune old pack files
On Fri, Mar 10, 2017 at 11:00 PM, James Melvin wrote:
> The new --preserve-and-prune option renames old pack files
> instead of deleting them after repacking and prunes previously
> preserved pack files.

I think some of this rationale...

> This option is designed to prevent stale file handle exceptions
> during git operations which can happen on users of NFS repos when
> repacking is done on them. The strategy is to preserve old pack files
> around until the next repack with the hopes that they will become
> unreferenced by then and not cause any exceptions to running processes
> when they are finally deleted (pruned).

...belongs in the actual docs, i.e. here:

> +--preserve-and-prune::
> +	Preserve old pack files by renaming them instead of deleting. Prune any
> +	previously preserved pack files before preserving new ones.
> +

This is a really obscure option with an obscure use-case; let's explain
that in the docs, and also mention NFS & what problems it solves there,
so someone having the same issue can find the solution more easily.
Re: [PATCH v2] repack: Add option to preserve and prune old pack files
James Melvin writes:

> The new --preserve-and-prune option renames old pack files
> instead of deleting them after repacking and prunes previously
> preserved pack files.
>
> This option is designed to prevent stale file handle exceptions
> during git operations which can happen on users of NFS repos when
> repacking is done on them. The strategy is to preserve old pack files
> around until the next repack with the hopes that they will become
> unreferenced by then and not cause any exceptions to running processes
> when they are finally deleted (pruned).

This certainly is simpler than the previous one, but if you run

    git repack --preserve-and-prune
    sleep for N minutes
    git repack --preserve-and-prune

the second "repack" will drop the obsoleted packs that were preserved
by the first one no matter how short the value of N is, no?

I suspect that users would be more comfortable with something based on
expiration timestamp that gives them a guaranteed grace period, e.g.
"--preserve-expire=8.hours", like how we expire reflog entries and
loose objects.

Perhaps builtin/prune.c can be a source of inspiration?
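[Editorial sketch] Until such a "--preserve-expire" exists, the grace-period idea can be approximated outside of git with find(1). This is a sketch only, not what the patch implements: the `*.old-*` names follow this patch's renaming scheme, and the 8-hour window (480 minutes) is an assumption borrowed from the example above:

```shell
# Sketch: expiry-based pruning of preserved packs, done outside git.
# Deletes preserved files older than 8 hours, leaving newer ones as
# the grace period for long-running processes on NFS.
packdir=".git/objects/pack"
if [ -d "$packdir" ]; then
	find "$packdir" -maxdepth 1 \
		\( -name '*.old-pack' -o -name '*.old-idx' \
		   -o -name '*.old-keep' -o -name '*.old-bitmap' \) \
		-mmin +480 -delete
fi
```

Run from cron, this gives the guaranteed grace period Junio describes while keeping the pruning policy out of git itself.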
[PATCH v2] repack: Add option to preserve and prune old pack files
The new --preserve-and-prune option renames old pack files
instead of deleting them after repacking and prunes previously
preserved pack files.

This option is designed to prevent stale file handle exceptions
during git operations which can happen on users of NFS repos when
repacking is done on them. The strategy is to preserve old pack files
around until the next repack with the hopes that they will become
unreferenced by then and not cause any exceptions to running processes
when they are finally deleted (pruned).

Signed-off-by: James Melvin
---
 Documentation/git-repack.txt |  4 +++
 builtin/repack.c             | 66 +++-
 2 files changed, 69 insertions(+), 1 deletion(-)

diff --git a/Documentation/git-repack.txt b/Documentation/git-repack.txt
index 26afe6ed5..effd98a43 100644
--- a/Documentation/git-repack.txt
+++ b/Documentation/git-repack.txt
@@ -143,6 +143,10 @@ other objects in that pack they already have locally.
 	being removed. In addition, any unreachable loose objects will
 	be packed (and their loose counterparts removed).
 
+--preserve-and-prune::
+	Preserve old pack files by renaming them instead of deleting. Prune any
+	previously preserved pack files before preserving new ones.
+
 Configuration
 -------------

diff --git a/builtin/repack.c b/builtin/repack.c
index 677bc7c81..78fad8f1a 100644
--- a/builtin/repack.c
+++ b/builtin/repack.c
@@ -10,6 +10,7 @@
 static int delta_base_offset = 1;
 static int pack_kept_objects = -1;
+static int preserve_and_prune;
 static int write_bitmaps;
 static char *packdir, *packtmp;
@@ -108,6 +109,60 @@ static void get_non_kept_pack_filenames(struct string_list *fname_list)
 	closedir(dir);
 }
 
+/*
+ * Adds all preserved packs hex strings to the fname list
+ */
+static void get_preserved_pack_filenames(struct string_list *fname_list)
+{
+	DIR *dir;
+	struct dirent *e;
+	char *fname;
+
+	if (!(dir = opendir(packdir)))
+		return;
+
+	while ((e = readdir(dir)) != NULL) {
+		size_t len;
+		if (!strip_suffix(e->d_name, ".old-pack", &len))
+			continue;
+
+		fname = xmemdupz(e->d_name, len);
+		string_list_append_nodup(fname_list, fname);
+	}
+	closedir(dir);
+}
+
+static void remove_preserved_packs(void) {
+	const char *exts[] = {"pack", "idx", "keep", "bitmap"};
+	int i;
+	struct string_list names = STRING_LIST_INIT_DUP;
+	struct string_list_item *item;
+
+	get_preserved_pack_filenames(&names);
+
+	for_each_string_list_item(item, &names) {
+		for (i = 0; i < ARRAY_SIZE(exts); i++) {
+			char *fname;
+			fname = mkpathdup("%s/%s.old-%s",
+					packdir,
+					item->string,
+					exts[i]);
+			if (remove_path(fname))
+				warning(_("failed to remove '%s'"), fname);
+			free(fname);
+		}
+	}
+}
+
+static void preserve_pack(const char *file_path, const char *file_name, const char *file_ext)
+{
+	char *fname_old;
+
+	fname_old = mkpathdup("%s/%s.old-%s", packdir, file_name, ++file_ext);
+	rename(file_path, fname_old);
+	free(fname_old);
+}
+
 static void remove_redundant_pack(const char *dir_name, const char *base_name)
 {
 	const char *exts[] = {".pack", ".idx", ".keep", ".bitmap"};
@@ -121,7 +176,10 @@ static void remove_redundant_pack(const char *dir_name, const char *base_name)
 	for (i = 0; i < ARRAY_SIZE(exts); i++) {
 		strbuf_setlen(&buf, plen);
 		strbuf_addstr(&buf, exts[i]);
-		unlink(buf.buf);
+		if (preserve_and_prune)
+			preserve_pack(buf.buf, base_name, exts[i]);
+		else
+			unlink(buf.buf);
 	}
 	strbuf_release(&buf);
 }
@@ -194,6 +252,9 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 			N_("maximum size of each packfile")),
 		OPT_BOOL(0, "pack-kept-objects", &pack_kept_objects,
 			N_("repack objects in packs marked with .keep")),
+		OPT_BOOL(0, "preserve-and-prune", &preserve_and_prune,
+			N_("preserve old pack files by renaming them instead of deleting, prune any "
+			"previously preserved pack files before preserving new ones")),
 		OPT_END()
 	};
@@ -404,6 +465,9 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 
 	/* End of pack replacement. */
 
+	if (preserve_and_prune)
+		remove_preserved_packs();
+
 	if (delete_redundant) {
 		int opts = 0;
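[Editorial sketch] For illustration, the preserve-then-prune cycle the patch implements can be simulated with plain files. This is a behavioral sketch only (covering just the pack/idx pair and using an assumed `.old-` naming matching the patch), not the actual C code path:

```shell
# Simulated behavior of --preserve-and-prune using plain files:
# each "repack" first prunes the sets preserved by the previous run,
# then preserves the newly redundant packs by renaming them.
preserve_and_prune () {
	dir=$1; shift	# remaining args: base names of now-redundant packs
	# prune packs preserved by the previous repack
	rm -f "$dir"/*.old-pack "$dir"/*.old-idx
	# preserve this cycle's redundant packs by renaming
	for base in "$@"; do
		for ext in pack idx; do
			if [ -e "$dir/$base.$ext" ]; then
				mv "$dir/$base.$ext" "$dir/$base.old-$ext"
			fi
		done
	done
}
```

This makes the disk-space story visible: between two repacks, at most one extra generation of packs is kept, which is the "single cycle" predictability James argues for earlier in the thread.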