On 10/03/2014 10:29 PM, Jeff King wrote:
> Prune has to walk $GIT_DIR/objects/?? in order to find the
> set of loose objects to prune. Other parts of the code
> (e.g., count-objects) want to do the same. Let's factor it
> out into a reusable for_each-style function.
>
> Note that this is not quite a straight code movement. There
> are two differences:
>
> 1. The original code iterated from 0 to 256, trying to
> opendir("$GIT_DIR/%02x"). The new code just does a
> readdir() on the object directory, and descends into
> any matching directories. This is faster on
> already-pruned repositories, and should not ever be
> slower (nobody ever creates other files in the object
> directory).
This would change the order that the objects are processed. I doubt that
matters to anybody, but it's probably worth mentioning in the commit
message.
> 2. The original code had strange behavior when it found a
> file of the form "[0-9a-f]{2}/.{38}" that did _not_
> contain all hex digits. It executed a "break" from the
> loop, meaning that we stopped pruning in that directory
> (but still pruned other directories!). This was
> probably a bug; we do not want to process the file as
> an object, but we should keep going otherwise.
>
> Signed-off-by: Jeff King <[email protected]>
> ---
> I admit the speedup in (1) almost certainly doesn't matter. It is real,
> and I found out about it while writing a different program that was
> basically "count-objects" across a large number of repositories. However
> for a single repo it's probably not big enough to matter (calling
> count-objects in a loop while get dominated by the startup costs). The
> end result is a little more obvious IMHO, but that's subjective.
>
> builtin/prune.c | 87 ++++++++++++++++------------------------------------
> cache.h | 31 +++++++++++++++++++
> sha1_file.c | 95
> +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> 3 files changed, 152 insertions(+), 61 deletions(-)
>
> [...]
> diff --git a/cache.h b/cache.h
> index cd16e25..7abe7f6 100644
> --- a/cache.h
> +++ b/cache.h
> @@ -1239,6 +1239,37 @@ extern unsigned long unpack_object_header_buffer(const
> unsigned char *buf, unsig
> extern unsigned long get_size_from_delta(struct packed_git *, struct
> pack_window **, off_t);
> extern int unpack_object_header(struct packed_git *, struct pack_window **,
> off_t *, unsigned long *);
>
> +/*
> + * Iterate over the files in the loose-object parts of the object
> + * directory "path", triggering the following callbacks:
> + *
> + * - loose_object is called for each loose object we find.
> + *
> + * - loose_cruft is called for any files that do not appear to be
> + * loose objects.
> + *
> + * - loose_subdir is called for each top-level hashed subdirectory
> + * of the object directory (e.g., "$OBJDIR/f0"). It is called
> + * after the objects in the directory are processed.
> + *
> + * Any callback that is NULL will be ignored. Callbacks returning non-zero
> + * will end the iteration.
> + */
> +typedef int each_loose_object_fn(const unsigned char *sha1,
> + const char *path,
> + void *data);
> +typedef int each_loose_cruft_fn(const char *basename,
> + const char *path,
> + void *data);
> +typedef int each_loose_subdir_fn(const char *basename,
> + const char *path,
> + void *data);
> +int for_each_loose_file_in_objdir(const char *path,
> + each_loose_object_fn obj_cb,
> + each_loose_cruft_fn cruft_cb,
> + each_loose_subdir_fn subdir_cb,
> + void *data);
> +
> struct object_info {
> /* Request */
> enum object_type *typep;
> diff --git a/sha1_file.c b/sha1_file.c
> index bae1c15..9fdad47 100644
> --- a/sha1_file.c
> +++ b/sha1_file.c
> @@ -3218,3 +3218,98 @@ void assert_sha1_type(const unsigned char *sha1, enum
> object_type expect)
> die("%s is not a valid '%s' object", sha1_to_hex(sha1),
> typename(expect));
> }
> +
> +static int opendir_error(const char *path)
> +{
> + if (errno == ENOENT)
> + return 0;
> + return error("unable to open %s: %s", path, strerror(errno));
> +}
> +
> +static int for_each_file_in_obj_subdir(struct strbuf *path,
> + const char *prefix,
> + each_loose_object_fn obj_cb,
> + each_loose_cruft_fn cruft_cb,
> + each_loose_subdir_fn subdir_cb,
> + void *data)
> +{
> + size_t baselen = path->len;
> + DIR *dir = opendir(path->buf);
> + struct dirent *de;
> + int r = 0;
> +
> + if (!dir)
> + return opendir_error(path->buf);
OK, so if there is a non-directory named $GIT_DIR/objects/33, then we
emit an "unable to open" error rather than treating it as cruft. I think
this is reasonable.
> +
> + while ((de = readdir(dir))) {
> + if (is_dot_or_dotdot(de->d_name))
> + continue;
> +
> + strbuf_setlen(path, baselen);
> + strbuf_addf(path, "/%s", de->d_name);
> +
> + if (strlen(de->d_name) == 38) {
> + char hex[41];
> + unsigned char sha1[20];
> +
> + memcpy(hex, prefix, 2);
> + memcpy(hex + 2, de->d_name, 38);
> + hex[40] = 0;
> + if (!get_sha1_hex(hex, sha1)) {
> + if (obj_cb) {
> + r = obj_cb(sha1, path->buf, data);
> + if (r)
> + break;
> + }
> + continue;
> + }
> + }
> +
> + if (cruft_cb) {
> + r = cruft_cb(de->d_name, path->buf, data);
So, files *and* directories at the $GIT_DIR/objects/XX/ level are
reported as cruft (as opposed to, say, descending into the directories
and reporting any files found deeper in the hierarchy). This seems fine,
too.
> + if (r)
> + break;
> + }
> + }
> + if (!r && subdir_cb)
> + r = subdir_cb(de->d_name, path->buf, data);
By my reading, path->buf still contains the name of the last file in the
directory at this point. I assume you want to pass it the original
"baselen"-length path here.
> + closedir(dir);
> + return r;
...and anyway, it would be more polite to restore the path strbuf to its
original length before returning.
> +}
> +
> +int for_each_loose_file_in_objdir(const char *path,
> + each_loose_object_fn obj_cb,
> + each_loose_cruft_fn cruft_cb,
> + each_loose_subdir_fn subdir_cb,
> + void *data)
> +{
> + struct strbuf buf = STRBUF_INIT;
> + size_t baselen;
> + DIR *dir = opendir(path);
> + struct dirent *de;
> + int r = 0;
> +
> + if (!dir)
> + return opendir_error(path);
> +
> + strbuf_addstr(&buf, path);
> + baselen = buf.len;
> +
> + while ((de = readdir(dir))) {
> + if (!isxdigit(de->d_name[0]) ||
> + !isxdigit(de->d_name[1]) ||
> + de->d_name[2])
> + continue;
So other files or directories at the $GIT_DIR/objects/ level are just
ignored; they are not considered cruft. This is worth clarifying in the
docstring.
> +
> + strbuf_addf(&buf, "/%s", de->d_name);
> + r = for_each_file_in_obj_subdir(&buf, de->d_name, obj_cb,
> + cruft_cb, subdir_cb, data);
> + strbuf_setlen(&buf, baselen);
> + if (r)
> + break;
> + }
> +
> + closedir(dir);
> + strbuf_release(&buf);
> + return r;
> +}
>
Other than my comments above, it looks good to me.
Michael
--
Michael Haggerty
[email protected]
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to [email protected]
More majordomo info at http://vger.kernel.org/majordomo-info.html