Re: [PATCH 09/16] prune: factor out loose-object directory traversal

Michael Haggerty Tue, 07 Oct 2014 07:08:55 -0700

On 10/03/2014 10:29 PM, Jeff King wrote:
> Prune has to walk $GIT_DIR/objects/?? in order to find the
> set of loose objects to prune. Other parts of the code
> (e.g., count-objects) want to do the same. Let's factor it
> out into a reusable for_each-style function.
> 
> Note that this is not quite a straight code movement. There
> are two differences:
> 
>   1. The original code iterated from 0 to 256, trying to
>      opendir("$GIT_DIR/%02x"). The new code just does a
>      readdir() on the object directory, and descends into
>      any matching directories. This is faster on
>      already-pruned repositories, and should not ever be
>      slower (nobody ever creates other files in the object
>      directory).


This would change the order in which the objects are processed. I doubt
that matters to anybody, but it's probably worth mentioning in the commit
message.
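
If the old 00..ff ordering ever did matter to a caller, one way to get it back would be to collect the matching subdirectory names first and sort them before descending. The following is only a minimal sketch of that idea, not part of the patch; sort_subdir_names() and cmp_names() are hypothetical helpers.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* qsort() comparator for an array of "const char *" names. */
static int cmp_names(const void *a, const void *b)
{
	return strcmp(*(const char *const *)a, *(const char *const *)b);
}

/*
 * Hypothetical helper: sort the collected two-character subdirectory
 * names so they are visited in a stable, byte-wise order no matter
 * what order readdir() returned them in.
 */
static void sort_subdir_names(const char **names, size_t n)
{
	qsort(names, n, sizeof(*names), cmp_names);
}
```

This trades a small allocation and a sort of at most 256 short names for deterministic output; whether that is worth it depends on whether anything actually relies on the order.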

>   2. The original code had strange behavior when it found a
>      file of the form "[0-9a-f]{2}/.{38}" that did _not_
>      contain all hex digits. It executed a "break" from the
>      loop, meaning that we stopped pruning in that directory
>      (but still pruned other directories!). This was
>      probably a bug; we do not want to process the file as
>      an object, but we should keep going otherwise.
> 
> Signed-off-by: Jeff King <[email protected]>
> ---
> I admit the speedup in (1) almost certainly doesn't matter. It is real,
> and I found out about it while writing a different program that was
> basically "count-objects" across a large number of repositories. However
> for a single repo it's probably not big enough to matter (calling
> count-objects in a loop will get dominated by the startup costs). The
> end result is a little more obvious IMHO, but that's subjective.
> 
>  builtin/prune.c | 87 ++++++++++++++++------------------------------------
>  cache.h         | 31 +++++++++++++++++++
>  sha1_file.c     | 95 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  3 files changed, 152 insertions(+), 61 deletions(-)
> 
> [...]
> diff --git a/cache.h b/cache.h
> index cd16e25..7abe7f6 100644
> --- a/cache.h
> +++ b/cache.h
> @@ -1239,6 +1239,37 @@ extern unsigned long unpack_object_header_buffer(const unsigned char *buf, unsig
>  extern unsigned long get_size_from_delta(struct packed_git *, struct pack_window **, off_t);
>  extern int unpack_object_header(struct packed_git *, struct pack_window **, off_t *, unsigned long *);
>  
> +/*
> + * Iterate over the files in the loose-object parts of the object
> + * directory "path", triggering the following callbacks:
> + *
> + *  - loose_object is called for each loose object we find.
> + *
> + *  - loose_cruft is called for any files that do not appear to be
> + *    loose objects.
> + *
> + *  - loose_subdir is called for each top-level hashed subdirectory
> + *    of the object directory (e.g., "$OBJDIR/f0"). It is called
> + *    after the objects in the directory are processed.
> + *
> + * Any callback that is NULL will be ignored. Callbacks returning non-zero
> + * will end the iteration.
> + */
> +typedef int each_loose_object_fn(const unsigned char *sha1,
> +                              const char *path,
> +                              void *data);
> +typedef int each_loose_cruft_fn(const char *basename,
> +                             const char *path,
> +                             void *data);
> +typedef int each_loose_subdir_fn(const char *basename,
> +                              const char *path,
> +                              void *data);
> +int for_each_loose_file_in_objdir(const char *path,
> +                               each_loose_object_fn obj_cb,
> +                               each_loose_cruft_fn cruft_cb,
> +                               each_loose_subdir_fn subdir_cb,
> +                               void *data);
> +
>  struct object_info {
>       /* Request */
>       enum object_type *typep;
> diff --git a/sha1_file.c b/sha1_file.c
> index bae1c15..9fdad47 100644
> --- a/sha1_file.c
> +++ b/sha1_file.c
> @@ -3218,3 +3218,98 @@ void assert_sha1_type(const unsigned char *sha1, enum object_type expect)
>               die("%s is not a valid '%s' object", sha1_to_hex(sha1),
>                   typename(expect));
>  }
> +
> +static int opendir_error(const char *path)
> +{
> +     if (errno == ENOENT)
> +             return 0;
> +     return error("unable to open %s: %s", path, strerror(errno));
> +}
> +
> +static int for_each_file_in_obj_subdir(struct strbuf *path,
> +                                    const char *prefix,
> +                                    each_loose_object_fn obj_cb,
> +                                    each_loose_cruft_fn cruft_cb,
> +                                    each_loose_subdir_fn subdir_cb,
> +                                    void *data)
> +{
> +     size_t baselen = path->len;
> +     DIR *dir = opendir(path->buf);
> +     struct dirent *de;
> +     int r = 0;
> +
> +     if (!dir)
> +             return opendir_error(path->buf);

OK, so if there is a non-directory named $GIT_DIR/objects/33, then we
emit an "unable to open" error rather than treating it as cruft. I think
this is reasonable.
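
To spell out the reasoning: opendir() on a non-directory fails with an errno other than ENOENT (ENOTDIR on POSIX systems), so it takes the error path. A standalone sketch of that decision follows. report_opendir_failure() is a hypothetical stand-in that takes errno as a parameter and prints to stderr, where the patch calls error().

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical stand-in for the patch's opendir_error(): a missing
 * directory is silently fine, anything else is reported and turned
 * into a negative return value.
 */
static int report_opendir_failure(const char *path, int err)
{
	if (err == ENOENT)
		return 0;
	fprintf(stderr, "error: unable to open %s: %s\n", path, strerror(err));
	return -1;
}
```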

> +
> +     while ((de = readdir(dir))) {
> +             if (is_dot_or_dotdot(de->d_name))
> +                     continue;
> +
> +             strbuf_setlen(path, baselen);
> +             strbuf_addf(path, "/%s", de->d_name);
> +
> +             if (strlen(de->d_name) == 38)  {
> +                     char hex[41];
> +                     unsigned char sha1[20];
> +
> +                     memcpy(hex, prefix, 2);
> +                     memcpy(hex + 2, de->d_name, 38);
> +                     hex[40] = 0;
> +                     if (!get_sha1_hex(hex, sha1)) {
> +                             if (obj_cb) {
> +                                     r = obj_cb(sha1, path->buf, data);
> +                                     if (r)
> +                                             break;
> +                             }
> +                             continue;
> +                     }
> +             }
> +
> +             if (cruft_cb) {
> +                     r = cruft_cb(de->d_name, path->buf, data);

So, files *and* directories at the $GIT_DIR/objects/XX/ level are
reported as cruft (as opposed to, say, descending into the directories
and reporting any files found deeper in the hierarchy). This seems fine,
too.
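
The classification at this level depends only on the entry's name, which can be sketched in isolation as below. classify_entry() and the enum are hypothetical names, and the explicit hex loop stands in for the patch's get_sha1_hex() call. As far as I can tell the patch never checks the entry's type, so a directory whose name happened to be 38 hex digits would be handed to the object callback rather than the cruft one.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

enum entry_kind { ENTRY_OBJECT, ENTRY_CRUFT };

/*
 * Hypothetical helper: an entry under objects/XX/ counts as a loose
 * object only if its name is exactly 38 hex digits; anything else
 * (file or directory) is cruft.
 */
static enum entry_kind classify_entry(const char *name)
{
	size_t i;

	if (strlen(name) != 38)
		return ENTRY_CRUFT;
	for (i = 0; i < 38; i++)
		if (!isxdigit((unsigned char)name[i]))
			return ENTRY_CRUFT;
	return ENTRY_OBJECT;
}
```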

> +                     if (r)
> +                             break;
> +             }
> +     }
> +     if (!r && subdir_cb)
> +             r = subdir_cb(de->d_name, path->buf, data);

By my reading, path->buf still contains the name of the last file in the
directory at this point. I assume you want to pass it the original
"baselen"-length path here.

> +     closedir(dir);
> +     return r;

...and anyway, it would be more polite to restore the path strbuf to its
original length before returning.
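
A minimal sketch of what the two comments above ask for, using a plain char buffer instead of git's strbuf so that it stands alone. walk_subdir(), record_path_len(), and the "names" array (standing in for readdir()) are all hypothetical. The sketch also passes the directory's own name to the callback rather than de->d_name, since de is NULL once readdir() has exhausted the directory; that substitution is an assumption about the intent.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

typedef int subdir_fn(const char *basename, const char *path, void *data);

/* Hypothetical callback: remember the length of the path we were given. */
static int record_path_len(const char *basename, const char *path, void *data)
{
	(void)basename;
	*(size_t *)data = strlen(path);
	return 0;
}

/* Hypothetical stand-in for the loop; "names" replaces readdir(). */
static int walk_subdir(char *path, size_t cap, const char *prefix,
		       const char *const *names, size_t n,
		       subdir_fn *subdir_cb, void *data)
{
	size_t baselen = strlen(path);
	size_t i;
	int r = 0;

	for (i = 0; i < n; i++) {
		/* Reset to the directory, then append "/<entry>". */
		path[baselen] = '\0';
		snprintf(path + baselen, cap - baselen, "/%s", names[i]);
		/* ... per-entry callbacks would run here ... */
	}

	/* Restore the caller's path before reporting the directory itself. */
	path[baselen] = '\0';
	if (subdir_cb)
		r = subdir_cb(prefix, path, data);
	return r;
}
```

Truncating before the callback covers both points at once: the callback sees the directory path, and the caller gets its buffer back at the original length.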

> +}
> +
> +int for_each_loose_file_in_objdir(const char *path,
> +                         each_loose_object_fn obj_cb,
> +                         each_loose_cruft_fn cruft_cb,
> +                         each_loose_subdir_fn subdir_cb,
> +                         void *data)
> +{
> +     struct strbuf buf = STRBUF_INIT;
> +     size_t baselen;
> +     DIR *dir = opendir(path);
> +     struct dirent *de;
> +     int r = 0;
> +
> +     if (!dir)
> +             return opendir_error(path);
> +
> +     strbuf_addstr(&buf, path);
> +     baselen = buf.len;
> +
> +     while ((de = readdir(dir))) {
> +             if (!isxdigit(de->d_name[0]) ||
> +                 !isxdigit(de->d_name[1]) ||
> +                 de->d_name[2])
> +                     continue;

So other files or directories at the $GIT_DIR/objects/ level are just
ignored; they are not considered cruft. This is worth clarifying in the
docstring.
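
For reference, the filter amounts to the predicate below, shown as a standalone sketch. is_hashed_subdir_name() is a hypothetical name, and the casts to unsigned char are there because this version uses the standard <ctype.h> isxdigit().

```c
#include <assert.h>
#include <ctype.h>

/*
 * Hypothetical helper: returns 1 if "name" is exactly two hex digits
 * (e.g. "f0"), 0 otherwise. Entries such as "pack" and "info", and
 * "." and "..", fail the test and are skipped without any callback.
 */
static int is_hashed_subdir_name(const char *name)
{
	return isxdigit((unsigned char)name[0]) &&
	       isxdigit((unsigned char)name[1]) &&
	       name[2] == '\0';
}
```

Note that this predicate also accepts uppercase digits such as "F0", which the old "%02x" loop would never have tried to open by that name.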

> +
> +             strbuf_addf(&buf, "/%s", de->d_name);
> +             r = for_each_file_in_obj_subdir(&buf, de->d_name, obj_cb,
> +                                             cruft_cb, subdir_cb, data);
> +             strbuf_setlen(&buf, baselen);
> +             if (r)
> +                     break;
> +     }
> +
> +     closedir(dir);
> +     strbuf_release(&buf);
> +     return r;
> +}
> 

Other than my comments above, it looks good to me.

Michael

-- 
Michael Haggerty
[email protected]
