Re: [PATCH v4] unpack-trees: avoid duplicate ODB lookups during checkout

Jeff King Fri, 14 Apr 2017 12:52:47 -0700

On Fri, Apr 14, 2017 at 07:25:54PM +0000, [email protected] wrote:

>       for (i = 0; i < n; i++, dirmask >>= 1) {
> -             const unsigned char *sha1 = NULL;
> -             if (dirmask & 1)
> -                     sha1 = names[i].oid->hash;
> -             buf[i] = fill_tree_descriptor(t+i, sha1);
> +             if (i > 0 && are_same_oid(&names[i], &names[i - 1]))
> +                     t[i] = t[i - 1];
> +             else if (i > 1 && are_same_oid(&names[i], &names[i - 2]))
> +                     t[i] = t[i - 2];
> +             else {
> +                     const unsigned char *sha1 = NULL;
> +                     if (dirmask & 1)
> +                             sha1 = names[i].oid->hash;
> +                     buf[nr_buf++] = fill_tree_descriptor(t+i, sha1);
> +             }


This looks fine to me.

Just musing (and I do not think we need to go further than your patch),
we're slowly walking towards an actual object-content cache. The "buf"
array is now essentially a cache of all oids we've loaded, but it
doesn't know its sha1s. So we could actually encapsulate all of the
caching:

  struct object_cache {
          int nr_entries;
          struct object_cache_entry {
                  struct object_id oid;
                  void *data;
          } cache[MAX_UNPACK_TREES];
  };

and then ask it "have you seen oid X" rather than playing games with
looking at "i - 1". Of course it would have to do a linear search, so
the next step is to replace its array with a hashmap.

And now suddenly we have a reusable object-content cache that you could
use like:

  struct object_cache = {0};
  for (...) {
    /* maybe reads fresh, or maybe gets it from the cache */
    void *data = read_object_data_cached(&oid, &cache);
  }
  /* operation done, release the cache */
  clear_object_cache(&cache);

which would work anywhere you expect to load N objects and see some
overlap.

Of course it would be nicer still if this all just happened
automatically behind the scenes of read_object_data(). But it would have
to keep an _extra_ copy of each object, since the caller expects to be
able to free it. We'd probably have to return instead a struct with
buffer/size in it along with a reference counter.

I don't think any of that is worth it unless there are spots where we
really expect there to be a lot of cases where we hit the same objects
in rapid succession. I don't think there should be, though. Our usual
"caching" mechanism is to create a "struct object", which is enough to
perform most operations (and has a much smaller memory footprint).

So again, just musing. I think your patch is fine as-is.

-Peff

Re: [PATCH v4] unpack-trees: avoid duplicate ODB lookups during checkout

Reply via email to