Previously, any time we wanted to read even a single reference from
the `packed-refs` file, we parsed the whole file and stored it in an
elaborate structure in memory called a `ref_cache`. Subsequent
reference lookups or iterations over some or all of the references
could be done by reading from the `ref_cache`.

But for large `packed-refs` files, the time needed to parse the file,
and the memory needed to cache its contents, can be quite significant.
This is partly because building the cache costs lots of little memory
allocations. (And lest you think that most Git commands can be
executed by reading at most a couple of loose references, remember
that almost any command that reads objects has to look for replace
references (unless they are turned off in the config), and *that*
necessarily entails reading packed references.)

Following lots of work to extract the `packed_ref_store` into a
separate module and decouple it from the `files_ref_store`, it is now
possible to fundamentally change how the `packed-refs` file is read.

* `mmap()` the whole file rather than `read()`ing it.

* Instead of parsing the complete file at once into a `ref_cache`,
  parse the references out of the file contents on demand.

* Use a binary search to find, very quickly, the reference or group of
  references that needs to be read. Parse *only* those references out
  of the file contents, without creating in-memory data structures at
  all.

In rare cases this change might force parts of the `packed-refs` file
to be read multiple times, but that cost is far outweighed by the fact
that usually most of the `packed-refs` file doesn't have to be read
*at all*.

Note that the binary search optimization requires the `packed-refs`
file to be sorted by reference name. We have always written them
sorted, but just in case there are clients that don't, we assume the
file is unsorted unless its header lists a `sorted` trait. From now
on, we write the file with that trait. If the file is not sorted, it
is sorted on the fly in memory.

For a repository with only a couple thousand references and a warm
disk cache, this change doesn't make a very significant difference.
But for repositories with very large numbers of references, the
difference start to be significant:

A repository with 54k references (warm cache):

                                  git 2.13.1         this branch
git for-each-ref                      464 ms              452 ms
git for-each-ref (no output)           66 ms               47 ms
git for-each-ref (0.6% of refs)        47 ms                9 ms
git for-each-ref (0.6%, no output)     41 ms                2 ms
git rev-parse                          32 ms                2 ms

A repository (admittedly insane, but a real-life example) with 60M
references (warm cache):

                                  git 2.13.1         this branch
git for-each-ref (no output)        84000 ms            61000 ms
git rev-parse                       40000 ms                2 ms

This branch applies on top of mh/packed-ref-transactions. It can also
be obtained from my git fork [1] as branch `mmap-packed-refs`.

Michael

[1] https://github.com/mhagger/git

Jeff King (1):
  prefix_ref_iterator: break when we leave the prefix

Michael Haggerty (19):
  ref_iterator: keep track of whether the iterator output is ordered
  packed_ref_cache: add a backlink to the associated `packed_ref_store`
  die_unterminated_line(), die_invalid_line(): new functions
  read_packed_refs(): use mmap to read the `packed-refs` file
  read_packed_refs(): only check for a header at the top of the file
  read_packed_refs(): make parsing of the header line more robust
  read_packed_refs(): read references with minimal copying
  packed_ref_cache: remember the file-wide peeling state
  mmapped_ref_iterator: add iterator over a packed-refs file
  mmapped_ref_iterator_advance(): no peeled value for broken refs
  packed_ref_cache: keep the `packed-refs` file open and mmapped
  read_packed_refs(): ensure that references are ordered when read
  packed_ref_iterator_begin(): iterate using `mmapped_ref_iterator`
  packed_read_raw_ref(): read the reference from the mmapped buffer
  ref_store: implement `refs_peel_ref()` generically
  packed_ref_store: get rid of the `ref_cache` entirely
  ref_cache: remove support for storing peeled values
  mmapped_ref_iterator: inline into `packed_ref_iterator`
  packed-backend.c: rename a bunch of things and update comments

 refs.c                |  22 +-
 refs/files-backend.c  |  54 +--
 refs/iterator.c       |  47 ++-
 refs/packed-backend.c | 896 +++++++++++++++++++++++++++++++++++++-------------
 refs/ref-cache.c      |  44 +--
 refs/ref-cache.h      |  35 +-
 refs/refs-internal.h  |  26 +-
 7 files changed, 761 insertions(+), 363 deletions(-)

-- 
2.14.1

Reply via email to