Previously, any time we wanted to read even a single reference from
the `packed-refs` file, we parsed the whole file and stored it in an
elaborate structure in memory called a `ref_cache`. Subsequent
reference lookups or iterations over some or all of the references
could be done by reading from the `ref_cache`.
But for large `packed-refs` files, the time needed to parse the file,
and the memory needed to cache its contents, can be quite significant.
This is partly because building the cache costs lots of little memory
allocations. (And lest you think that most Git commands can be
executed by reading at most a couple of loose references, remember
that almost any command that reads objects has to look for replace
references (unless they are turned off in the config), and *that*
necessarily entails reading packed references.)
Following lots of work to extract the `packed_ref_store` into a
separate module and decouple it from the `files_ref_store`, it is now
possible to fundamentally change how the `packed-refs` file is read.
* `mmap()` the whole file rather than `read()`ing it.
* Instead of parsing the complete file at once into a `ref_cache`,
  parse the references out of the file contents on demand.
* Use a binary search to find, very quickly, the reference or group of
  references that needs to be read. Parse *only* those references out
  of the file contents, without creating in-memory data structures at
  all.
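To make the lookup concrete, here is a minimal C sketch of a binary search over a buffer of `packed-refs` records. It is an illustration only, not the code in this series. It assumes the buffer starts after the header line, is sorted by reference name, ends every line with a newline, and stores each record as `<oid> <refname>` optionally followed by a `^<peeled-oid>` line. The names `find_record()`, `line_start()` and `line_end()` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Back up from p to the start of its line, never going before lo. */
static const char *line_start(const char *lo, const char *p)
{
	while (p > lo && p[-1] != '\n')
		p--;
	return p;
}

/* Return a pointer just past the newline ending the line at p. */
static const char *line_end(const char *p, const char *eof)
{
	const char *nl = memchr(p, '\n', (size_t)(eof - p));
	return nl ? nl + 1 : eof;
}

/*
 * Return a pointer to the record for refname inside buf[0..len),
 * or NULL if it is absent (or a malformed record is encountered).
 * Nothing is copied or allocated; only the probed lines are touched.
 */
static const char *find_record(const char *buf, size_t len,
			       const char *refname)
{
	const char *lo = buf, *hi = buf + len;
	size_t want;

	assert(buf != NULL && refname != NULL);
	want = strlen(refname);

	/* Invariant: lo and hi always point at a record start or at EOF. */
	while (lo < hi) {
		const char *mid = lo + (hi - lo) / 2;
		const char *rec = line_start(lo, mid);
		const char *end, *sp, *name;
		size_t name_len, n;
		int cmp;

		/* Landed on a peeled line: its record is the line before. */
		if (*rec == '^' && rec > lo)
			rec = line_start(lo, rec - 1);

		end = line_end(rec, hi);
		sp = memchr(rec, ' ', (size_t)(end - rec));
		if (!sp || end[-1] != '\n')
			return NULL;	/* malformed record */

		name = sp + 1;
		name_len = (size_t)(end - 1 - name);
		n = want < name_len ? want : name_len;
		cmp = memcmp(refname, name, n);
		if (!cmp && want != name_len)
			cmp = want < name_len ? -1 : 1;

		if (!cmp)
			return rec;
		if (cmp < 0) {
			hi = rec;
		} else {
			lo = end;
			if (lo < hi && *lo == '^')	/* skip peeled line */
				lo = line_end(lo, hi);
		}
	}
	return NULL;
}
```

With a buffer obtained from `mmap()`, a call such as `find_record(buf, len, "refs/heads/master")` touches only the few lines that the search probes, which is why most of the file never has to be read.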
In rare cases this change might force parts of the `packed-refs` file
to be read multiple times, but that cost is far outweighed by the fact
that usually most of the `packed-refs` file doesn't have to be read
*at all*.
Note that the binary search optimization requires the `packed-refs`
file to be sorted by reference name. We have always written it
sorted, but just in case there are clients that don't, we assume the
file is unsorted unless its header lists a `sorted` trait. From now
on, we write the file with that trait. If the file is not sorted, it
is sorted on the fly in memory.
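As an illustration of the trait check, the following C sketch tests whether a header line lists a given trait. It is not the parser from this series. It assumes the header has the form `# pack-refs with: <trait> <trait> ...` terminated by a newline, and the function name `header_has_trait()` is hypothetical. A reader would fall back to sorting in memory whenever the check for `sorted` fails.

```c
#include <assert.h>
#include <string.h>

/*
 * Return 1 if the header line lists trait as a whole,
 * space-separated word, 0 otherwise (including when the line
 * is not a header at all).
 */
static int header_has_trait(const char *line, const char *trait)
{
	static const char prefix[] = "# pack-refs with:";
	size_t tlen = strlen(trait);
	const char *p;

	assert(tlen > 0);
	if (strncmp(line, prefix, sizeof(prefix) - 1))
		return 0;	/* no header: assume no traits */

	p = line + sizeof(prefix) - 1;
	while (*p && *p != '\n') {
		size_t n;

		p += strspn(p, " ");		/* skip separators */
		n = strcspn(p, " \n");		/* length of this word */
		if (n == tlen && !memcmp(p, trait, n))
			return 1;
		p += n;
	}
	return 0;
}
```

Matching whole words matters here: a substring search would wrongly report `peeled` as present in a header that lists only `fully-peeled`.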
For a repository with only a couple thousand references and a warm
disk cache, this change doesn't make a very significant difference.
But for repositories with very large numbers of references, the
difference starts to be significant:
A repository with 54k references (warm cache):

                                      git 2.13.1    this branch
git for-each-ref                          464 ms         452 ms
git for-each-ref (no output)               66 ms          47 ms
git for-each-ref (0.6% of refs)            47 ms           9 ms
git for-each-ref (0.6%, no output)         41 ms           2 ms
git rev-parse                              32 ms           2 ms

A repository (admittedly insane, but a real-life example) with 60M
references (warm cache):

                                      git 2.13.1    this branch
git for-each-ref (no output)            84000 ms       61000 ms
git rev-parse                           40000 ms           2 ms

This branch applies on top of mh/packed-ref-transactions. It can also
be obtained from my git fork [1] as branch `mmap-packed-refs`.
Michael
[1] https://github.com/mhagger/git

Jeff King (1):
  prefix_ref_iterator: break when we leave the prefix

Michael Haggerty (19):
  ref_iterator: keep track of whether the iterator output is ordered
  packed_ref_cache: add a backlink to the associated `packed_ref_store`
  die_unterminated_line(), die_invalid_line(): new functions
  read_packed_refs(): use mmap to read the `packed-refs` file
  read_packed_refs(): only check for a header at the top of the file
  read_packed_refs(): make parsing of the header line more robust
  read_packed_refs(): read references with minimal copying
  packed_ref_cache: remember the file-wide peeling state
  mmapped_ref_iterator: add iterator over a packed-refs file
  mmapped_ref_iterator_advance(): no peeled value for broken refs
  packed_ref_cache: keep the `packed-refs` file open and mmapped
  read_packed_refs(): ensure that references are ordered when read
  packed_ref_iterator_begin(): iterate using `mmapped_ref_iterator`
  packed_read_raw_ref(): read the reference from the mmapped buffer
  ref_store: implement `refs_peel_ref()` generically
  packed_ref_store: get rid of the `ref_cache` entirely
  ref_cache: remove support for storing peeled values
  mmapped_ref_iterator: inline into `packed_ref_iterator`
  packed-backend.c: rename a bunch of things and update comments

 refs.c                |  22 +-
 refs/files-backend.c  |  54 +--
 refs/iterator.c       |  47 ++-
 refs/packed-backend.c | 896 +++++++++++++++++++++++++++++++++++++-------------
 refs/ref-cache.c      |  44 +--
 refs/ref-cache.h      |  35 +-
 refs/refs-internal.h  |  26 +-
 7 files changed, 761 insertions(+), 363 deletions(-)

--
2.14.1