Re: [PATCH v2 00/21] Read `packed-refs` using mmap()

2017-09-25 Thread Johannes Schindelin
Hi Peff,


On Wed, 20 Sep 2017, Jeff King wrote:

> On Tue, Sep 19, 2017 at 08:22:08AM +0200, Michael Haggerty wrote:
> 
> > This branch is also available from my fork on GitHub as branch
> > `mmap-packed-refs`.
> > 
> > My main uncertainties are:
> > 
> > 1. Does this code actually work on Windows?
> > 
> 
> I can't really answer (1) due to lack of knowledge.

Sorry, I have only a couple of minutes per day to look through the flood
of emails from the Git mailing list and to answer them, so I forgot to say
that I had a VM build and test Michael's patches, and they worked on
Windows.

Ciao,
Dscho


Re: [PATCH v2 00/21] Read `packed-refs` using mmap()

2017-09-20 Thread Jeff King
On Tue, Sep 19, 2017 at 08:22:08AM +0200, Michael Haggerty wrote:

> This branch is also available from my fork on GitHub as branch
> `mmap-packed-refs`.
> 
> My main uncertainties are:
> 
> 1. Does this code actually work on Windows?
> 
> 2. Did I implement the new compile-time option correctly? (I just
>cargo-culted some existing options.) Is there some existing option
>that I could piggy-back off of instead of adding a new one?
> 
> 3. Is a compile-time option sufficient, or would the `mmap()` option
>need to be configurable at runtime, or even tested at repository
>create time as is done for some other filesystem properties in
>`create_default_files()`?

I can't really answer (1) due to lack of knowledge. For (2), I think
it's mostly good, though I raised a small nit. For (3), I think this is
really an OS restriction and not a filesystem one, so probably
build-time is OK (though of course Windows people may be able to correct
me).

I left a few comments, but overall it looks quite good to me.

-Peff


Re: [PATCH v2 00/21] Read `packed-refs` using mmap()

2017-09-19 Thread Johannes Schindelin
Hi Michael,

On Tue, 19 Sep 2017, Michael Haggerty wrote:

> This is v2 of a patch series that changes the reading and caching of the
> `packed-refs` file to use `mmap()`. Thanks to Junio, Stefan, and
> Johannes for their comments about v1 [1].

Thank you for the new iteration.

> The main change since v1 is to accommodate Windows, which doesn't let
> you replace a file using `rename()` if the file is currently mmapped.
> This is unfortunate, because it means that Windows will never get the
> O(N) → O(lg N) improvement for reading single references that more
> capable systems can now enjoy.

Triggered by your enquiry, I looked into passing the FILE_SHARE_DELETE
flag which I hoped would let us delete the file even if it still is open
(and mapped). In my tests, this did not work. If anybody wants to have a
look at what I did (and whether they can make it work):
https://github.com/dscho/git/tree/replace-wopen

> The background was discussed on the mailing list [2]. The bottom line
> is that on Windows, keeping the `packed-refs` lock mmapped would be
> tantamount to holding reader lock on that file, preventing anybody
> (even unrelated processes) from changing the `packed-refs` file while
> it is mmapped. This is even worse than the situation for packfiles
> (which is solved using `close_all_packs()`), because a packfile, once
> created, never needs to be replaced—every packfile has a filename that
> is determined from its contents. The worst that can happen if a
> packfile is locked is that another process cannot remove it, but that
> is not critical for correctness. The `packed-refs` file, on the other
> hand, always has the same filename and needs to be overwritten for
> correctness.
> 
> So the approach taken here is that a new compile-time option,
> `MMAP_PREVENTS_DELETE`, is introduced. When this option is set, then
> the `packed-refs` file is read quickly into memory then closed.

Another approach would be to imitate close_all_packs() and rely on the
Windows-specific code that retries renames in a staggered fashion, waiting
a little longer and longer before retrying, and finally telling the user
that some file cannot be overwritten:
https://github.com/git-for-windows/git/blob/v2.14.1.windows.1/compat/mingw.c#L2439-L2441

This is not a new problem, by the way. If a file is in use while you try
to run `git checkout` with a different version of that file, we have the
exact same problem on Windows. And we deal with it using that
retry_ask_yes_no() function.

For this to work, the current process really would need to be able to
release all snapshots in one go (for simplicity, I would not even check
the filename but simply blow them all away when we want to overwrite
packed-refs).

I guess I should set aside some time to implement that on top of your
series (I *really* want our in-house users to benefit from that O(lg n)
improvement). In the meantime, I think this can go forward with the
current design.

Ciao,
Dscho

[PATCH v2 00/21] Read `packed-refs` using mmap()

2017-09-19 Thread Michael Haggerty
This is v2 of a patch series that changes the reading and caching of
the `packed-refs` file to use `mmap()`. Thanks to Junio, Stefan, and
Johannes for their comments about v1 [1].

The main change since v1 is to accommodate Windows, which doesn't let
you replace a file using `rename()` if the file is currently mmapped.
This is unfortunate, because it means that Windows will never get the
O(N) → O(lg N) improvement for reading single references that more
capable systems can now enjoy.

The background was discussed on the mailing list [2]. The bottom line
is that on Windows, keeping the `packed-refs` lock mmapped would be
tantamount to holding reader lock on that file, preventing anybody
(even unrelated processes) from changing the `packed-refs` file while
it is mmapped. This is even worse than the situation for packfiles
(which is solved using `close_all_packs()`), because a packfile, once
created, never needs to be replaced—every packfile has a filename that
is determined from its contents. The worst that can happen if a
packfile is locked is that another process cannot remove it, but that
is not critical for correctness. The `packed-refs` file, on the other
hand, always has the same filename and needs to be overwritten for
correctness.

So the approach taken here is that a new compile-time option,
`MMAP_PREVENTS_DELETE`, is introduced. When this option is set, then
the `packed-refs` file is read quickly into memory then closed.

Even in that case, though, this branch brings significant performance
benefits, because instead of parsing the whole file and storing it
into lots of little objects in a `ref_cache` (which also involves a
lot of small memory allocations), we copy the verbatim contents of the
file into memory. Then we use the same binary search techniques to
find any references that we need to read, just as we would do if the
file were memory mapped. This means that we only have to fully parse
the references that we are interested in, and hardly have to allocate
any additional memory.

I did some more careful benchmarks of this code vs. Git 2.14.1 on a
repository that is not quite as pathological. The test repo has 110k
references that are fully packed in a `packed-refs` file that has the
`sorted` trait. The current version is compiled three ways:

* `NO_MMAP=YesPlease`—prevents all use of `mmap()`. This variant is
  O(N) when reading a single reference.

* `MMAP_PREVENTS_DELETE=YesPlease`—uses mmap for the initial read, but
  quickly copies the contents to heap-allocated memory and munmaps
  right away. This variant is also O(N) when reading a single
  reference.

* default (mmap enabled)—the `packed-refs` file is kept mmapped for as
  long as it is in use.

The commands that I timed were as follows:

# for-each-ref, warm cache:
$ git -C lots-of-refs for-each-ref --format="%(objectname) %(refname)" 
>/dev/null

# rev-parse, warm cache (this command was run 10 times then the total
# time divided by 10):
$ git -C lots-of-refs rev-parse --verify refs/remotes/origin/pr/38733

# rev-parse, cold cache (but git binary warm):
$ sync ; sudo sh -c 'echo 3 >/proc/sys/vm/drop_caches'; git rev-parse v1.0.0; 
time git -C lots-of-refs rev-parse --verify refs/remotes/origin/pr/38733

(Note that the `rev-parse` commands involve a handful of reference
lookups as the argument is DWIMmed.)

Results:

   for-each-ref rev-parse rev-parse
 warm cachewarm cachecold cache
   ----
v2.14.1   92 ms   23.7 ms 30 ms
NO_MMAP=YesPlease 83 ms3.4 ms 10 ms
MMAP_PREVENTS_DELETE=YesPlease82 ms3.5 ms 11 ms
default (mmap enabled)81 ms0.8 ms  6 ms

So this branch is a little bit faster at iterating over all
references, but it really has big advantages when looking up single
references. The advantage is smaller if `NO_MMAP` or
`MMAP_PREVENTS_DELETE` is set, but is still quite significant.

This branch is also available from my fork on GitHub as branch
`mmap-packed-refs`.

My main uncertainties are:

1. Does this code actually work on Windows?

2. Did I implement the new compile-time option correctly? (I just
   cargo-culted some existing options.) Is there some existing option
   that I could piggy-back off of instead of adding a new one?

3. Is a compile-time option sufficient, or would the `mmap()` option
   need to be configurable at runtime, or even tested at repository
   create time as is done for some other filesystem properties in
   `create_default_files()`?

Michael

[1] https://public-inbox.org/git/cover.1505319366.git.mhag...@alum.mit.edu/
[2] 
https://public-inbox.org/git/alpine.DEB.2.21.1.1709142101560.4132@virtualbox/

https://public-inbox.org/git/alpine.DEB.2.21.1.1709150012550.219280@virtualbox/
[3] https://github.com/mhagger/git

Jeff King (1):