Re: [PATCH 04/19] index-helper: new daemon for caching index and related stuff
Duy Nguyen writes: > Another aspect that's not mentioned is, we keep this daemon's logic as > thin as possible. The "brain" stays in git. So the daemon can read and > validate stuff, but that's about all it's allowed to do. It's not > supposed to add/create new contents. It's not even allowed to accept > direct updates from git. That explanation does make the intent clear. It is a kind of design decision that needs to be made early and that is hard to change later (I am not saying I see the need of changing, though), so it is worth stating explicitly to guide future readers and updaters of the code. Thanks. -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 04/19] index-helper: new daemon for caching index and related stuff
On Thu, Mar 10, 2016 at 6:09 AM, Junio C Hamano wrote: > David Turner writes: > >> From: Nguyễn Thái Ngọc Duy >> >> Instead of reading the index from disk and worrying about disk >> corruption, the index is cached in memory (memory bit-flips happen >> too, but hopefully less often). The result is faster read. Read time >> is reduced by 70%. >> >> The biggest gain is not having to verify the trailing SHA-1, which >> takes lots of time especially on large index files. But this also >> opens doors for further optimiztions: >> >> - we could create an in-memory format that's essentially the memory >>dump of the index to eliminate most of parsing/allocation >>overhead. The mmap'd memory can be used straight away. Experiment >>[1] shows we could reduce read time by 88%. >> >> - we could cache non-index info such as name hash >> >> The shared memory's name folows the template "git--" >> where is the trailing SHA-1 of the index file. is >> "index" for cached index files (and may be "name-hash" for name-hash >> cache). If such shared memory exists, it contains the same index >> content as on disk. The content is already validated by the daemon and >> git won't validate it again (except comparing the trailing SHA-1s). > > This indeed is an interesting approach; what is not explained but > must be is when the on-disk index is updated to reflect the reality > (if I am reading the explanation and the code right, while the > daemon is running, its in-core cache becomes the source of truth by > forcing everybody's read-index-from() to go to the daemon). The > explanation could be "this is only for read side, and updating the > index happens via the traditional 'write a new file and rename it to > the final place' codepath, at which time the daemon must be told to > re-read it." Another aspect that's not mentioned is, we keep this daemon's logic as thin as possible. The "brain" stays in git. So the daemon can read and validate stuff, but that's about all it's allowed to do. It's not supposed to add/create new contents. It's not even allowed to accept direct updates from git. -- Duy -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 04/19] index-helper: new daemon for caching index and related stuff
On Thu, Mar 10, 2016 at 1:36 AM, David Turner wrote: > Git can poke the daemon to tell it to refresh the index cache, or to > keep it alive some more minutes via UNIX signals. The reason I went with UNIX signals was because it made it possible to make a simple GetMessage loop, the only thing I can remember from my Windows time, on Windows later. It sounded clever, but because this is more like UDP (vs TCP) it's harder for communication. For example, we can't get a confirmation after a request... UNIX sockets would be more natural. Since this patch was written, watchman has gained Windows support. I just looked at the code, it uses named pipe on Windows. So maybe we can just go with that too (if only because it has been proven working in practice) and we can go back to UNIX sockets on the *nix side. Too bad we can't just copy some functions from watchman because of license incompatibility. But we can leave Windows support to gfw team now, I think. -- Duy -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 04/19] index-helper: new daemon for caching index and related stuff
On Fri, Mar 11, 2016 at 3:22 AM, David Turner wrote: > Reads (ignoring SHA verification) will be slightly slower (due to the > btree overhead). If, in general, we only had to read part of the > index, that would be faster. But a fair amount of git code is written > to assume that it is reasonable to iterate over the whole index. So I > don't see a win from just switching to LMDB without other changes. I didn't think of the btree overhead. Will need to run some test.. I think lmdb just gives us pointers to mmap'd file, so an entire index reading can be fast. We just record where the on-disk data is. We need to make index-on-lmdb use native byte endian though to avoid conversion and actually reading all index entries because of that. > (I guess we could parallelize index deserialization more easily with > lmdb, but I don't know > > The original intent of the index was to be a "staging area" to build up > a tree before committing it. This implies an alternate view of the > index as a diff against some (possibly-empty) tree. An index diff or > status operation, then, just requires iterating over the changes. > > I haven't really dug into merges to understand whether it would be > reasonable for them to use a representation like this, and what that > would do to speed there. I think we keep the same internal representation of the index, a list of paths with extra info like stat and some flags. Minimal impacts to the code base that way. There are some cases (I think) where we just need to read a portion of the index (filtered by pathspec). If we don't update the index afterwards, lmdb partial reads definitely help. I need to check if those cases are common (and worth optimizing for) though. > In general, git takes the commit side of the commit-diff duality. This > is generally a good thing, because commits are simpler to reason about. > But I wonder if we could do better for this specific case. > > But I don't think switching to LMDB would replace index-helper. The point of index-helper is reduce reading cost (let's leave watchman out for now). But it looks like we're not certain if using lmdb can reduce reading cost at lower complexity. I'll give it some thought over the weekend. Maybe in the end we better just go with index-helper... -- Duy -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 04/19] index-helper: new daemon for caching index and related stuff
On Thu, 2016-03-10 at 18:17 +0700, Duy Nguyen wrote: > On Thu, Mar 10, 2016 at 6:21 AM, Junio C Hamano > wrote: > > Junio C Hamano writes: > > > > > David Turner writes: > > > > > > > From: Nguyễn Thái Ngọc Duy > > > > > > > > Instead of reading the index from disk and worrying about disk > > > > corruption, the index is cached in memory (memory bit-flips > > > > happen > > > > too, but hopefully less often). The result is faster read. Read > > > > time > > > > is reduced by 70%. > > > > > > > > The biggest gain is not having to verify the trailing SHA-1, > > > > which > > > > takes lots of time especially on large index files. > > > > Come to think of it, wouldn't it be far less intrusive change to > > just put the index on a ramdisk and skip the trailing SHA-1 > > verification, to obtain a similar result with the same trade off > > (i.e. blindly trusting memory instead of being paranoid)? > > I'm straying off-topic again because this reminded me about lmdb :) > For the record when I looked at lmdb I thought of using it as an > index > format too. It has everything we wanted in index-v5. Good enough that > we would not need index-helper and split-index to work around current > index format. Though I finally decided it didn't fit well. > > On the good side, lmdb is b+ trees, we can update directly on disk > (without rewriting whole file), read directly from mmap'd file and > not > worry about corrupting it. There's no SHA-1 verification either (and > we can do crc32 per entry if we want too). It's good. > > But, it does not fit well the our locking model (but we can change > this). And we sometimes want to create temporary throw-away indexes > (e.g. partial commits) which I don't think is easy to do. And the > reading directly from mmap does not give us much because we have to > do > byte endian conversion anyway. > > Hmm.. on second thought, even though lmdb can't be the default format > (for a bunch of other limitations), it can still be an option for > super big worktrees, just like index-helper being an optional > optimization. Hm.. hm.. if we can support lmdb as index format in > addition to the current format, bringing back some work from Thomas.. > We may still need a daemon or something to deal with watchman, but > the > underlying mechanism is exactly the same... David, Junio, what do you > think? LMDB makes writes faster, since we only have to write the stuff that's changed. That's good. Reads (ignoring SHA verification) will be slightly slower (due to the btree overhead). If, in general, we only had to read part of the index, that would be faster. But a fair amount of git code is written to assume that it is reasonable to iterate over the whole index. So I don't see a win from just switching to LMDB without other changes. (I guess we could parallelize index deserialization more easily with lmdb, but I don't know The original intent of the index was to be a "staging area" to build up a tree before committing it. This implies an alternate view of the index as a diff against some (possibly-empty) tree. An index diff or status operation, then, just requires iterating over the changes. I haven't really dug into merges to understand whether it would be reasonable for them to use a representation like this, and what that would do to speed there. In general, git takes the commit side of the commit-diff duality. This is generally a good thing, because commits are simpler to reason about. But I wonder if we could do better for this specific case. But I don't think switching to LMDB would replace index-helper. -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 04/19] index-helper: new daemon for caching index and related stuff
On Thu, Mar 10, 2016 at 6:21 AM, Junio C Hamano wrote: > Junio C Hamano writes: > >> David Turner writes: >> >>> From: Nguyễn Thái Ngọc Duy >>> >>> Instead of reading the index from disk and worrying about disk >>> corruption, the index is cached in memory (memory bit-flips happen >>> too, but hopefully less often). The result is faster read. Read time >>> is reduced by 70%. >>> >>> The biggest gain is not having to verify the trailing SHA-1, which >>> takes lots of time especially on large index files. > > Come to think of it, wouldn't it be far less intrusive change to > just put the index on a ramdisk and skip the trailing SHA-1 > verification, to obtain a similar result with the same trade off > (i.e. blindly trusting memory instead of being paranoid)? I'm straying off-topic again because this reminded me about lmdb :) For the record when I looked at lmdb I thought of using it as an index format too. It has everything we wanted in index-v5. Good enough that we would not need index-helper and split-index to work around current index format. Though I finally decided it didn't fit well. On the good side, lmdb is b+ trees, we can update directly on disk (without rewriting whole file), read directly from mmap'd file and not worry about corrupting it. There's no SHA-1 verification either (and we can do crc32 per entry if we want too). It's good. But, it does not fit well the our locking model (but we can change this). And we sometimes want to create temporary throw-away indexes (e.g. partial commits) which I don't think is easy to do. And the reading directly from mmap does not give us much because we have to do byte endian conversion anyway. Hmm.. on second thought, even though lmdb can't be the default format (for a bunch of other limitations), it can still be an option for super big worktrees, just like index-helper being an optional optimization. Hm.. hm.. if we can support lmdb as index format in addition to the current format, bringing back some work from Thomas.. We may still need a daemon or something to deal with watchman, but the underlying mechanism is exactly the same... David, Junio, what do you think? -- Duy -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 04/19] index-helper: new daemon for caching index and related stuff
On Wed, 2016-03-09 at 15:09 -0800, Junio C Hamano wrote: > David Turner writes: > > > From: Nguyễn Thái Ngọc Duy > > > > Instead of reading the index from disk and worrying about disk > > corruption, the index is cached in memory (memory bit-flips happen > > too, but hopefully less often). The result is faster read. Read > > time > > is reduced by 70%. > > > > The biggest gain is not having to verify the trailing SHA-1, which > > takes lots of time especially on large index files. But this also > > opens doors for further optimiztions: > > > > - we could create an in-memory format that's essentially the > > memory > >dump of the index to eliminate most of parsing/allocation > >overhead. The mmap'd memory can be used straight away. > > Experiment > >[1] shows we could reduce read time by 88%. > > > > - we could cache non-index info such as name hash > > > > The shared memory's name folows the template "git--" > > where is the trailing SHA-1 of the index file. is > > "index" for cached index files (and may be "name-hash" for name > > -hash > > cache). If such shared memory exists, it contains the same index > > content as on disk. The content is already validated by the daemon > > and > > git won't validate it again (except comparing the trailing SHA-1s). > > This indeed is an interesting approach; what is not explained but > must be is when the on-disk index is updated to reflect the reality > (if I am reading the explanation and the code right, while the > daemon is running, its in-core cache becomes the source of truth by > forcing everybody's read-index-from() to go to the daemon). The > explanation could be "this is only for read side, and updating the > index happens via the traditional 'write a new file and rename it to > the final place' codepath, at which time the daemon must be told to > re-read it." This seems like the explanation (from the current commit message): "Git can poke the daemon to tell it to refresh the index cache, or to keep it alive some more minutes via UNIX signals. It can't give any real index data directly to the daemon. Real data goes to disk first, then the daemon reads and verifies it from there. Poking only happens for $GIT_DIR/index, not temporary index files." I guess this could be rewritten as: "Index validity is ensured by the following method: When a read is requested from the index-helper, it checks the SHA1 of its cached index copy against the on-disk version. If they differ, index-helper rereads the index. In addition, any git process may explicitly suggest a reread via a UNIX signal, but this is only an optimization and it is not necessary for correctness. In addition, Git can signal the daemon with a heartbeat signal, to keep the daemon alive longer." How does that sound? -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 04/19] index-helper: new daemon for caching index and related stuff
On Wed, 2016-03-09 at 15:21 -0800, Junio C Hamano wrote: > Junio C Hamano writes: > > > David Turner writes: > > > > > From: Nguyễn Thái Ngọc Duy > > > > > > Instead of reading the index from disk and worrying about disk > > > corruption, the index is cached in memory (memory bit-flips > > > happen > > > too, but hopefully less often). The result is faster read. Read > > > time > > > is reduced by 70%. > > > > > > The biggest gain is not having to verify the trailing SHA-1, > > > which > > > takes lots of time especially on large index files. > > Come to think of it, wouldn't it be far less intrusive change to > just put the index on a ramdisk and skip the trailing SHA-1 > verification, to obtain a similar result with the same trade off > (i.e. blindly trusting memory instead of being paranoid)? > 1. If the index were stored on a ramdisk, we would *also* have to write it to durable storage to avoid losing the user's work when they power off. That's more complicated. 2. Duy notes that it is easier to add further optimizations to this -- for instance, we could share a pre-parsed version of the index. It would be harder to add these optimizations to the disk format, because (a) they would take up more space, and (b) they would probably be harder to write in a single pass, which is presently how index writing works. 3. If we wanted to just skip SHA-1 verification, we would not need a ramdisk; it's almost certain that the index would be in the disk cache most of the time, so regular storage should be very nearly as fast as a ramdisk. I think mlock might help ensure that the data remains in the cache, although I'm not sure what the permissions story is on that. -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 04/19] index-helper: new daemon for caching index and related stuff
Junio C Hamano writes: > David Turner writes: > >> From: Nguyễn Thái Ngọc Duy >> >> Instead of reading the index from disk and worrying about disk >> corruption, the index is cached in memory (memory bit-flips happen >> too, but hopefully less often). The result is faster read. Read time >> is reduced by 70%. >> >> The biggest gain is not having to verify the trailing SHA-1, which >> takes lots of time especially on large index files. Come to think of it, wouldn't it be far less intrusive change to just put the index on a ramdisk and skip the trailing SHA-1 verification, to obtain a similar result with the same trade off (i.e. blindly trusting memory instead of being paranoid)? -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 04/19] index-helper: new daemon for caching index and related stuff
David Turner writes: > From: Nguyễn Thái Ngọc Duy > > Instead of reading the index from disk and worrying about disk > corruption, the index is cached in memory (memory bit-flips happen > too, but hopefully less often). The result is faster read. Read time > is reduced by 70%. > > The biggest gain is not having to verify the trailing SHA-1, which > takes lots of time especially on large index files. But this also > opens doors for further optimiztions: > > - we could create an in-memory format that's essentially the memory >dump of the index to eliminate most of parsing/allocation >overhead. The mmap'd memory can be used straight away. Experiment >[1] shows we could reduce read time by 88%. > > - we could cache non-index info such as name hash > > The shared memory's name folows the template "git--" > where is the trailing SHA-1 of the index file. is > "index" for cached index files (and may be "name-hash" for name-hash > cache). If such shared memory exists, it contains the same index > content as on disk. The content is already validated by the daemon and > git won't validate it again (except comparing the trailing SHA-1s). This indeed is an interesting approach; what is not explained but must be is when the on-disk index is updated to reflect the reality (if I am reading the explanation and the code right, while the daemon is running, its in-core cache becomes the source of truth by forcing everybody's read-index-from() to go to the daemon). The explanation could be "this is only for read side, and updating the index happens via the traditional 'write a new file and rename it to the final place' codepath, at which time the daemon must be told to re-read it." -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 04/19] index-helper: new daemon for caching index and related stuff
From: Nguyễn Thái Ngọc Duy Instead of reading the index from disk and worrying about disk corruption, the index is cached in memory (memory bit-flips happen too, but hopefully less often). The result is faster read. Read time is reduced by 70%. The biggest gain is not having to verify the trailing SHA-1, which takes lots of time especially on large index files. But this also opens doors for further optimiztions: - we could create an in-memory format that's essentially the memory dump of the index to eliminate most of parsing/allocation overhead. The mmap'd memory can be used straight away. Experiment [1] shows we could reduce read time by 88%. - we could cache non-index info such as name hash The shared memory's name folows the template "git--" where is the trailing SHA-1 of the index file. is "index" for cached index files (and may be "name-hash" for name-hash cache). If such shared memory exists, it contains the same index content as on disk. The content is already validated by the daemon and git won't validate it again (except comparing the trailing SHA-1s). Git can poke the daemon to tell it to refresh the index cache, or to keep it alive some more minutes via UNIX signals. It can't give any real index data directly to the daemon. Real data goes to disk first, then the daemon reads and verifies it from there. Poking only happens for $GIT_DIR/index, not temporary index files. $GIT_DIR/index-helper.pid contains a reference to daemon process (and it's pid on *nix). The file's mtime is updated every time it's accessed (or should be updated often enough). Old index-helper.pid is considered stale and ignored. index-helper requires POSIX realtime extension. POSIX shm interface however is abstracted away so that Windows support can be added later. On webkit.git with index format v2, duplicating 8 times to 1.4m entries and 200MB in size: (vanilla) 0.986986364 s: read_index_from .git/index (index-helper) 0.267850279 s: read_index_from .git/index Interestingly with index v4, we get less out of index-helper. It makes sense as v4 requires more processing after loading the index: (vanilla) 0.72249 s: read_index_from .git/index (index-helper) 0.302741500 s: read_index_from .git/index Signed-off-by: Nguyễn Thái Ngọc Duy --- .gitignore | 1 + Documentation/git-index-helper.txt | 44 ++ Makefile | 9 +++ cache.h| 3 + config.mak.uname | 1 + git-compat-util.h | 1 + index-helper.c | 162 + read-cache.c | 106 +--- shm.c | 67 +++ shm.h | 23 ++ 10 files changed, 408 insertions(+), 9 deletions(-) create mode 100644 Documentation/git-index-helper.txt create mode 100644 index-helper.c create mode 100644 shm.c create mode 100644 shm.h diff --git a/.gitignore b/.gitignore index 5087ce1..b92f122 100644 --- a/.gitignore +++ b/.gitignore @@ -71,6 +71,7 @@ /git-http-fetch /git-http-push /git-imap-send +/git-index-helper /git-index-pack /git-init /git-init-db diff --git a/Documentation/git-index-helper.txt b/Documentation/git-index-helper.txt new file mode 100644 index 000..9db28cf --- /dev/null +++ b/Documentation/git-index-helper.txt @@ -0,0 +1,44 @@ +git-index-helper(1) +=== + +NAME + +git-index-helper - A simple cache daemon for speeding up index file access + +SYNOPSIS + +[verse] +'git index-helper' [options] + +DESCRIPTION +--- +Keep the index file in memory for faster access. This daemon is per +repository. + +OPTIONS +--- + +--exit-after=:: + Exit if the cached index is not accessed for `` + minutes. Specify 0 to wait forever. Default is 10. + +NOTES +- +On UNIX-like systems, $GIT_DIR/index-helper.pid contains the process +id of the daemon. At least on Linux, shared memory objects are +availble via /dev/shm with the name pattern "git--". +Normally the daemon will clean up shared memory objects when it exits. +But if it crashes, some objects could remain there and they can be +safely deleted with "rm" command. The following signals are used to +control the daemon: + +SIGHUP:: + Reread the index. + +SIGUSR1:: + Let the daemon know the index is to be read. It keeps the + daemon alive longer, unless `--exit-after=0` is used. + +GIT +--- +Part of the linkgit:git[1] suite diff --git a/Makefile b/Makefile index 24bef8d..a6c668b 100644 --- a/Makefile +++ b/Makefile @@ -367,6 +367,8 @@ all:: # Define HAVE_BSD_SYSCTL if your platform has a BSD-compatible sysctl function. # # Define HAVE_GETDELIM if your system has the getdelim() function. +# +# Define HAVE_SHM if you platform support shm_* functions in librt. GIT-VERSION-FILE: FORCE @$(SHELL_PATH) ./GIT-VERSION-GEN @@ -802,6 +804,7 @@ LI