Re: Watchman support for git
On Fri, May 16, 2014 at 2:42 AM, David Turner wrote: >> I assume you won't change your mind about this. Which is fine to me. I >> will still try out my approach with your libwatchman though. Just >> curious about its performance and complexity, compared to your >> approach. > > I am open-minded here. This code is really the first time I have looked > at git's internals, and I accept that your way might be better. If > you're going to try the watchman version of your approach, then we do a > direct comparison. Let me know if there is something I can do to help > out on that. You already helped by publishing your patches (and letting me know about libwatchman) so I can steal bits here and there ;-) >> A bit off topic, but msys fork has another "fscache" in compat/win32. >> If you could come up with a different name, maybe it'll be less >> confusing for them after merging. But this is not a big deal, as this >> fscache won't work on windows anyway. > > Does wtcache sounds like a better name? > Heh i'm bad at naming. But that sounds a bit cryptic. Maybe watchman-cache.c (with funtions starting with prefix wmc_)? Should not worry too much about this though, unless some msysgit guys yell up. -- Duy -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Watchman support for git
On Wed, 2014-05-14 at 17:36 +0700, Duy Nguyen wrote: > >> With that in > >> mind, I think you don't need to keep a fs cache on disk at all. All > >> you need to store (in the index) is the clock value from watchman. > >> After we parse the index, we perform a "since" query to get path names > >> (and perhaps "exists" and "mode" for later). Then we set CE_VALID bit > >> on entries that are _not_ in the query result. The remaining items > >> will be lstat'd by git (see [1] and read-cache.c changes on the next > >> few patches). Assuming the number of updated files is reasonably > >> small, we won't be punished by lookup time. > > > > I considered this, but rejected it for a few reasons: > > 1. We still need to know about the untracked files. I guess we could > > do an index extension for just that, like your untracked cache. > > Yes. But consider that the number of untracked files is usually small > (otherwise 'git status' would look very messy). And your fscache would > need to store excluded file list too, which could be a lot bigger (one > pair of .[ch] -> one .o). _But_ yours would make 'git status > --ignored' work. I don't consider that a major use case for > optimization, but people may have different opinions. I don't think --ignored is a major use case. But I do think it's nice to get as much mileage as possible out of a change like this. > > 2. That doesn't eliminate opendir/readdir, I think. Those are a major > > cost. I did not thoroughly review your patches from January/February, > > but I didn't notice anything in there about this part of dir.c. > > Also the cost of open(nonexistent .gitignore) to do ignore processing. > > Assuming untracked cache is in use, opendir/readdir is replaced with > lstat. And cheap lstat can be solved with watchman without storing > anything extra. I solve open(.gitignore) too by checking its stat data > with the one in index. If matches, I reuse the version in tree. This > does not necessarily make it cheaper, but it increases locality so it > might be. _Modified_ .gitignore files have to go through > open(.gitignore), but people don't update .gitignore often. Interesting -- if all that actually works, then it's probably the right approach. > > 3. Changing a file in the index (e.g. git add) requires clearing > > the CE_VALID bit; this means that third-party libraries (e.g. jgit) > > that change the git repo need to understand this extension for correct > > operation. Maybe that's just the nature of extensions, but it's not > > something that my present patch set requires. > > I don't store CE_VALID on disk. Instead I store a new bit CE_WATCHED. > Only after loading and validating against watchman that I turn > CE_WATCHED to CE_VALID in memory. So JGit, libgit2 are not confused. > > I assume you won't change your mind about this. Which is fine to me. I > will still try out my approach with your libwatchman though. Just > curious about its performance and complexity, compared to your > approach. I am open-minded here. This code is really the first time I have looked at git's internals, and I accept that your way might be better. If you're going to try the watchman version of your approach, then we do a direct comparison. Let me know if there is something I can do to help out on that. > A bit off topic, but msys fork has another "fscache" in compat/win32. > If you could come up with a different name, maybe it'll be less > confusing for them after merging. But this is not a big deal, as this > fscache won't work on windows anyway. Does wtcache sounds like a better name? -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Watchman support for git
One more thing I think we can solve with watchman is the racy timestamp issue (see 29e4d36 and 407c8eb). Basically if a file is updated within a second (and the system only supports timestamp granuarity down to a second) then git can't know if a file is changed by comparing stat data, so it compares content anyway. With watchman, I think we know that a file is updated or not and can disable racy test. How often this happens and whether it is worth supporting is something to think about. -- Duy -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Watchman support for git
On Wed, May 14, 2014 at 6:44 AM, David Turner wrote: >> I'm not so happy that git now needs to link to libjansson.so and >> libwatchman.so. I would load libwatchman.so at runtime only when >> core.usewatchman is on, but this is more of personal taste. > > I assume you mean something with dlopen? I don't really like that because > (a) nothing else in git works that way and > (b) you would still need the libwatchman headers to build git (or the > structs could be copied, but that is ugly). And because we don't link to libdl.so anyway, using dlopen means trading libjansson.so and libwatchman.so for libdl.so. And it's uglier. So forget what I wrote. >> With that in >> mind, I think you don't need to keep a fs cache on disk at all. All >> you need to store (in the index) is the clock value from watchman. >> After we parse the index, we perform a "since" query to get path names >> (and perhaps "exists" and "mode" for later). Then we set CE_VALID bit >> on entries that are _not_ in the query result. The remaining items >> will be lstat'd by git (see [1] and read-cache.c changes on the next >> few patches). Assuming the number of updated files is reasonably >> small, we won't be punished by lookup time. > > I considered this, but rejected it for a few reasons: > 1. We still need to know about the untracked files. I guess we could > do an index extension for just that, like your untracked cache. Yes. But consider that the number of untracked files is usually small (otherwise 'git status' would look very messy). And your fscache would need to store excluded file list too, which could be a lot bigger (one pair of .[ch] -> one .o). _But_ yours would make 'git status --ignored' work. I don't consider that a major use case for optimization, but people may have different opinions. > 2. That doesn't eliminate opendir/readdir, I think. Those are a major > cost. I did not thoroughly review your patches from January/February, > but I didn't notice anything in there about this part of dir.c. > Also the cost of open(nonexistent .gitignore) to do ignore processing. Assuming untracked cache is in use, opendir/readdir is replaced with lstat. And cheap lstat can be solved with watchman without storing anything extra. I solve open(.gitignore) too by checking its stat data with the one in index. If matches, I reuse the version in tree. This does not necessarily make it cheaper, but it increases locality so it might be. _Modified_ .gitignore files have to go through open(.gitignore), but people don't update .gitignore often. > 3. Changing a file in the index (e.g. git add) requires clearing > the CE_VALID bit; this means that third-party libraries (e.g. jgit) > that change the git repo need to understand this extension for correct > operation. Maybe that's just the nature of extensions, but it's not > something that my present patch set requires. I don't store CE_VALID on disk. Instead I store a new bit CE_WATCHED. Only after loading and validating against watchman that I turn CE_WATCHED to CE_VALID in memory. So JGit, libgit2 are not confused. I assume you won't change your mind about this. Which is fine to me. I will still try out my approach with your libwatchman though. Just curious about its performance and complexity, compared to your approach. A bit off topic, but msys fork has another "fscache" in compat/win32. If you could come up with a different name, maybe it'll be less confusing for them after merging. But this is not a big deal, as this fscache won't work on windows anyway. -- Duy -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Watchman support for git
On Sat, 2014-05-10 at 15:16 +0700, Duy Nguyen wrote: > On Sat, May 3, 2014 at 6:14 AM, wrote: > > The most sigificant patch uses Facebook's watchman daemon[1] to monitor > > the repository work tree for changes. This makes allows git status > > to avoid traversing the entire work tree to find changes. > > Some comments on this series. I still haven't been able to run it with > watchman so there are many guesses from my side. > > First of all, when I set USE_WATCHMAN=Yes in config.mak I expect it to > work out of the box, provided that all dependencies are installed. I > still need to set WATCHMAN_LIBS for it to build because you only set > it with configure script. Would be nice to have a default value for > non-configure people too. I'll fix that, thanks. > I'm not so happy that git now needs to link to libjansson.so and > libwatchman.so. I would load libwatchman.so at runtime only when > core.usewatchman is on, but this is more of personal taste. I assume you mean something with dlopen? I don't really like that because (a) nothing else in git works that way and (b) you would still need the libwatchman headers to build git (or the structs could be copied, but that is ugly). > I still prefer my old tracking model, where the majority of lstat() is > done by refresh operation and we only need to optimize those lstat > calls, not every single lstat statement in the code base. The lstat calls are only one of the problems. The others are opendir/readdir and open(.gitignore). > With that in > mind, I think you don't need to keep a fs cache on disk at all. All > you need to store (in the index) is the clock value from watchman. > After we parse the index, we perform a "since" query to get path names > (and perhaps "exists" and "mode" for later). Then we set CE_VALID bit > on entries that are _not_ in the query result. The remaining items > will be lstat'd by git (see [1] and read-cache.c changes on the next > few patches). Assuming the number of updated files is reasonably > small, we won't be punished by lookup time. I considered this, but rejected it for a few reasons: 1. We still need to know about the untracked files. I guess we could do an index extension for just that, like your untracked cache. 2. That doesn't eliminate opendir/readdir, I think. Those are a major cost. I did not thoroughly review your patches from January/February, but I didn't notice anything in there about this part of dir.c. Also the cost of open(nonexistent .gitignore) to do ignore processing. 3. Changing a file in the index (e.g. git add) requires clearing the CE_VALID bit; this means that third-party libraries (e.g. jgit) that change the git repo need to understand this extension for correct operation. Maybe that's just the nature of extensions, but it's not something that my present patch set requires. -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Watchman support for git
On Wed, 2014-05-14 at 05:54 +0700, Duy Nguyen wrote: > On Wed, May 14, 2014 at 5:38 AM, David Turner > wrote: > > On Mon, 2014-05-12 at 17:45 +0700, Duy Nguyen wrote: > >> This is your quote from above, moved down a bit: > >> > >> > update_fs_cache should only have to update based on what it has learned > >> > from watchman. So if no .gitignore has been changed, it should not have > >> > to do very much work. > >> > > >> > I could take the fe_excluded check and move it above the > >> > last_exclude_matching check in fs_cache_is_excluded; it causes t7300 to > >> > fail when run under watchman but presumably that's fixable > >> > >> So you exclude files early and make the real read_directory() pass do > >> pretty much nothing. This is probably not a good idea. Assume that I > >> touch $TOP/.gitignore then do something other than "git status" (or > >> "git add") then I have to pay read_directory() cost. > > > > I'm not sure I understand this. read_directory does something: it checks > > the fs_cache (instead of the filesystem) for untracked files. > > A lot of commands do read_cache() that that eventually calls > update_fs_cache, which does part of read_directory's work (the > fe_excluded thing). But not many of those commands actually call > read_directory(). It'd better if there's a way to mark that "this > .gitignore is changed", but delay the actual exclude processsing until > we are sure read_directory() will be used. OK, that would be straighforward, I think. -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Watchman support for git
On Wed, May 14, 2014 at 5:38 AM, David Turner wrote: > On Mon, 2014-05-12 at 17:45 +0700, Duy Nguyen wrote: >> This is your quote from above, moved down a bit: >> >> > update_fs_cache should only have to update based on what it has learned >> > from watchman. So if no .gitignore has been changed, it should not have >> > to do very much work. >> > >> > I could take the fe_excluded check and move it above the >> > last_exclude_matching check in fs_cache_is_excluded; it causes t7300 to >> > fail when run under watchman but presumably that's fixable >> >> So you exclude files early and make the real read_directory() pass do >> pretty much nothing. This is probably not a good idea. Assume that I >> touch $TOP/.gitignore then do something other than "git status" (or >> "git add") then I have to pay read_directory() cost. > > I'm not sure I understand this. read_directory does something: it checks > the fs_cache (instead of the filesystem) for untracked files. A lot of commands do read_cache() that that eventually calls update_fs_cache, which does part of read_directory's work (the fe_excluded thing). But not many of those commands actually call read_directory(). It'd better if there's a way to mark that "this .gitignore is changed", but delay the actual exclude processsing until we are sure read_directory() will be used. -- Duy -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Watchman support for git
On Mon, 2014-05-12 at 17:45 +0700, Duy Nguyen wrote: > This is your quote from above, moved down a bit: > > > update_fs_cache should only have to update based on what it has learned > > from watchman. So if no .gitignore has been changed, it should not have > > to do very much work. > > > > I could take the fe_excluded check and move it above the > > last_exclude_matching check in fs_cache_is_excluded; it causes t7300 to > > fail when run under watchman but presumably that's fixable > > So you exclude files early and make the real read_directory() pass do > pretty much nothing. This is probably not a good idea. Assume that I > touch $TOP/.gitignore then do something other than "git status" (or > "git add") then I have to pay read_directory() cost. I'm not sure I understand this. read_directory does something: it checks the fs_cache (instead of the filesystem) for untracked files. > Back to the open vs fs_cache_open and the number of .gitignore files > above. I touch $TOP/.gitignore then do "git status" to make it read > all .gitignore files (6k of them) and change between open and > fs_cache_open. I think the numbers still do not make any visible > difference (~1620-1630ms). Yes, I would expect no win in that case. fs_cache_open will only save time in the common case where there is no .gitignore file, because it saves an open() call. If every possible .gitignore file exists, of course it makes no difference. But also, your processor may be sufficiently slow that the context-switch penalty for open() is less than the hash table lookup. For me, the win from fs_cache_open is about 7%. -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Watchman support for git
On Mon, May 12, 2014 at 5:56 AM, David Turner wrote: >> So without watchman I got >> >>299.871ms read_index_from:1538 if (verify_hdr(hdr, mmap_size) < 0) go >>498.205ms cmd_status:1300 refresh_index(&the_index, REFRESH_QUIE >>796.050ms wt_status_collect:622 wt_status_collect_untracked(s) >> >> and with watchman ("git status" ran several times to make sure it's cached) >> >>301.950ms read_index_from:1538 if (verify_hdr(hdr, mmap_size) < 0) go >> 34.918ms read_fs_cache:347 if (verify_hdr(hdr, mmap_size) < 0) go >> 1564.096ms watchman_load_fs_cache:628 update_fs_cache(istate, result); >>161.930ms cmd_status:1300 refresh_index(&the_index, REFRESH_QUIE >>251.614ms wt_status_collect:622 wt_status_collect_untracked(s) >> >> Given the total time of "git status" without watchman is 1.9s,, >> update_fs_cache() nearly matches that number alone. All that is spent >> in the exclude update code in the function, but if you do >> last_excluding_matching() anyway, why cache it? > > My numbers are different (for my test repository): > > --- > 30.031ms read_index:1386 r = read_index_from(istate, get_index_ > 71.625ms cmd_status:1302 refresh_index(&the_index, REFRESH_QUIE >259.712ms wt_status_collect:622 wt_status_collect_untracked(s) > > 41.110ms read_index:1386 r = read_index_from(istate, get_index_ > 9.294ms read_fs_cache:347 if (verify_hdr(hdr, mmap_size) < 0) go > 0.173ms watchman_load_fs_cache:628 update_fs_cache(istate, result) > 41.901ms read_index:1386 r = read_index_from(istate, get_index_ > 18.355ms cmd_status:1302 refresh_index(&the_index, REFRESH_QUIE > 50.911ms wt_status_collect:622 wt_status_collect_untracked(s) > --- > > I think something must be going wrong with update_fs_cache on your > machine. I have a few hypotheses: > > 1. Maybe watchman isn't fully started up when you run your tests. > 2. Maybe there is a bug. It's probably me doing something wrong (I ran it more than a couple times so watchman must have loaded the whole thing). I got small numbers in update_fs_cache() now. >> A bit surprised about wt_status_collect_untracked number. I verified >> that last_excluding_matching() still runs (on the same number of >> entries like in no-watchman case). Replacing fs_cache_open() in >> add_excludes_from_file_to_list() to plain open() does not change the >> number, so we probably won't need that (unless your worktree is filled >> with .gitignore, which I doubt it's a norm). > > My test repo has a couple hundred of them. Maybe that's unusual? A > repo with a lot of projects will tend to have lots of gitignore files, > because each project will want to maintain them independently. I tried the worst case, every directory had an empty .gitignore. The numbers did not change much. And I found out that because add_excludes.. were called twice, not on every .gitignore because of the condition "!(dir->flags & DIR_EXCLUDE_CMDL_ONLY)". So the number of .gitignore does not matter (yet). This is your quote from above, moved down a bit: > update_fs_cache should only have to update based on what it has learned > from watchman. So if no .gitignore has been changed, it should not have > to do very much work. > > I could take the fe_excluded check and move it above the > last_exclude_matching check in fs_cache_is_excluded; it causes t7300 to > fail when run under watchman but presumably that's fixable So you exclude files early and make the real read_directory() pass do pretty much nothing. This is probably not a good idea. Assume that I touch $TOP/.gitignore then do something other than "git status" (or "git add") then I have to pay read_directory() cost. Back to the open vs fs_cache_open and the number of .gitignore files above. I touch $TOP/.gitignore then do "git status" to make it read all .gitignore files (6k of them) and change between open and fs_cache_open. I think the numbers still do not make any visible difference (~1620-1630ms). -- Duy -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Watchman support for git
On Sun, 2014-05-11 at 07:21 +0700, Duy Nguyen wrote: > On Sun, May 11, 2014 at 1:38 AM, David Turner > wrote: > >> I got "warning: Watchman watch error: Got bad JSON from watchman > >> get-sockname: '[' or '{' expected near end of file". Any ideas what I > >> did wrong? I'm using watchman.git and libwatchman.git. check-0.9.11 > >> and jansson-2.4 were installed by system (gentoo). > > > > What do you get from watchman get-sockname on the command-line? Do the > > watchman tests pass? > > Found the problem. "watchman" binary is not in $PATH but popen() did > not complain (or it did but your "2>/dev/null" in watchman_connect > suppressed it). I should probably not be using popen, since it doesn't offer good error reporting. I'll try to fix that in the next few days. > BTW you need to update the array size of "expressions" > in test_watchman_misc(). Thanks, fixed. > So without watchman I got > >299.871ms read_index_from:1538 if (verify_hdr(hdr, mmap_size) < 0) go >498.205ms cmd_status:1300 refresh_index(&the_index, REFRESH_QUIE >796.050ms wt_status_collect:622 wt_status_collect_untracked(s) > > and with watchman ("git status" ran several times to make sure it's cached) > >301.950ms read_index_from:1538 if (verify_hdr(hdr, mmap_size) < 0) go > 34.918ms read_fs_cache:347 if (verify_hdr(hdr, mmap_size) < 0) go > 1564.096ms watchman_load_fs_cache:628 update_fs_cache(istate, result); >161.930ms cmd_status:1300 refresh_index(&the_index, REFRESH_QUIE >251.614ms wt_status_collect:622 wt_status_collect_untracked(s) > > Given the total time of "git status" without watchman is 1.9s,, > update_fs_cache() nearly matches that number alone. All that is spent > in the exclude update code in the function, but if you do > last_excluding_matching() anyway, why cache it? My numbers are different (for my test repository): --- 30.031ms read_index:1386 r = read_index_from(istate, get_index_ 71.625ms cmd_status:1302 refresh_index(&the_index, REFRESH_QUIE 259.712ms wt_status_collect:622 wt_status_collect_untracked(s) 41.110ms read_index:1386 r = read_index_from(istate, get_index_ 9.294ms read_fs_cache:347 if (verify_hdr(hdr, mmap_size) < 0) go 0.173ms watchman_load_fs_cache:628 update_fs_cache(istate, result) 41.901ms read_index:1386 r = read_index_from(istate, get_index_ 18.355ms cmd_status:1302 refresh_index(&the_index, REFRESH_QUIE 50.911ms wt_status_collect:622 wt_status_collect_untracked(s) --- I think something must be going wrong with update_fs_cache on your machine. I have a few hypotheses: 1. Maybe watchman isn't fully started up when you run your tests. 2. Maybe there is a bug. update_fs_cache should only have to update based on what it has learned from watchman. So if no .gitignore has been changed, it should not have to do very much work. I could take the fe_excluded check and move it above the last_exclude_matching check in fs_cache_is_excluded; it causes t7300 to fail when run under watchman but presumably that's fixable > I think we're paying lookup cost in refresh_index(). I forced CE_VALID > bit on as an experiment and refresh_index() went down to 33ms. Yes, it is possible to use Watchman to set CE_VALID as well, and I should probably do that. It's a bit more complicated (at least, I recall running into problems), but probably doable. > A bit surprised about wt_status_collect_untracked number. I verified > that last_excluding_matching() still runs (on the same number of > entries like in no-watchman case). Replacing fs_cache_open() in > add_excludes_from_file_to_list() to plain open() does not change the > number, so we probably won't need that (unless your worktree is filled > with .gitignore, which I doubt it's a norm). My test repo has a couple hundred of them. Maybe that's unusual? A repo with a lot of projects will tend to have lots of gitignore files, because each project will want to maintain them independently. -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Watchman support for git
On Sun, May 11, 2014 at 1:38 AM, David Turner wrote: >> I got "warning: Watchman watch error: Got bad JSON from watchman >> get-sockname: '[' or '{' expected near end of file". Any ideas what I >> did wrong? I'm using watchman.git and libwatchman.git. check-0.9.11 >> and jansson-2.4 were installed by system (gentoo). > > What do you get from watchman get-sockname on the command-line? Do the > watchman tests pass? Found the problem. "watchman" binary is not in $PATH but popen() did not complain (or it did but your "2>/dev/null" in watchman_connect suppressed it). BTW you need to update the array size of "expressions" in test_watchman_misc(). So without watchman I got 299.871ms read_index_from:1538 if (verify_hdr(hdr, mmap_size) < 0) go 498.205ms cmd_status:1300 refresh_index(&the_index, REFRESH_QUIE 796.050ms wt_status_collect:622 wt_status_collect_untracked(s) and with watchman ("git status" ran several times to make sure it's cached) 301.950ms read_index_from:1538 if (verify_hdr(hdr, mmap_size) < 0) go 34.918ms read_fs_cache:347 if (verify_hdr(hdr, mmap_size) < 0) go 1564.096ms watchman_load_fs_cache:628 update_fs_cache(istate, result); 161.930ms cmd_status:1300 refresh_index(&the_index, REFRESH_QUIE 251.614ms wt_status_collect:622 wt_status_collect_untracked(s) Given the total time of "git status" without watchman is 1.9s,, update_fs_cache() nearly matches that number alone. All that is spent in the exclude update code in the function, but if you do last_excluding_matching() anyway, why cache it? I think we're paying lookup cost in refresh_index(). I forced CE_VALID bit on as an experiment and refresh_index() went down to 33ms. A bit surprised about wt_status_collect_untracked number. I verified that last_excluding_matching() still runs (on the same number of entries like in no-watchman case). Replacing fs_cache_open() in add_excludes_from_file_to_list() to plain open() does not change the number, so we probably won't need that (unless your worktree is filled with .gitignore, which I doubt it's a norm). So about 500ms saved is in opendir/readdir.. A simple/separate opendir/readdir/closedir program confirms that (ugh!). -- Duy -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Watchman support for git
On Sat, 2014-05-10 at 12:26 +0700, Duy Nguyen wrote: > On Sat, May 3, 2014 at 6:14 AM, wrote: > > The most sigificant patch uses Facebook's watchman daemon[1] to monitor > > the repository work tree for changes. This makes allows git status > > to avoid traversing the entire work tree to find changes. > > I got "warning: Watchman watch error: Got bad JSON from watchman > get-sockname: '[' or '{' expected near end of file". Any ideas what I > did wrong? I'm using watchman.git and libwatchman.git. check-0.9.11 > and jansson-2.4 were installed by system (gentoo). What do you get from watchman get-sockname on the command-line? Do the watchman tests pass? I don't know that I have tested with Jansson 2.4, but I've tried 2.2.1, 2.5, and 2.6. I also haven't tried on gentoo (I use Ubuntu). But if you still have problems, I can try installing gentoo in a vm and seeing if it's something gentoo-specific. -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Watchman support for git
On Sat, May 3, 2014 at 6:14 AM, wrote: > The most sigificant patch uses Facebook's watchman daemon[1] to monitor > the repository work tree for changes. This makes allows git status > to avoid traversing the entire work tree to find changes. Some comments on this series. I still haven't been able to run it with watchman so there are many guesses from my side. First of all, when I set USE_WATCHMAN=Yes in config.mak I expect it to work out of the box, provided that all dependencies are installed. I still need to set WATCHMAN_LIBS for it to build because you only set it with configure script. Would be nice to have a default value for non-configure people too. I'm not so happy that git now needs to link to libjansson.so and libwatchman.so. I would load libwatchman.so at runtime only when core.usewatchman is on, but this is more of personal taste. I still prefer my old tracking model, where the majority of lstat() is done by refresh operation and we only need to optimize those lstat calls, not every single lstat statement in the code base. With that in mind, I think you don't need to keep a fs cache on disk at all. All you need to store (in the index) is the clock value from watchman. After we parse the index, we perform a "since" query to get path names (and perhaps "exists" and "mode" for later). Then we set CE_VALID bit on entries that are _not_ in the query result. The remaining items will be lstat'd by git (see [1] and read-cache.c changes onthe next few patches). Assuming the number of updated files is reasonably small, we won't be punished by lookup time. To avoid write time cost due to possible mass CE_VALID change, assuming split-index is already used, we could store this bit separately as an extension and apply it back at read time. Your changes in dir.c. If I didn't misread it, you still do last_exclude_matching when using fs-cache. That call is where all your CPU is spent and probably explains why you didn't see big perf gain with watchman. I think my untracked cache has dealt correctly with caching in this area. So when you do the watchman query earlier, you also check if any of the directories are updated and force using untracked cache for the rest of the directories. Hope it helps, Duy [1] http://thread.gmane.org/gmane.comp.version-control.git/240339 -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Watchman support for git
On Sat, May 3, 2014 at 6:14 AM, wrote: > The most sigificant patch uses Facebook's watchman daemon[1] to monitor > the repository work tree for changes. This makes allows git status > to avoid traversing the entire work tree to find changes. I got "warning: Watchman watch error: Got bad JSON from watchman get-sockname: '[' or '{' expected near end of file". Any ideas what I did wrong? I'm using watchman.git and libwatchman.git. check-0.9.11 and jansson-2.4 were installed by system (gentoo). -- Duy -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Watchman support for git
On Fri, 2014-05-09 at 11:27 -0700, David Lang wrote: > > > > That's not my understanding from Durham Goode's talk in January. Yes, > > operations involving history go to the server. But the client also > > maintains a copy of the working tree, and it is for this that watchman > > is used. Otherwise, why bother with watchman at all? The server knows > > when it changes files and could simply maintain its own index of what's > > changed. Watchman is built on inotify/fsevents -- it doesn't have > > anything to do with any sort of storage device beyond a vanilla hard > > drive. > > When you have such a massive repo, your clients aren't storing the data on > their > local drives, they are accessing the data on a network attached storage (via > NFS > or through a fuse mount). So they can have their watchman send queries to the > storage server to find out what has changed in this massive repo rather than > having to walk the directory tree (or try to monitor it for changes on the > client machine) Yes, you could do that. But I repeat: that is not what Facebook is actually doing. If it were, they would have no need for inotify or FSEvents, since neither even works with NFS (and a FUSE-based solution would need special support at which point it might as well just implement watchman itself). Please see this section of Durham Goode's talk for evidence that Facebook is not doing that: http://www.youtube.com/watch?v=Dlguc63cRXg&t=10m40s (it's at 10:40 in case that link doesn't work) Maybe things have changed since at Facebook since Durham gave that talk in January and you have knowledge of that. If so, say so. But from what Facebook has said publicly, this is simply not what's going on. -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Watchman support for git
On Fri, 9 May 2014, David Turner wrote: On Fri, 2014-05-09 at 11:08 -0700, David Lang wrote: On Fri, 9 May 2014, David Turner wrote: On Fri, 2014-05-09 at 00:08 -0700, David Lang wrote: On Thu, 8 May 2014, Sebastian Schuberth wrote: On 03.05.2014 05:40, Felipe Contreras wrote: That's very interesting. Do you get similar improvements when doing something similar in Merurial (watchman vs . no watchman). I have not tried it. My understanding is that this is why Facebook wrote Watchman and added support for it to Mercurial, so I would assume that the improvements are at least this good. Yeah, my bet is that they are actually much better (because Mercurial can't be so optimized as Git). I'm interested in this number because if watchman in Git is improving it by 30%, but in Mercurial it's improving it by 100% (made up number), therefore it makes sens that you might want it more if you are using hg, but not so much if you are using git. Also, if similar repositories with Mercurial+watchman are actually faster than Git+watchman, that means that there's room for improvement in your implementation. This is not a big issue at this point of the process, just something nice to know. The article at [1] has some details, they claim "For our repository, enabling Watchman integration has made Mercurial's status command more than 5x faster than Git's status command". [1] https://code.facebook.com/posts/218678814984400/scaling-mercurial-at-facebook/ a lot of that speed comparison is going to depend on your storage system and the size of your repository. if you have a high-end enterprise storage system that tracks metadata very differently from the file contents (I've seen some that have rackes worth of SATA drives for contents and then 'small' arrays of a few dozen flash drives for the metadata), and then you have very large repositories (Facebook has everything in a single repo), then you have a perfect storm where something like watchman that talks the proprietary protocol of the storage array can be FAR faster than anything that needs to operate with the standard POSIX calls. That can easily account for the difference between the facebook announcement and the results presented for normal disks that show an improvement, but with even stock git being faster than improved mercurial. As I recall from Facebook's presentation[1] on this (as well as from the discussion on the git mailing list[2]), Facebook's test respository is much larger than any known git repository. In particular, it is larger than WebKit. agreed, it's huge, it's the entire codebase history of every tool that they use crammed together in one rep These performance improvements are not for server-side tasks, but for client-side (e.g. git/hg status). Facebook also made other improvements for the client-server communication, and for log/blame, but these are not relevant to watchman. well, in their situation they have shared storage that clients use for this huge repo, so I don't think they have a clear client/server boundry the way you are thinking. Even clients have this huge repo to deal with, and they can do so efficiently by querying the storage device rather than trying to walk the tree or monitor access directly. That's not my understanding from Durham Goode's talk in January. Yes, operations involving history go to the server. But the client also maintains a copy of the working tree, and it is for this that watchman is used. Otherwise, why bother with watchman at all? The server knows when it changes files and could simply maintain its own index of what's changed. Watchman is built on inotify/fsevents -- it doesn't have anything to do with any sort of storage device beyond a vanilla hard drive. When you have such a massive repo, your clients aren't storing the data on their local drives, they are accessing the data on a network attached storage (via NFS or through a fuse mount). So they can have their watchman send queries to the storage server to find out what has changed in this massive repo rather than having to walk the directory tree (or try to monitor it for changes on the client machine) David Lang -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Watchman support for git
On Fri, 2014-05-09 at 11:08 -0700, David Lang wrote: > On Fri, 9 May 2014, David Turner wrote: > > > On Fri, 2014-05-09 at 00:08 -0700, David Lang wrote: > >> On Thu, 8 May 2014, Sebastian Schuberth wrote: > >> > >>> On 03.05.2014 05:40, Felipe Contreras wrote: > >>> > >> That's very interesting. Do you get similar improvements when doing > >> something similar in Merurial (watchman vs . no watchman). > > > > I have not tried it. My understanding is that this is why Facebook > > wrote Watchman and added support for it to Mercurial, so I would assume > > that the improvements are at least this good. > > Yeah, my bet is that they are actually much better (because Mercurial > can't be so optimized as Git). > > I'm interested in this number because if watchman in Git is improving it > by 30%, but in Mercurial it's improving it by 100% (made up number), > therefore it makes sens that you might want it more if you are using hg, > but not so much if you are using git. > > Also, if similar repositories with Mercurial+watchman are actually > faster than Git+watchman, that means that there's room for improvement > in your implementation. This is not a big issue at this point of the > process, just something nice to know. > >>> > >>> The article at [1] has some details, they claim "For our repository, > >>> enabling Watchman integration has made Mercurial's status command more > >>> than 5x faster than Git's status command". > >>> > >>> [1] > >>> https://code.facebook.com/posts/218678814984400/scaling-mercurial-at-facebook/ > >> > >> a lot of that speed comparison is going to depend on your storage system > >> and the > >> size of your repository. > >> > >> if you have a high-end enterprise storage system that tracks metadata very > >> differently from the file contents (I've seen some that have rackes worth > >> of > >> SATA drives for contents and then 'small' arrays of a few dozen flash > >> drives for > >> the metadata), and then you have very large repositories (Facebook has > >> everything in a single repo), then you have a perfect storm where > >> something like > >> watchman that talks the proprietary protocol of the storage array can be > >> FAR > >> faster than anything that needs to operate with the standard POSIX calls. > >> > >> That can easily account for the difference between the facebook > >> announcement and > >> the results presented for normal disks that show an improvement, but with > >> even > >> stock git being faster than improved mercurial. > > > > As I recall from Facebook's presentation[1] on this (as well as from the > > discussion on the git mailing list[2]), Facebook's test respository is > > much larger than any known git repository. In particular, it is larger > > than WebKit. > > agreed, it's huge, it's the entire codebase history of every tool that they > use > crammed together in one rep > > > These performance improvements are not for server-side > > tasks, but for client-side (e.g. git/hg status). Facebook also made > > other improvements for the client-server communication, and for > > log/blame, but these are not relevant to watchman. > > well, in their situation they have shared storage that clients use for this > huge > repo, so I don't think they have a clear client/server boundry the way you > are > thinking. Even clients have this huge repo to deal with, and they can do so > efficiently by querying the storage device rather than trying to walk the > tree > or monitor access directly. That's not my understanding from Durham Goode's talk in January. Yes, operations involving history go to the server. But the client also maintains a copy of the working tree, and it is for this that watchman is used. Otherwise, why bother with watchman at all? The server knows when it changes files and could simply maintain its own index of what's changed. Watchman is built on inotify/fsevents -- it doesn't have anything to do with any sort of storage device beyond a vanilla hard drive. -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Watchman support for git
On Fri, 9 May 2014, David Turner wrote: On Fri, 2014-05-09 at 00:08 -0700, David Lang wrote: On Thu, 8 May 2014, Sebastian Schuberth wrote: On 03.05.2014 05:40, Felipe Contreras wrote: That's very interesting. Do you get similar improvements when doing something similar in Merurial (watchman vs . no watchman). I have not tried it. My understanding is that this is why Facebook wrote Watchman and added support for it to Mercurial, so I would assume that the improvements are at least this good. Yeah, my bet is that they are actually much better (because Mercurial can't be so optimized as Git). I'm interested in this number because if watchman in Git is improving it by 30%, but in Mercurial it's improving it by 100% (made up number), therefore it makes sens that you might want it more if you are using hg, but not so much if you are using git. Also, if similar repositories with Mercurial+watchman are actually faster than Git+watchman, that means that there's room for improvement in your implementation. This is not a big issue at this point of the process, just something nice to know. The article at [1] has some details, they claim "For our repository, enabling Watchman integration has made Mercurial's status command more than 5x faster than Git's status command". [1] https://code.facebook.com/posts/218678814984400/scaling-mercurial-at-facebook/ a lot of that speed comparison is going to depend on your storage system and the size of your repository. if you have a high-end enterprise storage system that tracks metadata very differently from the file contents (I've seen some that have rackes worth of SATA drives for contents and then 'small' arrays of a few dozen flash drives for the metadata), and then you have very large repositories (Facebook has everything in a single repo), then you have a perfect storm where something like watchman that talks the proprietary protocol of the storage array can be FAR faster than anything that needs to operate with the standard POSIX calls. That can easily account for the difference between the facebook announcement and the results presented for normal disks that show an improvement, but with even stock git being faster than improved mercurial. As I recall from Facebook's presentation[1] on this (as well as from the discussion on the git mailing list[2]), Facebook's test respository is much larger than any known git repository. In particular, it is larger than WebKit. agreed, it's huge, it's the entire codebase history of every tool that they use crammed together in one rep These performance improvements are not for server-side tasks, but for client-side (e.g. git/hg status). Facebook also made other improvements for the client-server communication, and for log/blame, but these are not relevant to watchman. well, in their situation they have shared storage that clients use for this huge repo, so I don't think they have a clear client/server boundry the way you are thinking. Even clients have this huge repo to deal with, and they can do so efficiently by querying the storage device rather than trying to walk the tree or monitor access directly. It is entirely possible that, as repo size grows, Mercurial with watchman is faster than git without. With my patches, git status isn't constant-time; it's merely a roughly constant factor faster. My initial design was to make git status constant-time by caching the results of the wt_status_collect calls. This is what you would have to do with traditional storage. My understanding is that the real benefits of watchman show up when you have non-traditional storage and can take advantage of the knowledge that the storage system gathers for it's own use. David Lang But there were so many cases with the various options that I got a bit lost in the wilderness and made a big mess. Maybe I would do better if I tried it again today. And maybe if I just build on top of the untracked-cache code, I would be able to get to constant-time; I'll have to try that at some point. [1] http://www.youtube.com/watch?v=Dlguc63cRXg [2] http://comments.gmane.org/gmane.comp.version-control.git/189776 -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Watchman support for git
On Fri, 2014-05-09 at 00:08 -0700, David Lang wrote: > On Thu, 8 May 2014, Sebastian Schuberth wrote: > > > On 03.05.2014 05:40, Felipe Contreras wrote: > > > That's very interesting. Do you get similar improvements when doing > something similar in Merurial (watchman vs . no watchman). > >>> > >>> I have not tried it. My understanding is that this is why Facebook > >>> wrote Watchman and added support for it to Mercurial, so I would assume > >>> that the improvements are at least this good. > >> > >> Yeah, my bet is that they are actually much better (because Mercurial > >> can't be so optimized as Git). > >> > >> I'm interested in this number because if watchman in Git is improving it > >> by 30%, but in Mercurial it's improving it by 100% (made up number), > >> therefore it makes sens that you might want it more if you are using hg, > >> but not so much if you are using git. > >> > >> Also, if similar repositories with Mercurial+watchman are actually > >> faster than Git+watchman, that means that there's room for improvement > >> in your implementation. This is not a big issue at this point of the > >> process, just something nice to know. > > > > The article at [1] has some details, they claim "For our repository, > > enabling Watchman integration has made Mercurial's status command more than > > 5x faster than Git's status command". > > > > [1] > > https://code.facebook.com/posts/218678814984400/scaling-mercurial-at-facebook/ > > a lot of that speed comparison is going to depend on your storage system and > the > size of your repository. > > if you have a high-end enterprise storage system that tracks metadata very > differently from the file contents (I've seen some that have rackes worth of > SATA drives for contents and then 'small' arrays of a few dozen flash drives > for > the metadata), and then you have very large repositories (Facebook has > everything in a single repo), then you have a perfect storm where something > like > watchman that talks the proprietary protocol of the storage array can be FAR > faster than anything that needs to operate with the standard POSIX calls. > > That can easily account for the difference between the facebook announcement > and > the results presented for normal disks that show an improvement, but with > even > stock git being faster than improved mercurial. As I recall from Facebook's presentation[1] on this (as well as from the discussion on the git mailing list[2]), Facebook's test respository is much larger than any known git repository. In particular, it is larger than WebKit. These performance improvements are not for server-side tasks, but for client-side (e.g. git/hg status). Facebook also made other improvements for the client-server communication, and for log/blame, but these are not relevant to watchman. It is entirely possible that, as repo size grows, Mercurial with watchman is faster than git without. With my patches, git status isn't constant-time; it's merely a roughly constant factor faster. My initial design was to make git status constant-time by caching the results of the wt_status_collect calls. But there were so many cases with the various options that I got a bit lost in the wilderness and made a big mess. Maybe I would do better if I tried it again today. And maybe if I just build on top of the untracked-cache code, I would be able to get to constant-time; I'll have to try that at some point. [1] http://www.youtube.com/watch?v=Dlguc63cRXg [2] http://comments.gmane.org/gmane.comp.version-control.git/189776 -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Watchman support for git
On Thu, 8 May 2014, Sebastian Schuberth wrote: On 03.05.2014 05:40, Felipe Contreras wrote: That's very interesting. Do you get similar improvements when doing something similar in Merurial (watchman vs . no watchman). I have not tried it. My understanding is that this is why Facebook wrote Watchman and added support for it to Mercurial, so I would assume that the improvements are at least this good. Yeah, my bet is that they are actually much better (because Mercurial can't be so optimized as Git). I'm interested in this number because if watchman in Git is improving it by 30%, but in Mercurial it's improving it by 100% (made up number), therefore it makes sens that you might want it more if you are using hg, but not so much if you are using git. Also, if similar repositories with Mercurial+watchman are actually faster than Git+watchman, that means that there's room for improvement in your implementation. This is not a big issue at this point of the process, just something nice to know. The article at [1] has some details, they claim "For our repository, enabling Watchman integration has made Mercurial's status command more than 5x faster than Git's status command". [1] https://code.facebook.com/posts/218678814984400/scaling-mercurial-at-facebook/ a lot of that speed comparison is going to depend on your storage system and the size of your repository. if you have a high-end enterprise storage system that tracks metadata very differently from the file contents (I've seen some that have rackes worth of SATA drives for contents and then 'small' arrays of a few dozen flash drives for the metadata), and then you have very large repositories (Facebook has everything in a single repo), then you have a perfect storm where something like watchman that talks the proprietary protocol of the storage array can be FAR faster than anything that needs to operate with the standard POSIX calls. That can easily account for the difference between the facebook announcement and the results presented for normal disks that show an improvement, but with even stock git being faster than improved mercurial. David Lang -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Watchman support for git
On 03.05.2014 05:40, Felipe Contreras wrote: >>> That's very interesting. Do you get similar improvements when doing >>> something similar in Merurial (watchman vs . no watchman). >> >> I have not tried it. My understanding is that this is why Facebook >> wrote Watchman and added support for it to Mercurial, so I would assume >> that the improvements are at least this good. > > Yeah, my bet is that they are actually much better (because Mercurial > can't be so optimized as Git). > > I'm interested in this number because if watchman in Git is improving it > by 30%, but in Mercurial it's improving it by 100% (made up number), > therefore it makes sens that you might want it more if you are using hg, > but not so much if you are using git. > > Also, if similar repositories with Mercurial+watchman are actually > faster than Git+watchman, that means that there's room for improvement > in your implementation. This is not a big issue at this point of the > process, just something nice to know. The article at [1] has some details, they claim "For our repository, enabling Watchman integration has made Mercurial's status command more than 5x faster than Git's status command". [1] https://code.facebook.com/posts/218678814984400/scaling-mercurial-at-facebook/ -- Sebastian Schuberth -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Watchman support for git
On Sun, 2014-05-04 at 07:15 +0700, Duy Nguyen wrote: > > I would like to merge the feature into master. It works well for me, > > and some of my colleagues who have tried it out. > > Have you tried to turn watchman on by default, then run it with git > test suite? That usually helps. I have. The tests work run fine under make, but prove sometimes freezes due to an issue in libwatchman which I just fixed (and which I plan to merge as soon as I can get a colleague to look the changes over). > > I can split the vmac patch into two, but one of them will remain quite > > large because it contains the code for VMAC and AES, which total a bit > > over 100k. Since the list will probably reject that, I'll post a link > > to a repository containing the patches. > > With the read-cache deamon, I think hashing cost is less of an issue, > so new hashing algorithm becomes less important. If you store the file > cache in the deamon's memory only, there's no need to hash anything. > But I guess you already tried this. I agree that with the daemon, the cost is less of an issue, but I am not 100% sure it is a non-issue; consecutive commands that need to read/write the index can still be slowed down. > > I'm not 100% sure how to split the watchman patch up. I could add the > > fs_cache code and then separately add the watchman code that populates > > the cache. Do you think there is a need to divide it up beyond this? > > I'll need to have closer look at your patches to give any suggestions. I have uploaded a new version (which is about 5-10% faster and which corrects some minor changes) to https://github.com/dturner-tw/git.git on the watchman branch. > Although if you don't mind waiting a bit, I can try to put my > untracked cache patches in good shape (hopefully in 2 weeks), then you > can mostly avoid touching dir.c and reuse my work. If the untracked cache patches are going to make it into master, then I would of course be willing to rewrite on top of them. But I would also like to have a sense of whether there is any interest in watchman support (outside of Twitter). For what it's worth, the numbers today for index version 4 are for my superscience repo are: ~380 (no watchman), ~260 (untracked-cache), ~175 (watchman). That's because untracked-cache still has to stat every directory. > I backed away from watchman support because I was worried about its > overhead (of watchman itself, and git/watchman IPC because it's not > designed specifically for git), which led me to try optimizing git as > much as possible without watchman first, then see how/if watchman can > help on top of that. I still think it's a good approach (maybe because > it started to make me doubt if watchman could pull a big performance > win on top to justify the changes to support it) I think on large repositories (especially deeply-nested ones), with the common case of a small number of changes, watchman will end up being a big win. Java tends towards deep nesting (src/main/java/com/twitter/common/...), which is probably why my test repo had the largest speedup (>50%). The IPC overhead might become bad if there were a large number of changes, but so far this has not been an issue for me in testing. -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Watchman support for git
On Tue, May 6, 2014 at 7:26 AM, Duy Nguyen wrote: > This discard_index() code was added > recently in e28f764 (unpack-trees: plug a memory leak - 2013-08-13). > As a workaround, perhaps we only do so when the sequencer is running. Hmm.. as o->result does not carry cache-tree anyway, the next assignment after discard_index() will "destroy" cache-tree anyway. I wrote and deleted the following sentence, but it looks like I should rewrite again. So if cache-tree has always been destroyed before, be careful when you try to keep cache-tree as cache-tree invalidation code in unpack-trees.c may not be well tested. -- Duy -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Watchman support for git
On Sat, May 3, 2014 at 7:52 AM, Duy Nguyen wrote: > wt_status_collect_changes_index() depends on how damaged cache-tree is > (in this case, totally scraped). watchman does not help this either. > We need to try to "heal" cache-tree as much as possible to reduce the > number. On the topic of cache-tree, I notice that unpack_trees() will call discard_index() in the end (or even discard_index() on a merge failure). That destroys cache-tree. unpack_trees() is the core of branch switching, or reset/merge, which I consider frequent operations. Cache-tree destruction is bad for "git diff --cached", "git commit" and maybe more. This discard_index() code was added recently in e28f764 (unpack-trees: plug a memory leak - 2013-08-13). As a workaround, perhaps we only do so when the sequencer is running. -- Duy -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: Watchman support for git
David Turner wrote: > On Fri, 2014-05-02 at 22:40 -0500, Felipe Contreras wrote: > > David Turner wrote: > > > On Fri, 2014-05-02 at 18:20 -0500, Felipe Contreras wrote: > > > > dturner@ wrote: > > > > > Test repository 1: Linux > > > > > > > > > > Linux is about 45k files in 3k directories. The average length of a > > > > > filename is about 32 bytes. > > > > > > > > > > Git status timing: > > > > > no watchman: 125ms > > > > > watchman: 90ms > > > > > > > > That's very interesting. Do you get similar improvements when doing > > > > something similar in Merurial (watchman vs . no watchman). > > > > > > I have not tried it. My understanding is that this is why Facebook > > > wrote Watchman and added support for it to Mercurial, so I would assume > > > that the improvements are at least this good. > > > > Yeah, my bet is that they are actually much better (because Mercurial > > can't be so optimized as Git). > > > > I'm interested in this number because if watchman in Git is improving it > > by 30%, but in Mercurial it's improving it by 100% (made up number), > > therefore it makes sens that you might want it more if you are using hg, > > but not so much if you are using git. > > > > Also, if similar repositories with Mercurial+watchman are actually > > faster than Git+watchman, that means that there's room for improvement > > in your implementation. This is not a big issue at this point of the > > process, just something nice to know. > > Converting git repos to hg seems to be incredibly slow. (I have not yet > tried doing it with git-remote-hg). I haven't profiled the git->hg conversion, only the other way around, but I don't see why it would be slow in git-remote-hg. The only restriction is that octopus merges are not supported, so you would probably need to find another repository. > But I did find a hg repository for > linux: https://bitbucket.org/orzel/linux-kernel-stable > > On this repo, I get: > hg without watchman: 620ms > hg with watchman: 264ms > (compared to 125ms/90ms for git). > > The number of syscalls is, perhaps also interesting: > no watchman / with watchman > hg 3 / 5421 > git 59180 / 599 > > (About 1/3 of hg's syscalls with watchman seem to be loading Python > stuff) Exactly what I thought :) 28% improvement for Git, 76% improvement for Mercurial. Still, the improvement would be welcome by big companies I bet. Cheers. -- Felipe Contreras -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: Watchman support for git
On Fri, 2014-05-02 at 22:40 -0500, Felipe Contreras wrote: > David Turner wrote: > > On Fri, 2014-05-02 at 18:20 -0500, Felipe Contreras wrote: > > > dturner@ wrote: > > > > Test repository 1: Linux > > > > > > > > Linux is about 45k files in 3k directories. The average length of a > > > > filename is about 32 bytes. > > > > > > > > Git status timing: > > > > no watchman: 125ms > > > > watchman: 90ms > > > > > > That's very interesting. Do you get similar improvements when doing > > > something similar in Merurial (watchman vs . no watchman). > > > > I have not tried it. My understanding is that this is why Facebook > > wrote Watchman and added support for it to Mercurial, so I would assume > > that the improvements are at least this good. > > Yeah, my bet is that they are actually much better (because Mercurial > can't be so optimized as Git). > > I'm interested in this number because if watchman in Git is improving it > by 30%, but in Mercurial it's improving it by 100% (made up number), > therefore it makes sens that you might want it more if you are using hg, > but not so much if you are using git. > > Also, if similar repositories with Mercurial+watchman are actually > faster than Git+watchman, that means that there's room for improvement > in your implementation. This is not a big issue at this point of the > process, just something nice to know. Converting git repos to hg seems to be incredibly slow. (I have not yet tried doing it with git-remote-hg). But I did find a hg repository for linux: https://bitbucket.org/orzel/linux-kernel-stable On this repo, I get: hg without watchman: 620ms hg with watchman: 264ms (compared to 125ms/90ms for git). The number of syscalls is, perhaps also interesting: no watchman / with watchman hg 3 / 5421 git 59180 / 599 (About 1/3 of hg's syscalls with watchman seem to be loading Python stuff) -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Watchman support for git
On Sun, May 4, 2014 at 3:49 AM, David Turner wrote: > On Sat, 2014-05-03 at 15:49 +0700, Duy Nguyen wrote: >> On Sat, May 3, 2014 at 11:39 AM, David Turner >> wrote: >> >> Index v4 and split index (and the following read-cache daemon, >> >> hopefully) >> > >> > Looking at some of the archives for read-cache daemon, it seems to be >> > somewhat similar to watchman, right? But I only saw inotify code; what >> > about Mac OS? Or am I misunderstanding what it is? >> >> It's mentioned in [1], the second paragraph, mostly to hide index I/O >> read cost and the SHA-1 hashing cost in the background. In theory it >> should work on all platforms that support multiple processes and >> efficient IPC. It can help load watchman file cache faster too. > > Yes, that seems like a good idea. > > I actually wrote some of a more-complicated, weirder version of this > idea. In my version, there was a long-running git daemon process that > held the index, the watchman file cache, and also objects loaded from > the object database. Other git commands would then send their > command-line and arguments over to the daemon, which would run the > commands and send stdin/out/err back. Of course, this is complicated > because git commands are designed to run then exit, so they often rely > on variables being initialized to zero, or fail to free memory. I used > the Boehm GC to handle the memory freeing problem. To handle variables > that needed to be reinitialized, I used __attribute__(section...) to put > them all into one section, which I could save on daemon startup and > restore after each command. I also replaced calls to exit() with a > function that called longjmp() so the daemon could survive commands > failing. Had I continued, I would also have had to track open file > descriptors to avoid leaking those. > > This was a giant mess that only sort-of worked: it was difficult to > track down all of the variables that needed to be reinitialized. > > The advantage of my method is that there was somewhat less data to > marshall over IPC, and that objects could be easily cached; the > disadvantage is complexity, massive code changes, and the fact that it > didn't actually totally work at the time I ran out of time. > > So I'm really looking forward to trying your version! Hm.. I may face the same problem if I'm not careful. So far I think the daemon only holds index data (with on-disk format, not in-memory), mainly to cut out SHA-1 hashing cost. This is still at the idea phase for me though, nothing is materialized yet. > I would like to merge the feature into master. It works well for me, > and some of my colleagues who have tried it out. Have you tried to turn watchman on by default, then run it with git test suite? That usually helps. > I can split the vmac patch into two, but one of them will remain quite > large because it contains the code for VMAC and AES, which total a bit > over 100k. Since the list will probably reject that, I'll post a link > to a repository containing the patches. With the read-cache deamon, I think hashing cost is less of an issue, so new hashing algorithm becomes less important. If you store the file cache in the deamon's memory only, there's no need to hash anything. But I guess you already tried this. > I'm not 100% sure how to split the watchman patch up. I could add the > fs_cache code and then separately add the watchman code that populates > the cache. Do you think there is a need to divide it up beyond this? I'll need to have closer look at your patches to give any suggestions. Although if you don't mind waiting a bit, I can try to put my untracked cache patches in good shape (hopefully in 2 weeks), then you can mostly avoid touching dir.c and reuse my work. I backed away from watchman support because I was worried about its overhead (of watchman itself, and git/watchman IPC because it's not designed specifically for git), which led me to try optimizing git as much as possible without watchman first, then see how/if watchman can help on top of that. I still think it's a good approach (maybe because it started to make me doubt if watchman could pull a big performance win on top to justify the changes to support it) -- Duy -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Watchman support for git
On Sat, 2014-05-03 at 15:49 +0700, Duy Nguyen wrote: > On Sat, May 3, 2014 at 11:39 AM, David Turner > wrote: > >> Index v4 and split index (and the following read-cache daemon, > >> hopefully) > > > > Looking at some of the archives for read-cache daemon, it seems to be > > somewhat similar to watchman, right? But I only saw inotify code; what > > about Mac OS? Or am I misunderstanding what it is? > > It's mentioned in [1], the second paragraph, mostly to hide index I/O > read cost and the SHA-1 hashing cost in the background. In theory it > should work on all platforms that support multiple processes and > efficient IPC. It can help load watchman file cache faster too. Yes, that seems like a good idea. I actually wrote some of a more-complicated, weirder version of this idea. In my version, there was a long-running git daemon process that held the index, the watchman file cache, and also objects loaded from the object database. Other git commands would then send their command-line and arguments over to the daemon, which would run the commands and send stdin/out/err back. Of course, this is complicated because git commands are designed to run then exit, so they often rely on variables being initialized to zero, or fail to free memory. I used the Boehm GC to handle the memory freeing problem. To handle variables that needed to be reinitialized, I used __attribute__(section...) to put them all into one section, which I could save on daemon startup and restore after each command. I also replaced calls to exit() with a function that called longjmp() so the daemon could survive commands failing. Had I continued, I would also have had to track open file descriptors to avoid leaking those. This was a giant mess that only sort-of worked: it was difficult to track down all of the variables that needed to be reinitialized. The advantage of my method is that there was somewhat less data to marshall over IPC, and that objects could be easily cached; the disadvantage is complexity, massive code changes, and the fact that it didn't actually totally work at the time I ran out of time. So I'm really looking forward to trying your version! > >> The last line could be a competition between watchman and my coming > >> "untracked cache" series. I expect to cut the number in that line at > >> least in half without external dependency. > > > > I hadn't seen the "untracked cached" work (I actually finished these > > patches a month or so ago but have been waiting for some internal > > reviews before sending them out). Looks interesting. It seems we use a > > similar strategy for handling ignores. > > Yep, mostly the same at the core, except that I exploit directory > mtime while you use inotify. Each approach has its own pros and cons, > I think. Both should face the same traps in caching (e.g. if you "git > rm --cached" a file, that file could be come either untracked, or > ignored). > > >> Patch 2/3 did not seem to make it to the list by the way.. > > > > Thanks for your comments. I just tried again to send patch 2/3. I do > > actually see the CC of it in my @twitter.com mailbox, but I don't see it > > in the archives on the web. Do you know if there is a reason the > > mailing list would reject it? > > Probably its size, 131K, which is also an indicator to split it (and > the third patch) into smaller patches if you want to merge this > feature in master eventually. I would like to merge the feature into master. It works well for me, and some of my colleagues who have tried it out. I can split the vmac patch into two, but one of them will remain quite large because it contains the code for VMAC and AES, which total a bit over 100k. Since the list will probably reject that, I'll post a link to a repository containing the patches. I'm not 100% sure how to split the watchman patch up. I could add the fs_cache code and then separately add the watchman code that populates the cache. Do you think there is a need to divide it up beyond this? -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Watchman support for git
On Sat, May 3, 2014 at 11:39 AM, David Turner wrote: >> Index v4 and split index (and the following read-cache daemon, >> hopefully) > > Looking at some of the archives for read-cache daemon, it seems to be > somewhat similar to watchman, right? But I only saw inotify code; what > about Mac OS? Or am I misunderstanding what it is? It's mentioned in [1], the second paragraph, mostly to hide index I/O read cost and the SHA-1 hashing cost in the background. In theory it should work on all platforms that support multiple processes and efficient IPC. It can help load watchman file cache faster too. >> The last line could be a competition between watchman and my coming >> "untracked cache" series. I expect to cut the number in that line at >> least in half without external dependency. > > I hadn't seen the "untracked cached" work (I actually finished these > patches a month or so ago but have been waiting for some internal > reviews before sending them out). Looks interesting. It seems we use a > similar strategy for handling ignores. Yep, mostly the same at the core, except that I exploit directory mtime while you use inotify. Each approach has its own pros and cons, I think. Both should face the same traps in caching (e.g. if you "git rm --cached" a file, that file could be come either untracked, or ignored). >> Patch 2/3 did not seem to make it to the list by the way.. > > Thanks for your comments. I just tried again to send patch 2/3. I do > actually see the CC of it in my @twitter.com mailbox, but I don't see it > in the archives on the web. Do you know if there is a reason the > mailing list would reject it? Probably its size, 131K, which is also an indicator to split it (and the third patch) into smaller patches if you want to merge this feature in master eventually. > At any rate, the contents may be found > at > https://github.com/dturner-tw/git/commit/cf587d54fc72d82a23267348afa2c4b60f14ce51.diff Good enough for me :) > >> initial >> reaction is storing the list of all paths seems too much, but I'll >> need to play with it a bit to understand it. > > I wonder if it would make sense to use the untracked cache as the > storage strategy, but use watchman as the update strategy. I'm afraid not. If a directory mtime is changed, which means files/dirs have been added or deleted, the untracked code would fall back to the opendir/readdir/is_excluded dance again on that directory. If we naively do the same using watchman, we lose its advantage that it knows exactly what files/dirs are added/removed. That kind of knowledge can help speed up the dance, which is not stored anywhere in the untracked cache. We could extend the "read-cache daemon" mentioned above though, to hide all the hard work in the background and present a good view to git: when a file/dir is added, read-cache daemon classifies the new files/dirs as tracked/untracked/ignore and update its untracked cache in memory. When "git status" asks about the index and untracked cache, it will receive the _updated_ cache (not the on disk version any more) with latest dir mtime so git can verify the cache is perfect and skip opendir/ All git does is to write the index down in the end to make the updated data permanent. It sounds interesting. But I'm not so sure if it's worth the complexity. [1] http://article.gmane.org/gmane.comp.version-control.git/247268 -- Duy -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Watchman support for git
On Sat, 2014-05-03 at 07:52 +0700, Duy Nguyen wrote: > On Sat, May 3, 2014 at 6:14 AM, wrote: > > The index format change might be less important with the split index; > > I haven't investigated that since at the time I wrote these patches, > > it didn't exist. > > This is the worst case scenario of "git status" on webkit.git (182k > files, path name 74 byte long on average), hot cache, no SSD > >366.379ms gitmodules_config:199 if (read_cache() < 0) die("index file > 0.004ms cmd_status:1294 read_cache_preload(&s.pathspec); >488.433ms cmd_status:1295 refresh_index(&the_index, REFRESH_QUIE >456.495ms cmd_status:1299 update_index_if_able(&the_index, &inde > 13.088ms wt_status_collect:616 wt_status_collect_changes_worktree(s) >706.926ms wt_status_collect:621 wt_status_collect_changes_index(s) >100.495ms lazy_init_name_hash:136 { int nr; if (istate->name_hash_initia >921.185ms wt_status_collect:622 wt_status_collect_untracked(s) > > real0m2.969s > user0m1.943s > sys 0m1.021s For me, those times are: 0m0.581s (no watchman, index v4) 0m0.465s (watchman, index v4) 0m0.445s (watchman, index v5) That's not huge win on its own, but (a) it's better than nothing and (b) it lays the groundwork for other improvements. A fair amount (~12%) of the time seems to be spent in zlib; this varies based on how the data is packed IIRC. > Index v4 and split index (and the following read-cache daemon, > hopefully) Looking at some of the archives for read-cache daemon, it seems to be somewhat similar to watchman, right? But I only saw inotify code; what about Mac OS? Or am I misunderstanding what it is? > should help reduce numbers of the 1st and 4th lines, I > expect to less than 50ms each line. lazy_init_name_hash could be taken > away with read-cache daemon also. > > core.preloadindex can cut the total number of 2nd and 3rd lines by > half. Watchman should help in these two lines, but it should do better > than core.preloadindex. > > wt_status_collect_changes_index() depends on how damaged cache-tree is > (in this case, totally scraped). watchman does not help this either. > We need to try to "heal" cache-tree as much as possible to reduce the > number. > > The last line could be a competition between watchman and my coming > "untracked cache" series. I expect to cut the number in that line at > least in half without external dependency. I hadn't seen the "untracked cached" work (I actually finished these patches a month or so ago but have been waiting for some internal reviews before sending them out). Looks interesting. It seems we use a similar strategy for handling ignores. > Patch 2/3 did not seem to make it to the list by the way.. Thanks for your comments. I just tried again to send patch 2/3. I do actually see the CC of it in my @twitter.com mailbox, but I don't see it in the archives on the web. Do you know if there is a reason the mailing list would reject it? At any rate, the contents may be found at https://github.com/dturner-tw/git/commit/cf587d54fc72d82a23267348afa2c4b60f14ce51.diff > initial > reaction is storing the list of all paths seems too much, but I'll > need to play with it a bit to understand it. I wonder if it would make sense to use the untracked cache as the storage strategy, but use watchman as the update strategy. -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: Watchman support for git
David Turner wrote: > On Fri, 2014-05-02 at 18:20 -0500, Felipe Contreras wrote: > > dturner@ wrote: > > > Test repository 1: Linux > > > > > > Linux is about 45k files in 3k directories. The average length of a > > > filename is about 32 bytes. > > > > > > Git status timing: > > > no watchman: 125ms > > > watchman: 90ms > > > > That's very interesting. Do you get similar improvements when doing > > something similar in Merurial (watchman vs . no watchman). > > I have not tried it. My understanding is that this is why Facebook > wrote Watchman and added support for it to Mercurial, so I would assume > that the improvements are at least this good. Yeah, my bet is that they are actually much better (because Mercurial can't be so optimized as Git). I'm interested in this number because if watchman in Git is improving it by 30%, but in Mercurial it's improving it by 100% (made up number), therefore it makes sens that you might want it more if you are using hg, but not so much if you are using git. Also, if similar repositories with Mercurial+watchman are actually faster than Git+watchman, that means that there's room for improvement in your implementation. This is not a big issue at this point of the process, just something nice to know. -- Felipe Contreras -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: Watchman support for git
On Fri, 2014-05-02 at 18:20 -0500, Felipe Contreras wrote: > dturner@ wrote: > > Test repository 1: Linux > > > > Linux is about 45k files in 3k directories. The average length of a > > filename is about 32 bytes. > > > > Git status timing: > > no watchman: 125ms > > watchman: 90ms > > That's very interesting. Do you get similar improvements when doing > something similar in Merurial (watchman vs . no watchman). I have not tried it. My understanding is that this is why Facebook wrote Watchman and added support for it to Mercurial, so I would assume that the improvements are at least this good. -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Watchman support for git
On Sat, May 3, 2014 at 6:14 AM, wrote: > The index format change might be less important with the split index; > I haven't investigated that since at the time I wrote these patches, > it didn't exist. This is the worst case scenario of "git status" on webkit.git (182k files, path name 74 byte long on average), hot cache, no SSD 366.379ms gitmodules_config:199 if (read_cache() < 0) die("index file 0.004ms cmd_status:1294 read_cache_preload(&s.pathspec); 488.433ms cmd_status:1295 refresh_index(&the_index, REFRESH_QUIE 456.495ms cmd_status:1299 update_index_if_able(&the_index, &inde 13.088ms wt_status_collect:616 wt_status_collect_changes_worktree(s) 706.926ms wt_status_collect:621 wt_status_collect_changes_index(s) 100.495ms lazy_init_name_hash:136 { int nr; if (istate->name_hash_initia 921.185ms wt_status_collect:622 wt_status_collect_untracked(s) real0m2.969s user0m1.943s sys 0m1.021s Index v4 and split index (and the following read-cache daemon, hopefully) should help reduce numbers of the 1st and 4th lines, I expect to less than 50ms each line. lazy_init_name_hash could be taken away with read-cache daemon also. core.preloadindex can cut the total number of 2nd and 3rd lines by half. Watchman should help in these two lines, but it should do better than core.preloadindex. wt_status_collect_changes_index() depends on how damaged cache-tree is (in this case, totally scraped). watchman does not help this either. We need to try to "heal" cache-tree as much as possible to reduce the number. The last line could be a competition between watchman and my coming "untracked cache" series. I expect to cut the number in that line at least in half without external dependency. Patch 2/3 did not seem to make it to the list by the way.. initial reaction is storing the list of all paths seems too much, but I'll need to play with it a bit to understand it. -- Duy -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: Watchman support for git
dturner@ wrote: > Test repository 1: Linux > > Linux is about 45k files in 3k directories. The average length of a > filename is about 32 bytes. > > Git status timing: > no watchman: 125ms > watchman: 90ms That's very interesting. Do you get similar improvements when doing something similar in Merurial (watchman vs . no watchman). -- Felipe Contreras -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Watchman support for git
The most sigificant patch uses Facebook's watchman daemon[1] to monitor the repository work tree for changes. This makes allows git status to avoid traversing the entire work tree to find changes. This change requires libwatchman[2], a client library that I wrote for watchman. While making the watchman change, I also made a change to the index format (contributed here in a separate patch). Index integrity checking uses the same SHA1 algorithm as the rest of git; this is actually relatively slow. It's not a huge part of run-time, but since I wanted to do the same checking for the Watchman filesystem cache (which is about twice as large as the index), I decided to optimize it anyway. I switched to VMAC. VMAC is supposed to be a MAC, but there's no reason it can't be used with a fixed key as a simple integrity check. VMAC is roughly five times faster than SHA1 on my machine; This adds up to a 5% overal speed improvement on git status (depending on the structure of your repo, and about 15% on git diff --cached with no cached changes). The index format change might be less important with the split index; I haven't investigated that since at the time I wrote these patches, it didn't exist. Some numbers follow. They are on my laptop, which has 4x i5-2520M processors, 8GB of RAM, and a solid state disk. They're all tested with a hot cache. Test repository 1: Linux Linux is about 45k files in 3k directories. The average length of a filename is about 32 bytes. Git status timing: no watchman: 125ms watchman: 90ms Test repository 2: Superscience My second test repository (which is a semi-synthetic repo generated from various Twitter internal repos) is somewhat larger than this, and gets a correspondingly larger improvement. It is about 65k files in 20k directories; the average length of a filename is 67 bytes. Git status timing: no watchman, index version 4: 370 ms no watchman, index version 5: 365 ms watchman, index version 4: 170 ms watchman, index version 5: 165 ms [1] https://github.com/facebook/watchman [2] https://github.com/twitter/libwatchman -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html