Re: [RFC PATCH 00/18] Multi-pack index (MIDX)
On 6/6/2018 4:13 AM, Ævar Arnfjörð Bjarmason wrote: On Mon, Jan 08 2018, Derrick Stolee wrote: On 1/7/2018 5:42 PM, Ævar Arnfjörð Bjarmason wrote: On Sun, Jan 07 2018, Derrick Stolee jotted: git log --oneline --raw --parents Num Packs | Before MIDX | After MIDX | Rel % | 1 pack % --+-+++-- 1 | 35.64 s |35.28 s | -1.0% | -1.0% 24 | 90.81 s |40.06 s | -55.9% | +12.4% 127 |257.97 s |42.25 s | -83.6% | +18.6% The last column is the relative difference between the MIDX-enabled repo and the single-pack repo. The goal of the MIDX feature is to present the ODB as if it was fully repacked, so there is still room for improvement. Changing the command to git log --oneline --raw --parents --abbrev=40 has no observable difference (sub 1% change in all cases). This is likely due to the repack I used putting commits and trees in a small number of packfiles so the MRU cache workes very well. On more naturally-created lists of packfiles, there can be up to 20% improvement on this command. We are using a version of this patch with an upcoming release of GVFS. This feature is particularly important in that space since GVFS performs a "prefetch" step that downloads a pack of commits and trees on a daily basis. These packfiles are placed in an alternate that is shared by all enlistments. Some users have 150+ packfiles and the MRU misses and abbreviation computations are significant. Now, GVFS manages the MIDX file after adding new prefetch packfiles using the following command: git midx --write --update-head --delete-expired --pack-dir= (Not a critique of this, just a (stupid) question) What's the practical use-case for this feature? Since it doesn't help with --abbrev=40 the speedup is all in the part that ensures we don't show an ambiguous SHA-1. The point of including the --abbrev=40 is to point out that object lookups do not get slower with the MIDX feature. Using these "git log" options is a good way to balance object lookups and abbreviations with object parsing and diff machinery.[...] [snip] [...]And while the public data shape I shared did not show a difference, our private testing of the Windows repository did show a valuable improvement when isolating to object lookups and ignoring abbreviation calculations. Replying to this old thread since I see you're prepearing the MIDX for submission again and this seemed like the best venue. You're really good at tracking new commits in my GitHub page ;) Your WIP branch (github.com/git/derrickstolee/midx/upstream) still only references the speedups in abbreviation calculations, but here you allude to other improvements. It would be very nice to have some summary of that in docs / commit messages when you submit this. The new version is essentially a complete rewrite of the feature, since we learned a lot about how to add a new data store from the commit-graph series. The design document [1] refers to some of the immediate benefits and future benefits. Some of these future benefits were discussed at the contributor's summit [2]. [1] https://github.com/derrickstolee/git/blob/midx/upstream/Documentation/technical/midx.txt [2] https://public-inbox.org/git/alpine.DEB.2.20.1803091557510.23109@alexmv-linux/ Git Merge 2018 Contributor's summit notes (includes discussion of MIDX) I've been meaning to get around to submitting something like I mentioned in https://public-inbox.org/git/87efn0bkls@evledraar.gmail.com/ i.e. a way to expand the abbrev mode to not check disambiguations, which would look something like: core.abbrev = 20 core.validateAbbrev = false Or: core.abbrev = +2 core.validateAbbrev = false So, using the example from the above referenced E-Mail +2 would make linux.git emit hashes of 14 characters, without any abbreviation checking (just trusting in statistics to work in your favor). As noted by JS in this thread that wouldn't be acceptable for your use-case, but there's plenty of people (including me) who'd appreciate the speedup without being a 100% sure we're emitting unambiguous hashes, since that trade-off is better than time spent generating another index on-disk. So I see it as a complimentary & orthogonal feature. But with that implemented I wouldn't get any benefit from things that use the MIDX that aren't abbreviations, so what are those? The MIDX is built for handling many packfiles. As opposed to the commit-graph feature, your repo needs to be _really_big_ to need the MIDX. Most just repack into one packfile on a regular basis. One case for vanilla Git: we've heard from lots of customers disabling gc.auto in their build machines because they didn't want to wait for a repack/gc after a fetch and before a build. Then, they end up in a many-pack situation because they never scheduled time for that repack/gc. For GVFS, we virtualize the repo, so objects can b
Re: [RFC PATCH 00/18] Multi-pack index (MIDX)
On Mon, Jan 08 2018, Derrick Stolee wrote: > On 1/7/2018 5:42 PM, Ævar Arnfjörð Bjarmason wrote: >> >> On Sun, Jan 07 2018, Derrick Stolee jotted: >> >>> git log --oneline --raw --parents >>> >>> Num Packs | Before MIDX | After MIDX | Rel % | 1 pack % >>> --+-+++-- >>> 1 | 35.64 s |35.28 s | -1.0% | -1.0% >>> 24 | 90.81 s |40.06 s | -55.9% | +12.4% >>>127 |257.97 s |42.25 s | -83.6% | +18.6% >>> >>> The last column is the relative difference between the MIDX-enabled repo >>> and the single-pack repo. The goal of the MIDX feature is to present the >>> ODB as if it was fully repacked, so there is still room for improvement. >>> >>> Changing the command to >>> >>> git log --oneline --raw --parents --abbrev=40 >>> >>> has no observable difference (sub 1% change in all cases). This is likely >>> due to the repack I used putting commits and trees in a small number of >>> packfiles so the MRU cache workes very well. On more naturally-created >>> lists of packfiles, there can be up to 20% improvement on this command. >>> >>> We are using a version of this patch with an upcoming release of GVFS. >>> This feature is particularly important in that space since GVFS performs >>> a "prefetch" step that downloads a pack of commits and trees on a daily >>> basis. These packfiles are placed in an alternate that is shared by all >>> enlistments. Some users have 150+ packfiles and the MRU misses and >>> abbreviation computations are significant. Now, GVFS manages the MIDX file >>> after adding new prefetch packfiles using the following command: >>> >>> git midx --write --update-head --delete-expired --pack-dir= >> >> (Not a critique of this, just a (stupid) question) >> >> What's the practical use-case for this feature? Since it doesn't help >> with --abbrev=40 the speedup is all in the part that ensures we don't >> show an ambiguous SHA-1. > > The point of including the --abbrev=40 is to point out that object > lookups do not get slower with the MIDX feature. Using these "git log" > options is a good way to balance object lookups and abbreviations with > object parsing and diff machinery.[...] [snip] > [...]And while the public data shape I shared did not show a > difference, our private testing of the Windows repository did show a > valuable improvement when isolating to object lookups and ignoring > abbreviation calculations. Replying to this old thread since I see you're prepearing the MIDX for submission again and this seemed like the best venue. Your WIP branch (github.com/git/derrickstolee/midx/upstream) still only references the speedups in abbreviation calculations, but here you allude to other improvements. It would be very nice to have some summary of that in docs / commit messages when you submit this. I've been meaning to get around to submitting something like I mentioned in https://public-inbox.org/git/87efn0bkls@evledraar.gmail.com/ i.e. a way to expand the abbrev mode to not check disambiguations, which would look something like: core.abbrev = 20 core.validateAbbrev = false Or: core.abbrev = +2 core.validateAbbrev = false So, using the example from the above referenced E-Mail +2 would make linux.git emit hashes of 14 characters, without any abbreviation checking (just trusting in statistics to work in your favor). As noted by JS in this thread that wouldn't be acceptable for your use-case, but there's plenty of people (including me) who'd appreciate the speedup without being a 100% sure we're emitting unambiguous hashes, since that trade-off is better than time spent generating another index on-disk. So I see it as a complimentary & orthogonal feature. But with that implemented I wouldn't get any benefit from things that use the MIDX that aren't abbreviations, so what are those? >> The reason we do that at all is because it makes for a prettier UI. > > We tried setting core.abbrev=40 on GVFS enlistments to speed up > performance and the users rebelled against the hideous output. They > would rather have slower speeds than long hashes. > >> Are there things that both want the pretty SHA-1 and also care about the >> throughput? I'd have expected machine parsing to just use >> --no-abbrev-commit. > > The --raw flag outputs blob hashes, so the --abbrev=40 covers all hashes. > >> If something cares about both throughput and e.g. is saving the >> abbreviated SHA-1s isn't it better off picking some arbitrary size >> (e.g. --abbrev=20), after all the default abbreviation is going to show >> something as small as possible, which may soon become ambigous after the >> next commit. > > Unfortunately, with the way the abbreviation algorithms work, using > --abbrev=20 will have similar performance problems because you still > need to inspect all packfiles to ensure there isn't a collision in the > first 20 hex characters. > > Thanks, > -Stolee
Re: [RFC PATCH 00/18] Multi-pack index (MIDX)
On Wednesday, January 10, 2018 02:39:13 PM Derrick Stolee wrote: > On 1/10/2018 1:25 PM, Martin Fick wrote: > > On Sunday, January 07, 2018 01:14:41 PM Derrick Stolee > > > > wrote: > >> This RFC includes a new way to index the objects in > >> multiple packs using one file, called the multi-pack > >> index (MIDX). > > > > ... > > > >> The main goals of this RFC are: > >> > >> * Determine interest in this feature. > >> > >> * Find other use cases for the MIDX feature. > > > > My interest in this feature would be to speed up fetches > > when there is more than one large pack-file with many of > > the same objects that are in other pack-files. What > > does your MIDX design do when it encounters multiple > > copies of the same object in different pack files? > > Does it index them all, or does it keep a single copy? > > The MIDX currently keeps only one reference to each > object. Duplicates are dropped during writing. (See the > care taken in commit 04/18 to avoid duplicates.) Since > midx_sha1_compare() does not use anything other than the > OID to order the objects, there is no decision being made > about which pack is "better". The MIDX writes the first > copy it finds and discards the others. This would likely speed things up then, even if the chosen objects are suboptimal. > It would not be difficult to include a check in > midx_sha1_compare() to favor one packfile over another > based on some measurement (size? mtime?). Since this > would be a heuristic at best, I left it out of the > current patch. Yeah, I didn't know what heuristic to use either, I tended to think that the bigger pack-file would be valuable because it is more likely to share deltas with other objects in that pack, so more easy to send them. However, that is likely only true during clones or other large fetches when we want most objects. During small "update" fetches, the newer packs might be better? I also thought that objects in alternates should be considered less valuable for my use case, however in the github fork use case, the alternates might be more valuable? So yes heuristics, and I don't know what is best. Perhaps some config options could be used to set heuristics like this. Whatever the heuristics are, since they would be a part of the MIDX packing process it would be easy to change. This assumes that keeping only one copy in the index is the right thing. The question would be, what if we need different heuristics for different operations? Would it make sense to have multiple MIDX files covering the same packs then, one for fetch, one for merge...? > > In our Gerrit instance (Gerrit uses jgit), we have > > multiple copies of the linux kernel repos linked > > together via the alternatives file mechanism. > > GVFS also uses alternates for sharing packfiles across > multiple copies of the repo. The MIDX is designed to > cover all packfiles in the same directory, but is not > designed to cover packfiles in multiple alternates; > currently, each alternate would need its own MIDX file. > Does that cause issues with your setup? No, since the other large packfiles are all in other repos (alternates). Is there a reason the MIDX would not want to cover the alternates? If you don't then you would seemingly loose any benefits of the MIDX when you have alternates in use. ... > > It would be nice if this use case could be improved with > > MIDX. To do so, it seems that it would either require > > that MIDX either only put "the best" version of an > > object (i.e. pre-select which one to use), or include > > the extra information to help make the selection > > process of which copy to use (perhaps based on the > > operation being performed) fast. > > I'm not sure if there is sufficient value in storing > multiple references to the same object stored in multiple > packfiles. There could be value in carefully deciding > which copy is "best" during the MIDX write, but during > read is not a good time to make such a decision. It also > increases the size of the file to store multiple copies. Yes, I am not sure either, it would be good to have input from experts here. > > This also leads me to ask, what other additional > > information (bitmaps?) for other operations, besides > > object location, might suddenly be valuable in an index > > that potentially points to multiple copies of objects? > > Would such information be appropriate in MIDX, or would > > it be better in another index? > > For applications to bitmaps, it is probably best that we > only include one copy of each object. Otherwise, we need > to include extra bits in the bitmaps for those copies > (when asking "is this object in the bitmap?"). Agreed. Would the MIDX potentially pave the way to create multi-pack bitmaps? > Thanks for the context with Gerrit's duplicate object > problem. I'll try to incorporate it in to the design > document (commit 01/18) for the v1 patch. FYI, this is not a typical Gerrit thing, it is
Re: [RFC PATCH 00/18] Multi-pack index (MIDX)
On 1/10/2018 1:25 PM, Martin Fick wrote: On Sunday, January 07, 2018 01:14:41 PM Derrick Stolee wrote: This RFC includes a new way to index the objects in multiple packs using one file, called the multi-pack index (MIDX). ... The main goals of this RFC are: * Determine interest in this feature. * Find other use cases for the MIDX feature. My interest in this feature would be to speed up fetches when there is more than one large pack-file with many of the same objects that are in other pack-files. What does your MIDX design do when it encounters multiple copies of the same object in different pack files? Does it index them all, or does it keep a single copy? The MIDX currently keeps only one reference to each object. Duplicates are dropped during writing. (See the care taken in commit 04/18 to avoid duplicates.) Since midx_sha1_compare() does not use anything other than the OID to order the objects, there is no decision being made about which pack is "better". The MIDX writes the first copy it finds and discards the others. It would not be difficult to include a check in midx_sha1_compare() to favor one packfile over another based on some measurement (size? mtime?). Since this would be a heuristic at best, I left it out of the current patch. In our Gerrit instance (Gerrit uses jgit), we have multiple copies of the linux kernel repos linked together via the alternatives file mechanism. GVFS also uses alternates for sharing packfiles across multiple copies of the repo. The MIDX is designed to cover all packfiles in the same directory, but is not designed to cover packfiles in multiple alternates; currently, each alternate would need its own MIDX file. Does that cause issues with your setup? These repos have many different references (mostly Gerrit change references), but they share most of the common objects from the mainline. I have found that during a large fetch such as a clone, jgit spends a significant amount of extra time by having the extra large pack-files from the other repos visible to it, usually around an extra minute per instance of these (without them, the clone takes around 7mins). This adds up easily with a few repos extra repos, it can almost double the time. My investigations have shown that this is due to jgit searching each of these pack files to decide which version of each object to send. I don't fully understand its selection criteria, however if I shortcut it to just pick the first copy of an object that it finds, I regain my lost time. I don't know if git suffers from a similar problem? If git doesn't suffer from this then it likely just uses the first copy of an object it finds (which may not be the best object to send?) It would be nice if this use case could be improved with MIDX. To do so, it seems that it would either require that MIDX either only put "the best" version of an object (i.e. pre-select which one to use), or include the extra information to help make the selection process of which copy to use (perhaps based on the operation being performed) fast. I'm not sure if there is sufficient value in storing multiple references to the same object stored in multiple packfiles. There could be value in carefully deciding which copy is "best" during the MIDX write, but during read is not a good time to make such a decision. It also increases the size of the file to store multiple copies. This also leads me to ask, what other additional information (bitmaps?) for other operations, besides object location, might suddenly be valuable in an index that potentially points to multiple copies of objects? Would such information be appropriate in MIDX, or would it be better in another index? For applications to bitmaps, it is probably best that we only include one copy of each object. Otherwise, we need to include extra bits in the bitmaps for those copies (when asking "is this object in the bitmap?"). Thanks for the context with Gerrit's duplicate object problem. I'll try to incorporate it in to the design document (commit 01/18) for the v1 patch. Thanks, -Stolee
Re: [RFC PATCH 00/18] Multi-pack index (MIDX)
On Sunday, January 07, 2018 01:14:41 PM Derrick Stolee wrote: > This RFC includes a new way to index the objects in > multiple packs using one file, called the multi-pack > index (MIDX). ... > The main goals of this RFC are: > > * Determine interest in this feature. > > * Find other use cases for the MIDX feature. My interest in this feature would be to speed up fetches when there is more than one large pack-file with many of the same objects that are in other pack-files. What does your MIDX design do when it encounters multiple copies of the same object in different pack files? Does it index them all, or does it keep a single copy? In our Gerrit instance (Gerrit uses jgit), we have multiple copies of the linux kernel repos linked together via the alternatives file mechanism. These repos have many different references (mostly Gerrit change references), but they share most of the common objects from the mainline. I have found that during a large fetch such as a clone, jgit spends a significant amount of extra time by having the extra large pack-files from the other repos visible to it, usually around an extra minute per instance of these (without them, the clone takes around 7mins). This adds up easily with a few repos extra repos, it can almost double the time. My investigations have shown that this is due to jgit searching each of these pack files to decide which version of each object to send. I don't fully understand its selection criteria, however if I shortcut it to just pick the first copy of an object that it finds, I regain my lost time. I don't know if git suffers from a similar problem? If git doesn't suffer from this then it likely just uses the first copy of an object it finds (which may not be the best object to send?) It would be nice if this use case could be improved with MIDX. To do so, it seems that it would either require that MIDX either only put "the best" version of an object (i.e. pre-select which one to use), or include the extra information to help make the selection process of which copy to use (perhaps based on the operation being performed) fast. This also leads me to ask, what other additional information (bitmaps?) for other operations, besides object location, might suddenly be valuable in an index that potentially points to multiple copies of objects? Would such information be appropriate in MIDX, or would it be better in another index? Thanks, -Martin -- The Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, hosted by The Linux Foundation
Re: [RFC PATCH 00/18] Multi-pack index (MIDX)
Hi Stefan, On Tue, 9 Jan 2018, Stefan Beller wrote: > On Tue, Jan 9, 2018 at 5:05 AM, Johannes Schindelin > wrote: > > > Just to throw this out there: --abbrev=8! would be one possible > > convention to state "I want exactly 8 hex digits, don't bother > > checking for uniqueness". > > > > Not sure how urgent such a feature is. > > You seem to have spent some time on rebase, including lamenting on the > in-expressiveness of the instruction sheet (c.f. "rebase a mergy history > sucks") Let's call it "todo list" again? And yes, I spend a *lot* of time in rebase. It's part of my duties as Git for Windows maintainer. > And in that light, I'd like to propose a new naming scheme: > > (a) assume that we "tag" HEAD at the start of the rebase > (b) any abbreviation must be given as committish anchored to said ref: > > pick REBASE_HEAD~1 commit subject > pick REBASE_HEAD~2 distak the gostim > pick REBASE_HEAD~3 Document foo > pick REBASE_HEAD~4 Testing the star-gazer I do not necessarily agree that this is better because it is valid only locally, only in the current rebase (and you enter new problems because you implicitly require REBASE_HEAD to be worktree-local lest you prevent simultaneous rebases in related worktrees). That prevents you from, say, chatting to your colleague and mentioning that commit that you are uncertain about. Imagine asking "Hey Todd, do you know what REBASE_HEAD~2 is all about?". Granted, if you ask about "deadbeef" and it just so happens that this shortened name is ambiguous in Todd's repository. But that is pretty unlikely, isn't it? > And as we have the full descriptiveness of the committishs there, each > commit can be described in a unique way using the graph relationship. I > am just throwing the name REBASE_HEAD out there to trigger some > associations ("similar to FETCH_HEAD"), but I dislike the name. Not only that. It would also be confusing to read after reordering... > (c) this would not solve the problem of mergy history, yet. For that > we'd need to introduce more keywords, that allow us to move around in > the DAG, [...] You probably missed my hint that I have a working solution for that. Please have a look at https://github.com/git/git/pull/447 Short version: for a commit topology like this: A - B - C \ / D the generated list would look like this: # branch D pick 0123 A label branch-point pick 1234 D label D reset branch-point pick 2345 B merge 3456 D C Ciao, Dscho
Re: [RFC PATCH 00/18] Multi-pack index (MIDX)
On Tue, Jan 09, 2018 at 02:05:09PM +0100, Johannes Schindelin wrote: > > I think that could be easily worked around for rebase by asking git to > > check ambiguity during the conversion. > > Sure. > > It also points to a flaw in your reasoning, and you should take my example > further: previously, we guaranteed unique abbreviations, and who is to say > that there is no script out there relying on that guarantee? You? I'm not sure what you think my reasoning was, since I made the same point directly after. ;) > > I am a bit curious if there's a bounded probability that people would > > find acceptable for Git to give an ambiguous abbreviation. We already > > accept 1 in 2^160, of course. But would, e.g., one in a million be OK? > > What is that going to solve? I'm not sure I understand your question. We can bound the probability of a collision by increasing the size of the abbreviation. My question was whether there is a size where that tradeoff makes sense. So it "solves" the problem of worrying about collisions. > I think a better alternative would be to introduce a new abbreviation mode > that is *intended* to stop caring about unique abbreviations. > > In web interfaces, for example, it makes tons of sense to show, say, 8 > digits in link texts and have the full name in the actual link URL. I'm not sure if that would be useful option or not. I certainly agree that static abbreviations are a useful thing to want. But for scripted callers, it's usually easy to cut down the abbreviation yourself. I think your web interface example is a good one. The caller will ask Git for the full 40-hex hash, and then cut it down itself when generating the link (and I just so happen to have worked on a web interface that does this[1]). So would it be useful for humans to use? That's what I'm not sure about. It seems cumbersome to have to add "--abbrev=8!" on the command line to each invocation. But if it were triggered by config, it seems like we would risk breaking scripts. > I am just hesitant to change things that would break existing setups. Me too. I'm not sure if I'm seriously advocating the "bound the probability" thing. It's just something I think is worth pondering a little. -Peff [1] Actually, there _is_ something that it's useful for Git to tell the caller, which is the approximate number of objects in the repository. Eight digits is not enough for the kernel, for example, and these days we use 12 digits by default. I'm not sure if such a web interface would rather ask for "%H %h" and generate the link text from the second half, or it would rather just get a ballpark on the number of objects, and do the simple abbreviation math itself (right now GitHub does not do either; I don't know about other interfaces).
Re: [RFC PATCH 00/18] Multi-pack index (MIDX)
Stefan Beller writes: > Johannes wrote: >> I think a better alternative would be to introduce a new abbreviation mode >> that is *intended* to stop caring about unique abbreviations. >> >> In web interfaces, for example, it makes tons of sense to show, say, 8 >> digits in link texts and have the full name in the actual link URL. > > And that is what (b) would solve, as it is shorter than the full hash and > yet exact. I still do not get it, even though I fully agree that in Web UI what Dscho envisions makes tons of sense. Use some short handle that does not need to be unique inside repository to display, but have a full information that can be used by machines. The shortened ones need to be unique _within_ a given todo list, to be displayed as text to be "clicked", where the A element's href attribute that surrounds that "clickable" text has fully unambiguous information. And that fully unambiguous information, because it is for machine consumption, can be a full object name without any shortening. I do not see a need for REBASE_HEAD~$n to make it less robust (i.e. we now need to worry about making sure it is not lost or moved while we need it and clean it up when we are done, etc.)
Re: [RFC PATCH 00/18] Multi-pack index (MIDX)
On Tue, Jan 9, 2018 at 12:12 PM, Junio C Hamano wrote: > Stefan Beller writes: > >> And in that light, I'd like to propose a new naming scheme: >> >> (a) assume that we "tag" HEAD at the start of the rebase >> (b) any abbreviation must be given as committish anchored to said ref: >> >> pick REBASE_HEAD~1 commit subject >> pick REBASE_HEAD~2 distak the gostim >> pick REBASE_HEAD~3 Document foo >> pick REBASE_HEAD~4 Testing the star-gazer >> >> And as we have the full descriptiveness of the committishs there, >> each commit can be described in a unique way using the graph relationship. >> I am just throwing the name REBASE_HEAD out there to trigger some >> associations >> ("similar to FETCH_HEAD"), but I dislike the name. >> >> (c) this would not solve the problem of mergy history, yet. For that we'd >> need >> to introduce more keywords, that allow us to move around in the DAG, >> such that we can reset to a specific revision or name revisions on the >> fly. >> So maybe all we need is "reset", "name" (== short lived tag), >> "merge" (rewrite parents of HEAD) to be expressive enough, but still keep >> the line oriented interface? > > It is correct to say that (c) is an issue that is not solved by (b), > but with the current scheme, the commits are internally referenced > by full object names, and just before it is presented to the end > users, these names are abbreviated down to unique prefix. The > machinery expands these abbreviated names back to the full names > before going forward, so it is not an issue that we are creating new > commits during the rebase. > > Does it make it easier to read if we used ~$n notation based on a > fixed point, instead of shortened unique object names? What > improvement is (b) trying to achieve? > Johannes wrote: > I think a better alternative would be to introduce a new abbreviation mode > that is *intended* to stop caring about unique abbreviations. > > In web interfaces, for example, it makes tons of sense to show, say, 8 > digits in link texts and have the full name in the actual link URL. And that is what (b) would solve, as it is shorter than the full hash and yet exact. (c) was mostly speculation on top of (b) if we can take it any further.
Re: [RFC PATCH 00/18] Multi-pack index (MIDX)
Stefan Beller writes: > And in that light, I'd like to propose a new naming scheme: > > (a) assume that we "tag" HEAD at the start of the rebase > (b) any abbreviation must be given as committish anchored to said ref: > > pick REBASE_HEAD~1 commit subject > pick REBASE_HEAD~2 distak the gostim > pick REBASE_HEAD~3 Document foo > pick REBASE_HEAD~4 Testing the star-gazer > > And as we have the full descriptiveness of the committishs there, > each commit can be described in a unique way using the graph relationship. > I am just throwing the name REBASE_HEAD out there to trigger some associations > ("similar to FETCH_HEAD"), but I dislike the name. > > (c) this would not solve the problem of mergy history, yet. For that we'd need > to introduce more keywords, that allow us to move around in the DAG, > such that we can reset to a specific revision or name revisions on the > fly. > So maybe all we need is "reset", "name" (== short lived tag), > "merge" (rewrite parents of HEAD) to be expressive enough, but still keep > the line oriented interface? It is correct to say that (c) is an issue that is not solved by (b), but with the current scheme, the commits are internally referenced by full object names, and just before it is presented to the end users, these names are abbreviated down to unique prefix. The machinery expands these abbreviated names back to the full names before going forward, so it is not an issue that we are creating new commits during the rebase. Does it make it easier to read if we used ~$n notation based on a fixed point, instead of shortened unique object names? What improvement is (b) trying to achieve?
Re: [RFC PATCH 00/18] Multi-pack index (MIDX)
On Tue, Jan 9, 2018 at 5:05 AM, Johannes Schindelin wrote: > Hi Peff, > > On Tue, 9 Jan 2018, Jeff King wrote: > >> On Mon, Jan 08, 2018 at 02:43:00PM +0100, Johannes Schindelin wrote: >> >> > Take the interactive rebase for example. It generates todo lists with >> > abbreviated commit names, for readability (and it is *really* important to >> > keep this readable). As we expect new objects to be introduced by the >> > interactive rebase, we convert that todo list to unabbreviated commit >> > names before executing the interactive rebase. >> > >> > Your idea (to not care about unambiguous abbreviations) would break that. >> >> I think that could be easily worked around for rebase by asking git to >> check ambiguity during the conversion. > > Sure. > > It also points to a flaw in your reasoning, and you should take my example > further: previously, we guaranteed unique abbreviations, and who is to say > that there is no script out there relying on that guarantee? You? > >> But I agree it's a potential problem for other scripts that we might not >> have control over. I hadn't really intended this to be the default >> behavior (my patch was just trying to show the direction). But it does >> make for a pretty awful interface if callers have to opt into it >> manually ("git log --oneline --no-really-go-fast"). > > Exactly. > >> I am a bit curious if there's a bounded probability that people would >> find acceptable for Git to give an ambiguous abbreviation. We already >> accept 1 in 2^160, of course. But would, e.g., one in a million be OK? > > What is that going to solve? > > I think a better alternative would be to introduce a new abbreviation mode > that is *intended* to stop caring about unique abbreviations. > > In web interfaces, for example, it makes tons of sense to show, say, 8 > digits in link texts and have the full name in the actual link URL. > > Currently, we have no way to tell Git: look, I want to have a short label > for this, but I do not really care that this is unambiguous, I just want > something to talk about. > > (And you are correct, of course, that the uniqueness of abbreviations will > no longer be guaranteed for partial clones, and it is already not > guaranteed for shallow clones, and it will only be possible to guarantee > this, ever, for one particular repository at a particular time.) > > I am just hesitant to change things that would break existing setups. > > Just to throw this out there: --abbrev=8! would be one possible convention > to state "I want exactly 8 hex digits, don't bother checking for > uniqueness". > > Not sure how urgent such a feature is. You seem to have spent some time on rebase, including lamenting on the in-expressiveness of the instruction sheet (c.f. "rebase a mergy history sucks") And in that light, I'd like to propose a new naming scheme: (a) assume that we "tag" HEAD at the start of the rebase (b) any abbreviation must be given as committish anchored to said ref: pick REBASE_HEAD~1 commit subject pick REBASE_HEAD~2 distak the gostim pick REBASE_HEAD~3 Document foo pick REBASE_HEAD~4 Testing the star-gazer And as we have the full descriptiveness of the committishs there, each commit can be described in a unique way using the graph relationship. I am just throwing the name REBASE_HEAD out there to trigger some associations ("similar to FETCH_HEAD"), but I dislike the name. (c) this would not solve the problem of mergy history, yet. For that we'd need to introduce more keywords, that allow us to move around in the DAG, such that we can reset to a specific revision or name revisions on the fly. So maybe all we need is "reset", "name" (== short lived tag), "merge" (rewrite parents of HEAD) to be expressive enough, but still keep the line oriented interface? Stefan
Re: [RFC PATCH 00/18] Multi-pack index (MIDX)
Hi Peff, On Tue, 9 Jan 2018, Jeff King wrote: > On Mon, Jan 08, 2018 at 02:43:00PM +0100, Johannes Schindelin wrote: > > > Take the interactive rebase for example. It generates todo lists with > > abbreviated commit names, for readability (and it is *really* important to > > keep this readable). As we expect new objects to be introduced by the > > interactive rebase, we convert that todo list to unabbreviated commit > > names before executing the interactive rebase. > > > > Your idea (to not care about unambiguous abbreviations) would break that. > > I think that could be easily worked around for rebase by asking git to > check ambiguity during the conversion. Sure. It also points to a flaw in your reasoning, and you should take my example further: previously, we guaranteed unique abbreviations, and who is to say that there is no script out there relying on that guarantee? You? > But I agree it's a potential problem for other scripts that we might not > have control over. I hadn't really intended this to be the default > behavior (my patch was just trying to show the direction). But it does > make for a pretty awful interface if callers have to opt into it > manually ("git log --oneline --no-really-go-fast"). Exactly. > I am a bit curious if there's a bounded probability that people would > find acceptable for Git to give an ambiguous abbreviation. We already > accept 1 in 2^160, of course. But would, e.g., one in a million be OK? What is that going to solve? I think a better alternative would be to introduce a new abbreviation mode that is *intended* to stop caring about unique abbreviations. In web interfaces, for example, it makes tons of sense to show, say, 8 digits in link texts and have the full name in the actual link URL. Currently, we have no way to tell Git: look, I want to have a short label for this, but I do not really care that this is unambiguous, I just want something to talk about. (And you are correct, of course, that the uniqueness of abbreviations will no longer be guaranteed for partial clones, and it is already not guaranteed for shallow clones, and it will only be possible to guarantee this, ever, for one particular repository at a particular time.) I am just hesitant to change things that would break existing setups. Just to throw this out there: --abbrev=8! would be one possible convention to state "I want exactly 8 hex digits, don't bother checking for uniqueness". Not sure how urgent such a feature is. Ciao, Dscho
Re: [RFC PATCH 00/18] Multi-pack index (MIDX)
On Mon, Jan 08, 2018 at 08:43:44AM -0500, Derrick Stolee wrote: > > Just to make sure I'm parsing this correctly: normal lookups do get faster > > when you have a single index, given the right setup? > > > > I'm curious what that setup looked like. Is it just tons and tons of > > packs? Is it ones where the packs do not follow the mru patterns very > > well? > > The way I repacked the Linux repo creates an artificially good set of packs > for the MRU cache. When the packfiles are partitioned instead by the time > the objects were pushed to a remote, the MRU cache performs poorly. > Improving these object lookups are a primary reason for the MIDX feature, > and almost all commands improve because of it. 'git log' is just the > simplest to use for demonstration. Interesting. The idea of the pack mru (and the "last pack looked in" before it) is that there would be good locality for time-segmented packs (like those generated by pushes), since most operations also tend to visit history in chronological-ish order (e.g., log). Tree-wide operations would be an exception there, though, since files would have many ages across the tree (though in practice, one would hope that pretty-ancient history would eventually be replaced in a modern tree). I've often wondered, though, if the mru (and "last pack") work mainly because the "normal" distribution for a repository is to have one big pack with most of history, and then a couple of smaller ones (what hasn't been packed yet). Even something as naive as "look in the last pack" works there, because it turns into "look in the biggest pack". If it has most of the objects, it's the most likely place for the previous and the next objects to be. But from my understanding of how you handle the Windows repository, you have tons of equal-sized packs that are never coalesced. Which is quite a different pattern. I would expect your "--max-pack-size" thing to end up with something roughly segmented like pushes, though, just because we do order the write phase in reverse chronological order. Well, mostly. We do each object type in its own chunks, and there's some reordering based on deltas. So given the sizes, it's likely that most of your trees all end up in a few packs. > > There may be other reasons to want MIDX or something like it, but I just > > wonder if we can do this much simpler thing to cover the abbreviation > > case. I guess the question is whether somebody is going to be annoyed in > > the off chance that they hit a collision. > > No only are users going to be annoyed when they hit collisions after > copy-pasting an abbreviated hash, there are also a large number of tools > that people build that use abbreviated hashes (either for presenting to > users or because they didn't turn off abbreviations). > > Abbreviations cause performance issues in other commands, too (like > 'fetch'!), so whatever short-circuit you put in, it would need to be global. > A flag on one builtin would not suffice. Yeah, I agree it's potentially problematic for a lot of reasons. I was just hoping we could bound the probabilities in a way that is acceptable. Maybe it's a fool's errand. -Peff
Re: [RFC PATCH 00/18] Multi-pack index (MIDX)
On Mon, Jan 08, 2018 at 02:43:00PM +0100, Johannes Schindelin wrote: > Take the interactive rebase for example. It generates todo lists with > abbreviated commit names, for readability (and it is *really* important to > keep this readable). As we expect new objects to be introduced by the > interactive rebase, we convert that todo list to unabbreviated commit > names before executing the interactive rebase. > > Your idea (to not care about unambiguous abbreviations) would break that. I think that could be easily worked around for rebase by asking git to check ambiguity during the conversion. Speed is much less of a problem there, because we're doing a relatively small number of abbreviations (compared to "git log --oneline --raw", which is abbreviating almost every object in the repository). But I agree it's a potential problem for other scripts that we might not have control over. I hadn't really intended this to be the default behavior (my patch was just trying to show the direction). But it does make for a pretty awful interface if callers have to opt into it manually ("git log --oneline --no-really-go-fast"). I am a bit curious if there's a bounded probability that people would find acceptable for Git to give an ambiguous abbreviation. We already accept 1 in 2^160, of course. But would, e.g., one in a million be OK? -Peff
Re: [RFC PATCH 00/18] Multi-pack index (MIDX)
On 1/8/2018 5:20 AM, Jeff King wrote: On Sun, Jan 07, 2018 at 07:08:54PM -0500, Derrick Stolee wrote: (Not a critique of this, just a (stupid) question) What's the practical use-case for this feature? Since it doesn't help with --abbrev=40 the speedup is all in the part that ensures we don't show an ambiguous SHA-1. The point of including the --abbrev=40 is to point out that object lookups do not get slower with the MIDX feature. Using these "git log" options is a good way to balance object lookups and abbreviations with object parsing and diff machinery. And while the public data shape I shared did not show a difference, our private testing of the Windows repository did show a valuable improvement when isolating to object lookups and ignoring abbreviation calculations. Just to make sure I'm parsing this correctly: normal lookups do get faster when you have a single index, given the right setup? I'm curious what that setup looked like. Is it just tons and tons of packs? Is it ones where the packs do not follow the mru patterns very well? The way I repacked the Linux repo creates an artificially good set of packs for the MRU cache. When the packfiles are partitioned instead by the time the objects were pushed to a remote, the MRU cache performs poorly. Improving these object lookups are a primary reason for the MIDX feature, and almost all commands improve because of it. 'git log' is just the simplest to use for demonstration. I think it's worth thinking a bit about, because... If something cares about both throughput and e.g. is saving the abbreviated SHA-1s isn't it better off picking some arbitrary size (e.g. --abbrev=20), after all the default abbreviation is going to show something as small as possible, which may soon become ambigous after the next commit. Unfortunately, with the way the abbreviation algorithms work, using --abbrev=20 will have similar performance problems because you still need to inspect all packfiles to ensure there isn't a collision in the first 20 hex characters. ...if what we primarily care about speeding up is abbreviations, is it crazy to consider disabling the disambiguation step entirely? The results of find_unique_abbrev are already a bit of a probability game. They're guaranteed at the moment of generation, but as more objects are added, ambiguities may be introduced. Likewise, what's unambiguous for you may not be for somebody else you're communicating with, if they have their own clone. Since we now scale the default abbreviation with the size of the repo, that gives us a bounded and pretty reasonable probability that we won't hit a collision at all[1]. I.e., what if we did something like this: diff --git a/sha1_name.c b/sha1_name.c index 611c7d24dd..04c661ba85 100644 --- a/sha1_name.c +++ b/sha1_name.c @@ -600,6 +600,15 @@ int find_unique_abbrev_r(char *hex, const unsigned char *sha1, int len) if (len == GIT_SHA1_HEXSZ || !len) return GIT_SHA1_HEXSZ; + /* +* A default length of 10 implies a repository big enough that it's +* getting expensive to double check the ambiguity of each object, +* and the chance that any particular object of interest has a +* collision is low. +*/ + if (len >= 10) + return len; + mad.init_len = len; mad.cur_len = len; mad.hex = hex; If I repack my linux.git with "--max-pack-size=64m", I get 67 packs. The patch above drops "git log --oneline --raw" on the resulting repo from ~150s to ~30s. With a single pack, it goes from ~33s ~29s. Less impressive, but there's still some benefit. There may be other reasons to want MIDX or something like it, but I just wonder if we can do this much simpler thing to cover the abbreviation case. I guess the question is whether somebody is going to be annoyed in the off chance that they hit a collision. No only are users going to be annoyed when they hit collisions after copy-pasting an abbreviated hash, there are also a large number of tools that people build that use abbreviated hashes (either for presenting to users or because they didn't turn off abbreviations). Abbreviations cause performance issues in other commands, too (like 'fetch'!), so whatever short-circuit you put in, it would need to be global. A flag on one builtin would not suffice. -Peff [1] I'd have to check the numbers, but IIRC we've set the scaling so that the chance of having a _single_ collision in the repository is less than 50%, and rounding to the conservative side (since each hex char gives us 4 bits). And indeed, "git log --oneline --raw" on linux.git does not seem to have any collisions at its default of 12 characters, at least in my clone. We could also consider switching core.disambiguate to "commit", which makes even a collision less likely to annoy the user.
Re: [RFC PATCH 00/18] Multi-pack index (MIDX)
Hi Peff, On Mon, 8 Jan 2018, Jeff King wrote: > On Sun, Jan 07, 2018 at 07:08:54PM -0500, Derrick Stolee wrote: > > > > (Not a critique of this, just a (stupid) question) > > > > > > What's the practical use-case for this feature? Since it doesn't > > > help with --abbrev=40 the speedup is all in the part that ensures we > > > don't show an ambiguous SHA-1. > > > > The point of including the --abbrev=40 is to point out that object > > lookups do not get slower with the MIDX feature. Using these "git log" > > options is a good way to balance object lookups and abbreviations with > > object parsing and diff machinery. And while the public data shape I > > shared did not show a difference, our private testing of the Windows > > repository did show a valuable improvement when isolating to object > > lookups and ignoring abbreviation calculations. > > Just to make sure I'm parsing this correctly: normal lookups do get > faster when you have a single index, given the right setup? > > I'm curious what that setup looked like. Is it just tons and tons of > packs? Is it ones where the packs do not follow the mru patterns very > well? > > I think it's worth thinking a bit about, because... > > > > If something cares about both throughput and e.g. is saving the > > > abbreviated SHA-1s isn't it better off picking some arbitrary size > > > (e.g. --abbrev=20), after all the default abbreviation is going to show > > > something as small as possible, which may soon become ambigous after the > > > next commit. > > > > Unfortunately, with the way the abbreviation algorithms work, using > > --abbrev=20 will have similar performance problems because you still need to > > inspect all packfiles to ensure there isn't a collision in the first 20 hex > > characters. > > ...if what we primarily care about speeding up is abbreviations, is it > crazy to consider disabling the disambiguation step entirely? Not crazy. But it would break stuff. Because... > The results of find_unique_abbrev are already a bit of a probability > game. They're guaranteed at the moment of generation, but as more > objects are added, ambiguities may be introduced. Likewise, what's > unambiguous for you may not be for somebody else you're communicating > with, if they have their own clone. ... this is only a probability game in the long term, when you consider new objects to enter from *somewhere*. But in purely local settings, when we expect no new objects to be introduced, we do use known-unambiguous abbreviations. Take the interactive rebase for example. It generates todo lists with abbreviated commit names, for readability (and it is *really* important to keep this readable). As we expect new objects to be introduced by the interactive rebase, we convert that todo list to unabbreviated commit names before executing the interactive rebase. Your idea (to not care about unambiguous abbreviations) would break that. Ciao, Dscho
Re: [RFC PATCH 00/18] Multi-pack index (MIDX)
On Mon, Jan 08 2018, Jeff King jotted: > On Mon, Jan 08, 2018 at 05:20:29AM -0500, Jeff King wrote: > >> I.e., what if we did something like this: >> >> diff --git a/sha1_name.c b/sha1_name.c >> index 611c7d24dd..04c661ba85 100644 >> --- a/sha1_name.c >> +++ b/sha1_name.c >> @@ -600,6 +600,15 @@ int find_unique_abbrev_r(char *hex, const unsigned char >> *sha1, int len) >> if (len == GIT_SHA1_HEXSZ || !len) >> return GIT_SHA1_HEXSZ; >> >> +/* >> + * A default length of 10 implies a repository big enough that it's >> + * getting expensive to double check the ambiguity of each object, >> + * and the chance that any particular object of interest has a >> + * collision is low. >> + */ >> +if (len >= 10) >> +return len; >> + > > Oops, this really needs to terminate the string in addition to returning > the length (so it was always printing 40 characters in most cases). The > correct patch is below, but it performs the same. > > diff --git a/sha1_name.c b/sha1_name.c > index 611c7d24dd..5921298a80 100644 > --- a/sha1_name.c > +++ b/sha1_name.c > @@ -600,6 +600,17 @@ int find_unique_abbrev_r(char *hex, const unsigned char > *sha1, int len) > if (len == GIT_SHA1_HEXSZ || !len) > return GIT_SHA1_HEXSZ; > > + /* > + * A default length of 10 implies a repository big enough that it's > + * getting expensive to double check the ambiguity of each object, > + * and the chance that any particular object of interest has a > + * collision is low. > + */ > + if (len >= 10) { > + hex[len] = 0; > + return len; > + } > + > mad.init_len = len; > mad.cur_len = len; > mad.hex = hex; That looks much more sensible, leaving aside other potential benefits of MIDX. Given the argument Linus made in e6c587c733 ("abbrev: auto size the default abbreviation", 2016-09-30) maybe we should add a small integer to the length for good measure, i.e. something like: if (len >= 10) { int extra = 2; /* or just 1? or maybe 0 ... */ hex[len + extra] = 0; return len + extra; } I tried running: git log --pretty=format:%h --abbrev=7 | perl -nE 'chomp; say length'|sort|uniq -c|sort -nr On several large repos, which forces something like the disambiguation we had before Linus's patch, on e.g. David Turner's 2015-04-03-1M-git.git test repo it's: 952858 7 44541 8 2861 9 168 10 17 11 2 12 And the default abbreviation picks 12. I haven't yet found a case where it's wrong, but if we wanted to be extra safe we could just add a byte or two to the SHA-1.
Re: [RFC PATCH 00/18] Multi-pack index (MIDX)
On Mon, Jan 08 2018, Derrick Stolee jotted: > On 1/7/2018 5:42 PM, Ævar Arnfjörð Bjarmason wrote: >> If something cares about both throughput and e.g. is saving the >> abbreviated SHA-1s isn't it better off picking some arbitrary size >> (e.g. --abbrev=20), after all the default abbreviation is going to show >> something as small as possible, which may soon become ambigous after the >> next commit. > > Unfortunately, with the way the abbreviation algorithms work, using > --abbrev=20 will have similar performance problems because you still > need to inspect all packfiles to ensure there isn't a collision in the > first 20 hex characters. I meant (but forgot to write) that this would be some new mode, e.g. --abbrev=20 --no-abbrev-check which would just perform a substr() of the 40 character SHA-1. It might be interesting to add that for reasons completely unrelated to your series. Thanks for answering the rest.
Re: [RFC PATCH 00/18] Multi-pack index (MIDX)
On Mon, Jan 08, 2018 at 05:20:29AM -0500, Jeff King wrote: > I.e., what if we did something like this: > > diff --git a/sha1_name.c b/sha1_name.c > index 611c7d24dd..04c661ba85 100644 > --- a/sha1_name.c > +++ b/sha1_name.c > @@ -600,6 +600,15 @@ int find_unique_abbrev_r(char *hex, const unsigned char > *sha1, int len) > if (len == GIT_SHA1_HEXSZ || !len) > return GIT_SHA1_HEXSZ; > > + /* > + * A default length of 10 implies a repository big enough that it's > + * getting expensive to double check the ambiguity of each object, > + * and the chance that any particular object of interest has a > + * collision is low. > + */ > + if (len >= 10) > + return len; > + Oops, this really needs to terminate the string in addition to returning the length (so it was always printing 40 characters in most cases). The correct patch is below, but it performs the same. diff --git a/sha1_name.c b/sha1_name.c index 611c7d24dd..5921298a80 100644 --- a/sha1_name.c +++ b/sha1_name.c @@ -600,6 +600,17 @@ int find_unique_abbrev_r(char *hex, const unsigned char *sha1, int len) if (len == GIT_SHA1_HEXSZ || !len) return GIT_SHA1_HEXSZ; + /* +* A default length of 10 implies a repository big enough that it's +* getting expensive to double check the ambiguity of each object, +* and the chance that any particular object of interest has a +* collision is low. +*/ + if (len >= 10) { + hex[len] = 0; + return len; + } + mad.init_len = len; mad.cur_len = len; mad.hex = hex;
Re: [RFC PATCH 00/18] Multi-pack index (MIDX)
On Sun, Jan 07, 2018 at 07:08:54PM -0500, Derrick Stolee wrote: > > (Not a critique of this, just a (stupid) question) > > > > What's the practical use-case for this feature? Since it doesn't help > > with --abbrev=40 the speedup is all in the part that ensures we don't > > show an ambiguous SHA-1. > > The point of including the --abbrev=40 is to point out that object lookups > do not get slower with the MIDX feature. Using these "git log" options is a > good way to balance object lookups and abbreviations with object parsing and > diff machinery. And while the public data shape I shared did not show a > difference, our private testing of the Windows repository did show a > valuable improvement when isolating to object lookups and ignoring > abbreviation calculations. Just to make sure I'm parsing this correctly: normal lookups do get faster when you have a single index, given the right setup? I'm curious what that setup looked like. Is it just tons and tons of packs? Is it ones where the packs do not follow the mru patterns very well? I think it's worth thinking a bit about, because... > > If something cares about both throughput and e.g. is saving the > > abbreviated SHA-1s isn't it better off picking some arbitrary size > > (e.g. --abbrev=20), after all the default abbreviation is going to show > > something as small as possible, which may soon become ambigous after the > > next commit. > > Unfortunately, with the way the abbreviation algorithms work, using > --abbrev=20 will have similar performance problems because you still need to > inspect all packfiles to ensure there isn't a collision in the first 20 hex > characters. ...if what we primarily care about speeding up is abbreviations, is it crazy to consider disabling the disambiguation step entirely? The results of find_unique_abbrev are already a bit of a probability game. They're guaranteed at the moment of generation, but as more objects are added, ambiguities may be introduced. Likewise, what's unambiguous for you may not be for somebody else you're communicating with, if they have their own clone. Since we now scale the default abbreviation with the size of the repo, that gives us a bounded and pretty reasonable probability that we won't hit a collision at all[1]. I.e., what if we did something like this: diff --git a/sha1_name.c b/sha1_name.c index 611c7d24dd..04c661ba85 100644 --- a/sha1_name.c +++ b/sha1_name.c @@ -600,6 +600,15 @@ int find_unique_abbrev_r(char *hex, const unsigned char *sha1, int len) if (len == GIT_SHA1_HEXSZ || !len) return GIT_SHA1_HEXSZ; + /* +* A default length of 10 implies a repository big enough that it's +* getting expensive to double check the ambiguity of each object, +* and the chance that any particular object of interest has a +* collision is low. +*/ + if (len >= 10) + return len; + mad.init_len = len; mad.cur_len = len; mad.hex = hex; If I repack my linux.git with "--max-pack-size=64m", I get 67 packs. The patch above drops "git log --oneline --raw" on the resulting repo from ~150s to ~30s. With a single pack, it goes from ~33s ~29s. Less impressive, but there's still some benefit. There may be other reasons to want MIDX or something like it, but I just wonder if we can do this much simpler thing to cover the abbreviation case. I guess the question is whether somebody is going to be annoyed in the off chance that they hit a collision. -Peff [1] I'd have to check the numbers, but IIRC we've set the scaling so that the chance of having a _single_ collision in the repository is less than 50%, and rounding to the conservative side (since each hex char gives us 4 bits). And indeed, "git log --oneline --raw" on linux.git does not seem to have any collisions at its default of 12 characters, at least in my clone. We could also consider switching core.disambiguate to "commit", which makes even a collision less likely to annoy the user.
Re: [RFC PATCH 00/18] Multi-pack index (MIDX)
On 1/7/2018 5:42 PM, Ævar Arnfjörð Bjarmason wrote: On Sun, Jan 07 2018, Derrick Stolee jotted: git log --oneline --raw --parents Num Packs | Before MIDX | After MIDX | Rel % | 1 pack % --+-+++-- 1 | 35.64 s |35.28 s | -1.0% | -1.0% 24 | 90.81 s |40.06 s | -55.9% | +12.4% 127 |257.97 s |42.25 s | -83.6% | +18.6% The last column is the relative difference between the MIDX-enabled repo and the single-pack repo. The goal of the MIDX feature is to present the ODB as if it was fully repacked, so there is still room for improvement. Changing the command to git log --oneline --raw --parents --abbrev=40 has no observable difference (sub 1% change in all cases). This is likely due to the repack I used putting commits and trees in a small number of packfiles so the MRU cache workes very well. On more naturally-created lists of packfiles, there can be up to 20% improvement on this command. We are using a version of this patch with an upcoming release of GVFS. This feature is particularly important in that space since GVFS performs a "prefetch" step that downloads a pack of commits and trees on a daily basis. These packfiles are placed in an alternate that is shared by all enlistments. Some users have 150+ packfiles and the MRU misses and abbreviation computations are significant. Now, GVFS manages the MIDX file after adding new prefetch packfiles using the following command: git midx --write --update-head --delete-expired --pack-dir= (Not a critique of this, just a (stupid) question) What's the practical use-case for this feature? Since it doesn't help with --abbrev=40 the speedup is all in the part that ensures we don't show an ambiguous SHA-1. The point of including the --abbrev=40 is to point out that object lookups do not get slower with the MIDX feature. Using these "git log" options is a good way to balance object lookups and abbreviations with object parsing and diff machinery. And while the public data shape I shared did not show a difference, our private testing of the Windows repository did show a valuable improvement when isolating to object lookups and ignoring abbreviation calculations. The reason we do that at all is because it makes for a prettier UI. We tried setting core.abbrev=40 on GVFS enlistments to speed up performance and the users rebelled against the hideous output. They would rather have slower speeds than long hashes. Are there things that both want the pretty SHA-1 and also care about the throughput? I'd have expected machine parsing to just use --no-abbrev-commit. The --raw flag outputs blob hashes, so the --abbrev=40 covers all hashes. If something cares about both throughput and e.g. is saving the abbreviated SHA-1s isn't it better off picking some arbitrary size (e.g. --abbrev=20), after all the default abbreviation is going to show something as small as possible, which may soon become ambigous after the next commit. Unfortunately, with the way the abbreviation algorithms work, using --abbrev=20 will have similar performance problems because you still need to inspect all packfiles to ensure there isn't a collision in the first 20 hex characters. Thanks, -Stolee
Re: [RFC PATCH 00/18] Multi-pack index (MIDX)
On Sun, Jan 07 2018, Derrick Stolee jotted: > git log --oneline --raw --parents > > Num Packs | Before MIDX | After MIDX | Rel % | 1 pack % > --+-+++-- > 1 | 35.64 s |35.28 s | -1.0% | -1.0% >24 | 90.81 s |40.06 s | -55.9% | +12.4% > 127 |257.97 s |42.25 s | -83.6% | +18.6% > > The last column is the relative difference between the MIDX-enabled repo > and the single-pack repo. The goal of the MIDX feature is to present the > ODB as if it was fully repacked, so there is still room for improvement. > > Changing the command to > > git log --oneline --raw --parents --abbrev=40 > > has no observable difference (sub 1% change in all cases). This is likely > due to the repack I used putting commits and trees in a small number of > packfiles so the MRU cache workes very well. On more naturally-created > lists of packfiles, there can be up to 20% improvement on this command. > > We are using a version of this patch with an upcoming release of GVFS. > This feature is particularly important in that space since GVFS performs > a "prefetch" step that downloads a pack of commits and trees on a daily > basis. These packfiles are placed in an alternate that is shared by all > enlistments. Some users have 150+ packfiles and the MRU misses and > abbreviation computations are significant. Now, GVFS manages the MIDX file > after adding new prefetch packfiles using the following command: > > git midx --write --update-head --delete-expired --pack-dir= (Not a critique of this, just a (stupid) question) What's the practical use-case for this feature? Since it doesn't help with --abbrev=40 the speedup is all in the part that ensures we don't show an ambiguous SHA-1. The reason we do that at all is because it makes for a prettier UI. Are there things that both want the pretty SHA-1 and also care about the throughput? I'd have expected machine parsing to just use --no-abbrev-commit. If something cares about both throughput and e.g. is saving the abbreviated SHA-1s isn't it better off picking some arbitrary size (e.g. --abbrev=20), after all the default abbreviation is going to show something as small as possible, which may soon become ambigous after the next commit.
[RFC PATCH 00/18] Multi-pack index (MIDX)
This RFC includes a new way to index the objects in multiple packs using one file, called the multi-pack index (MIDX). The commits are split into parts as follows: [01] - A full design document. [02] - The full file format for MIDX files. [03] - Creation of core.midx config setting. [04-12] - Creation of "midx" builtin that writes, reads, and deletes MIDX files. [13-18] - Consume MIDX files for abbreviations and object loads. The main goals of this RFC are: * Determine interest in this feature. * Find other use cases for the MIDX feature. * Design a proper command-line interface for constructing and checking MIDX files. The current "midx" builtin is probably inadequate. * Determine what additional changes are needed before the feature can be merged. Specifically, I'm interested in the interactions with repack and fsck. The current patch also does not update the MIDX on a fetch (which adds a packfile) but would be valuable. Whenever possible, I tried to leave out features that could be added in a later patch. * Consider splitting this patch into multiple patches, such as: i. The MIDX design document. ii. The command-line interface for building and reading MIDX files. iii. Integrations with abbreviations and object lookups. Please do not send any style nits to this patch, as I expect the code to change dramatically before we consider merging. I created three copies of the Linux repo with 1, 24, and 127 packfiles each using 'git repack -adfF --max-pack-size=[64m|16m]'. These copies gave significant performance improvements on the following comand: git log --oneline --raw --parents Num Packs | Before MIDX | After MIDX | Rel % | 1 pack % --+-+++-- 1 | 35.64 s |35.28 s | -1.0% | -1.0% 24 | 90.81 s |40.06 s | -55.9% | +12.4% 127 |257.97 s |42.25 s | -83.6% | +18.6% The last column is the relative difference between the MIDX-enabled repo and the single-pack repo. The goal of the MIDX feature is to present the ODB as if it was fully repacked, so there is still room for improvement. Changing the command to git log --oneline --raw --parents --abbrev=40 has no observable difference (sub 1% change in all cases). This is likely due to the repack I used putting commits and trees in a small number of packfiles so the MRU cache workes very well. On more naturally-created lists of packfiles, there can be up to 20% improvement on this command. We are using a version of this patch with an upcoming release of GVFS. This feature is particularly important in that space since GVFS performs a "prefetch" step that downloads a pack of commits and trees on a daily basis. These packfiles are placed in an alternate that is shared by all enlistments. Some users have 150+ packfiles and the MRU misses and abbreviation computations are significant. Now, GVFS manages the MIDX file after adding new prefetch packfiles using the following command: git midx --write --update-head --delete-expired --pack-dir= As that release deploys we will gather more specific numbers on the performance improvements and report them in this thread. Derrick Stolee (18): docs: Multi-Pack Index (MIDX) Design Notes midx: specify midx file format midx: create core.midx config setting midx: write multi-pack indexes for an object list midx: create midx builtin with --write mode midx: add t5318-midx.sh test script midx: teach midx --write to update midx-head midx: teach git-midx to read midx file details midx: find details of nth object in midx midx: use existing midx when writing midx: teach git-midx to clear midx files midx: teach git-midx to delete expired files t5318-midx.h: confirm git actions are stable midx: load midx files when loading packs midx: use midx for approximate object count midx: nth_midxed_object_oid() and bsearch_midx() sha1_name: use midx for abbreviations packfile: use midx for object loads .gitignore | 1 + Documentation/config.txt | 3 + Documentation/git-midx.txt | 106 Documentation/technical/multi-pack-index.txt | 149 + Documentation/technical/pack-format.txt | 85 +++ Makefile | 2 + builtin.h| 1 + builtin/midx.c | 352 +++ cache.h | 1 + command-list.txt | 1 + config.c | 5 + environment.c| 2 + git.c| 1 + midx.c | 850 +++ midx.h | 136 + packfile.c | 79 ++- packfile.h | 2 + sha1_n