Re: [RFC PATCH 00/18] Multi-pack index (MIDX)

2018-06-06 Thread Derrick Stolee

On 6/6/2018 4:13 AM, Ævar Arnfjörð Bjarmason wrote:

On Mon, Jan 08 2018, Derrick Stolee wrote:


On 1/7/2018 5:42 PM, Ævar Arnfjörð Bjarmason wrote:

On Sun, Jan 07 2018, Derrick Stolee jotted:


  git log --oneline --raw --parents

Num Packs | Before MIDX | After MIDX |  Rel % | 1 pack %
----------+-------------+------------+--------+---------
        1 |     35.64 s |    35.28 s |  -1.0% |   -1.0%
       24 |     90.81 s |    40.06 s | -55.9% |  +12.4%
      127 |    257.97 s |    42.25 s | -83.6% |  +18.6%

The last column is the relative difference between the MIDX-enabled repo
and the single-pack repo. The goal of the MIDX feature is to present the
ODB as if it was fully repacked, so there is still room for improvement.
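For readers double-checking the columns, the percentages can be recomputed
from the raw timings. A quick sketch; the "1 pack %" baseline here is assumed
to be the 35.64 s pre-MIDX single-pack time, which matches the table to
within rounding:

```python
# Timings from the table above, in seconds.
before = {1: 35.64, 24: 90.81, 127: 257.97}   # many packs, no MIDX
after = {1: 35.28, 24: 40.06, 127: 42.25}     # same packs, MIDX enabled
baseline = before[1]  # assumed: the fully-repacked, single-pack time

for n in sorted(before):
    rel = (after[n] - before[n]) / before[n] * 100   # "Rel %" column
    vs_one = (after[n] - baseline) / baseline * 100  # "1 pack %" column
    print(f"{n:3d} packs: {rel:+.1f}% vs before, {vs_one:+.1f}% vs one pack")
```

(The 127-pack row lands at +18.5% here against the table's +18.6%, i.e.
within one rounding step.)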

Changing the command to

  git log --oneline --raw --parents --abbrev=40

has no observable difference (sub-1% change in all cases). This is likely
due to the repack I used, which put commits and trees in a small number of
packfiles, so the MRU cache works very well. On more naturally created
lists of packfiles, there can be up to a 20% improvement on this command.

We are using a version of this patch with an upcoming release of GVFS.
This feature is particularly important in that space since GVFS performs
a "prefetch" step that downloads a pack of commits and trees on a daily
basis. These packfiles are placed in an alternate that is shared by all
enlistments. Some users have 150+ packfiles and the MRU misses and
abbreviation computations are significant. Now, GVFS manages the MIDX file
after adding new prefetch packfiles using the following command:

  git midx --write --update-head --delete-expired --pack-dir=

(Not a critique of this, just a (stupid) question)

What's the practical use-case for this feature? Since it doesn't help
with --abbrev=40 the speedup is all in the part that ensures we don't
show an ambiguous SHA-1.

The point of including the --abbrev=40 is to point out that object
lookups do not get slower with the MIDX feature. Using these "git log"
options is a good way to balance object lookups and abbreviations with
object parsing and diff machinery.[...]

[snip]


[...]And while the public data shape I shared did not show a
difference, our private testing of the Windows repository did show a
valuable improvement when isolating to object lookups and ignoring
abbreviation calculations.

Replying to this old thread since I see you're preparing the MIDX for
submission again and this seemed like the best venue.


You're really good at tracking new commits in my GitHub page ;)



Your WIP branch (github.com/git/derrickstolee/midx/upstream) still only
references the speedups in abbreviation calculations, but here you
allude to other improvements. It would be very nice to have some summary
of that in docs / commit messages when you submit this.


The new version is essentially a complete rewrite of the feature, since 
we learned a lot about how to add a new data store from the commit-graph 
series. The design document [1] refers to some of the immediate benefits 
and future benefits. Some of these future benefits were discussed at the 
contributor's summit [2].


[1] 
https://github.com/derrickstolee/git/blob/midx/upstream/Documentation/technical/midx.txt


[2] 
https://public-inbox.org/git/alpine.DEB.2.20.1803091557510.23109@alexmv-linux/

    Git Merge 2018 Contributor's summit notes (includes discussion of MIDX)



I've been meaning to get around to submitting something like I mentioned
in https://public-inbox.org/git/87efn0bkls@evledraar.gmail.com/
i.e. a way to expand the abbrev mode to not check disambiguations, which
would look something like:

 core.abbrev = 20
 core.validateAbbrev = false

Or:

 core.abbrev = +2
 core.validateAbbrev = false

So, using the example from the above-referenced e-mail, +2 would make
linux.git emit hashes of 14 characters, without any abbreviation
checking (just trusting statistics to work in your favor).
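To make "trusting in statistics" concrete, here is a back-of-the-envelope
birthday-bound sketch. The ~7 million object count for linux.git is an
assumption for illustration, not a measured figure:

```python
import math

def collision_probability(num_objects: int, hex_digits: int) -> float:
    """Birthday approximation: chance that at least two of num_objects
    share the same hex_digits-character prefix (16**hex_digits buckets)."""
    buckets = 16 ** hex_digits
    return 1.0 - math.exp(-num_objects * (num_objects - 1) / (2.0 * buckets))

# ~7M objects (assumed), 14 hex digits (the "+2" example above)
p = collision_probability(7_000_000, 14)
print(f"{p:.6f}")  # roughly 3e-4: unlikely, but not negligible
```

So with these assumptions, one abbreviation in a few thousand would be
ambiguous; each extra hex digit divides that probability by 16.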

As noted by JS in this thread, that wouldn't be acceptable for your
use-case, but there are plenty of people (including me) who'd appreciate
the speedup without being 100% sure we're emitting unambiguous hashes,
since that trade-off is better than time spent generating another index
on disk. So I see it as a complementary & orthogonal feature.

But with that implemented I wouldn't get any benefit from things that
use the MIDX that aren't abbreviations, so what are those?


The MIDX is built for handling many packfiles. As opposed to the 
commit-graph feature, your repo needs to be _really big_ to need the 
MIDX. Most repos just repack into one packfile on a regular basis.


One case for vanilla Git: we've heard from lots of customers disabling 
gc.auto in their build machines because they didn't want to wait for a 
repack/gc after a fetch and before a build. Then, they end up in a 
many-pack situation because they never scheduled time for that repack/gc.


For GVFS, we virtualize the repo, so objects can b

Re: [RFC PATCH 00/18] Multi-pack index (MIDX)

2018-06-06 Thread Ævar Arnfjörð Bjarmason


On Mon, Jan 08 2018, Derrick Stolee wrote:

> On 1/7/2018 5:42 PM, Ævar Arnfjörð Bjarmason wrote:
>>
>> On Sun, Jan 07 2018, Derrick Stolee jotted:
>>
>>>  git log --oneline --raw --parents
>>>
>>> Num Packs | Before MIDX | After MIDX |  Rel % | 1 pack %
>>> ----------+-------------+------------+--------+---------
>>>         1 |     35.64 s |    35.28 s |  -1.0% |   -1.0%
>>>        24 |     90.81 s |    40.06 s | -55.9% |  +12.4%
>>>       127 |    257.97 s |    42.25 s | -83.6% |  +18.6%
>>>
>>> The last column is the relative difference between the MIDX-enabled repo
>>> and the single-pack repo. The goal of the MIDX feature is to present the
>>> ODB as if it was fully repacked, so there is still room for improvement.
>>>
>>> Changing the command to
>>>
>>>  git log --oneline --raw --parents --abbrev=40
>>>
>>> has no observable difference (sub-1% change in all cases). This is likely
>>> due to the repack I used, which put commits and trees in a small number of
>>> packfiles, so the MRU cache works very well. On more naturally created
>>> lists of packfiles, there can be up to a 20% improvement on this command.
>>>
>>> We are using a version of this patch with an upcoming release of GVFS.
>>> This feature is particularly important in that space since GVFS performs
>>> a "prefetch" step that downloads a pack of commits and trees on a daily
>>> basis. These packfiles are placed in an alternate that is shared by all
>>> enlistments. Some users have 150+ packfiles and the MRU misses and
>>> abbreviation computations are significant. Now, GVFS manages the MIDX file
>>> after adding new prefetch packfiles using the following command:
>>>
>>>  git midx --write --update-head --delete-expired --pack-dir=
>>
>> (Not a critique of this, just a (stupid) question)
>>
>> What's the practical use-case for this feature? Since it doesn't help
>> with --abbrev=40 the speedup is all in the part that ensures we don't
>> show an ambiguous SHA-1.
>
> The point of including the --abbrev=40 is to point out that object
> lookups do not get slower with the MIDX feature. Using these "git log"
> options is a good way to balance object lookups and abbreviations with
> object parsing and diff machinery.[...]

[snip]

> [...]And while the public data shape I shared did not show a
> difference, our private testing of the Windows repository did show a
> valuable improvement when isolating to object lookups and ignoring
> abbreviation calculations.

Replying to this old thread since I see you're preparing the MIDX for
submission again and this seemed like the best venue.

Your WIP branch (github.com/git/derrickstolee/midx/upstream) still only
references the speedups in abbreviation calculations, but here you
allude to other improvements. It would be very nice to have some summary
of that in docs / commit messages when you submit this.

I've been meaning to get around to submitting something like I mentioned
in https://public-inbox.org/git/87efn0bkls@evledraar.gmail.com/
i.e. a way to expand the abbrev mode to not check disambiguations, which
would look something like:

core.abbrev = 20
core.validateAbbrev = false

Or:

core.abbrev = +2
core.validateAbbrev = false

So, using the example from the above-referenced e-mail, +2 would make
linux.git emit hashes of 14 characters, without any abbreviation
checking (just trusting statistics to work in your favor).

As noted by JS in this thread, that wouldn't be acceptable for your
use-case, but there are plenty of people (including me) who'd appreciate
the speedup without being 100% sure we're emitting unambiguous hashes,
since that trade-off is better than time spent generating another index
on disk. So I see it as a complementary & orthogonal feature.

But with that implemented I wouldn't get any benefit from things that
use the MIDX that aren't abbreviations, so what are those?

>> The reason we do that at all is because it makes for a prettier UI.
>
> We tried setting core.abbrev=40 on GVFS enlistments to speed up
> performance and the users rebelled against the hideous output. They
> would rather have slower speeds than long hashes.
>
>> Are there things that both want the pretty SHA-1 and also care about the
>> throughput? I'd have expected machine parsing to just use
>> --no-abbrev-commit.
>
> The --raw flag outputs blob hashes, so the --abbrev=40 covers all hashes.
>
>> If something cares about both throughput and e.g. is saving the
>> abbreviated SHA-1s isn't it better off picking some arbitrary size
>> (e.g. --abbrev=20), after all the default abbreviation is going to show
>> something as small as possible, which may soon become ambiguous after the
>> next commit.
>
> Unfortunately, with the way the abbreviation algorithms work, using
> --abbrev=20 will have similar performance problems because you still
> need to inspect all packfiles to ensure there isn't a collision in the
> first 20 hex characters.
>
> Thanks,
> -Stolee
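To illustrate the point quoted above about why even a fixed --abbrev=20
stays expensive: a prefix can only be proven unique by probing every pack's
sorted index, so the cost scales with the number of packs regardless of the
requested length. A toy model (hypothetical helper, not Git's actual code),
with each pack index modeled as a sorted list of full hex OIDs:

```python
import bisect

def prefix_is_unique(oid: str, length: int, pack_indexes: list[list[str]]) -> bool:
    """Return True iff oid[:length] matches no *other* object.
    Without a MIDX, every pack index must be consulted."""
    prefix = oid[:length]
    matches = 0
    for idx in pack_indexes:  # the per-pack cost being discussed
        pos = bisect.bisect_left(idx, prefix)
        while pos < len(idx) and idx[pos].startswith(prefix):
            matches += 1
            pos += 1
    return matches <= 1  # only the object itself

packs = [sorted(["aabb01", "aabb02"]), sorted(["aabb03", "ccdd00"])]
print(prefix_is_unique("aabb01", 4, packs))  # False: "aabb" is shared
print(prefix_is_unique("ccdd00", 4, packs))  # True
```

With one merged index (the MIDX), the same answer needs a single binary
search instead of one per pack.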


Re: [RFC PATCH 00/18] Multi-pack index (MIDX)

2018-01-10 Thread Martin Fick
On Wednesday, January 10, 2018 02:39:13 PM Derrick Stolee 
wrote:
> On 1/10/2018 1:25 PM, Martin Fick wrote:
> > On Sunday, January 07, 2018 01:14:41 PM Derrick Stolee
> > 
> > wrote:
> >> This RFC includes a new way to index the objects in
> >> multiple packs using one file, called the multi-pack
> >> index (MIDX).
> > 
> > ...
> > 
> >> The main goals of this RFC are:
> >> 
> >> * Determine interest in this feature.
> >> 
> >> * Find other use cases for the MIDX feature.
> > 
> > My interest in this feature would be to speed up fetches
> > when there is more than one large pack-file with many of
> > the same objects that are in other pack-files.   What
> > does your MIDX design do when it encounters multiple
> > copies of the same object in different pack files? 
> > Does it index them all, or does it keep a single copy?
> 
> The MIDX currently keeps only one reference to each
> object. Duplicates are dropped during writing. (See the
> care taken in commit 04/18 to avoid duplicates.) Since
> midx_sha1_compare() does not use anything other than the
> OID to order the objects, there is no decision being made
> about which pack is "better". The MIDX writes the first
> copy it finds and discards the others.

This would likely speed things up then, even if the chosen 
objects are suboptimal.

> It would not be difficult to include a check in
> midx_sha1_compare() to favor one packfile over another
> based on some measurement (size? mtime?). Since this
> would be a heuristic at best, I left it out of the
> current patch.

Yeah, I didn't know what heuristic to use either. I tended 
to think that the bigger pack-file would be valuable because 
it is more likely to share deltas with other objects in that 
pack, making them easier to send.  However, that is likely 
only true during clones or other large fetches, when we want 
most objects.  During small "update" fetches, the newer 
packs might be better?

I also thought that objects in alternates should be 
considered less valuable for my use case, however in the 
github fork use case, the alternates might be more valuable?

So yes heuristics, and I don't know what is best.  Perhaps 
some config options could be used to set heuristics like 
this.  Whatever the heuristics are, since they would be a 
part of the MIDX packing process it would be easy to change.  
This assumes that keeping only one copy in the index is the 
right thing.  The question would be, what if we need 
different heuristics for different operations?  Would it make 
sense to have multiple MIDX files covering the same packs 
then, one for fetch, one for merge...?

> > In our Gerrit instance (Gerrit uses jgit), we have
> > multiple copies of the linux kernel repos linked
> > together via the alternatives file mechanism.
> 
> GVFS also uses alternates for sharing packfiles across
> multiple copies of the repo. The MIDX is designed to
> cover all packfiles in the same directory, but is not
> designed to cover packfiles in multiple alternates;
> currently, each alternate would need its own MIDX file.
> Does that cause issues with your setup?

No, since the other large packfiles are all in other repos 
(alternates).  Is there a reason the MIDX would not want to 
cover the alternates?  If you don't, then you would seemingly 
lose any benefits of the MIDX when you have alternates in 
use.

...
> > It would be nice if this use case could be improved with
> > MIDX.  To do so, it seems that it would either require
> > that MIDX either only put "the best" version of an
> > object (i.e. pre-select which one to use), or include
> > the extra information to help make the selection
> > process of which copy to use (perhaps based on the
> > operation being performed) fast.
> 
> I'm not sure if there is sufficient value in storing
> multiple references to the same object stored in multiple
> packfiles. There could be value in carefully deciding
> which copy is "best" during the MIDX write, but during
> read is not a good time to make such a decision. It also
> increases the size of the file to store multiple copies.

Yes, I am not sure either, it would be good to have input 
from experts here.

> > This also leads me to ask, what other additional
> > information (bitmaps?) for other operations, besides
> > object location, might suddenly be valuable in an index
> > that potentially points to multiple copies of objects? 
> > Would such information be appropriate in MIDX, or would
> > it be better in another index?
> 
> For applications to bitmaps, it is probably best that we
> only include one copy of each object. Otherwise, we need
> to include extra bits in the bitmaps for those copies
> (when asking "is this object in the bitmap?").

Agreed.  Would the MIDX potentially pave the way to create 
multi-pack bitmaps?

> Thanks for the context with Gerrit's duplicate object
> problem. I'll try to incorporate it into the design
> document (commit 01/18) for the v1 patch.

FYI, this is not a typical Gerrit thing, it is 

Re: [RFC PATCH 00/18] Multi-pack index (MIDX)

2018-01-10 Thread Derrick Stolee

On 1/10/2018 1:25 PM, Martin Fick wrote:

On Sunday, January 07, 2018 01:14:41 PM Derrick Stolee
wrote:

This RFC includes a new way to index the objects in
multiple packs using one file, called the multi-pack
index (MIDX).

...

The main goals of this RFC are:

* Determine interest in this feature.

* Find other use cases for the MIDX feature.

My interest in this feature would be to speed up fetches
when there is more than one large pack-file with many of the
same objects that are in other pack-files.   What does your
MIDX design do when it encounters multiple copies of the
same object in different pack files?  Does it index them all,
or does it keep a single copy?


The MIDX currently keeps only one reference to each object. Duplicates 
are dropped during writing. (See the care taken in commit 04/18 to avoid 
duplicates.) Since midx_sha1_compare() does not use anything other than 
the OID to order the objects, there is no decision being made about 
which pack is "better". The MIDX writes the first copy it finds and 
discards the others.
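The duplicate-dropping described above can be sketched like this (a toy
model in Python, not the actual C implementation; the names and data shapes
are illustrative):

```python
def merge_pack_entries(packs: dict[str, list[str]]) -> list[tuple[str, str]]:
    """Sketch of MIDX-style dedup: gather (oid, pack) pairs, sort by OID
    alone (like midx_sha1_compare), and keep the first copy of each OID."""
    entries = [(oid, pack) for pack, oids in packs.items() for oid in oids]
    entries.sort(key=lambda e: e[0])  # no notion of a "better" pack
    result, seen = [], set()
    for oid, pack in entries:
        if oid not in seen:   # later copies of the same OID are discarded
            seen.add(oid)
            result.append((oid, pack))
    return result

packs = {"pack-a": ["111", "333"], "pack-b": ["111", "222"]}
print(merge_pack_entries(packs))
# -> [('111', 'pack-a'), ('222', 'pack-b'), ('333', 'pack-a')]
```

Because the sort key ignores the pack, which duplicate survives is an
accident of input order; a pack-preference heuristic would slot in as a
tiebreaker in that sort key.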


It would not be difficult to include a check in midx_sha1_compare() to 
favor one packfile over another based on some measurement (size? 
mtime?). Since this would be a heuristic at best, I left it out of the 
current patch.



In our Gerrit instance (Gerrit uses jgit), we have multiple
copies of the linux kernel repos linked together via the
alternatives file mechanism.


GVFS also uses alternates for sharing packfiles across multiple copies 
of the repo. The MIDX is designed to cover all packfiles in the same 
directory, but is not designed to cover packfiles in multiple 
alternates; currently, each alternate would need its own MIDX file. Does 
that cause issues with your setup?



   These repos have many different
references (mostly Gerrit change references), but they share
most of the common objects from the mainline.  I have found
that during a large fetch such as a clone, jgit spends a
significant amount of extra time by having the extra large
pack-files from the other repos visible to it, usually around
an extra minute per instance of these (without them, the
clone takes around 7 mins).  This adds up easily; with a few 
extra repos, it can almost double the time.

My investigations have shown that this is due to jgit
searching each of these pack files to decide which version of
each object to send.  I don't fully understand its selection
criteria, however if I shortcut it to just pick the first
copy of an object that it finds, I regain my lost time.  I
don't know if git suffers from a similar problem?  If git
doesn't suffer from this then it likely just uses the first
copy of an object it finds (which may not be the best object
to send?)

It would be nice if this use case could be improved with
MIDX.  To do so, it seems that it would either require that
MIDX either only put "the best" version of an object (i.e.
pre-select which one to use), or include the extra
information to help make the selection process of which copy
to use (perhaps based on the operation being performed)
fast.


I'm not sure if there is sufficient value in storing multiple references 
to the same object stored in multiple packfiles. There could be value in 
carefully deciding which copy is "best" during the MIDX write, but 
during read is not a good time to make such a decision. It also 
increases the size of the file to store multiple copies.



This also leads me to ask, what other additional information
(bitmaps?) for other operations, besides object location,
might suddenly be valuable in an index that potentially
points to multiple copies of objects?  Would such
information be appropriate in MIDX, or would it be better in
another index?


For applications to bitmaps, it is probably best that we only include 
one copy of each object. Otherwise, we need to include extra bits in the 
bitmaps for those copies (when asking "is this object in the bitmap?").


Thanks for the context with Gerrit's duplicate object problem. I'll try 
to incorporate it in to the design document (commit 01/18) for the v1 patch.


Thanks,
-Stolee



Re: [RFC PATCH 00/18] Multi-pack index (MIDX)

2018-01-10 Thread Martin Fick
On Sunday, January 07, 2018 01:14:41 PM Derrick Stolee 
wrote:
> This RFC includes a new way to index the objects in
> multiple packs using one file, called the multi-pack
> index (MIDX).
...
> The main goals of this RFC are:
> 
> * Determine interest in this feature.
> 
> * Find other use cases for the MIDX feature.

My interest in this feature would be to speed up fetches 
when there is more than one large pack-file with many of the 
same objects that are in other pack-files.   What does your 
MIDX design do when it encounters multiple copies of the 
same object in different pack files?  Does it index them all, 
or does it keep a single copy?

In our Gerrit instance (Gerrit uses jgit), we have multiple 
copies of the linux kernel repos linked together via the 
alternatives file mechanism.  These repos have many different 
references (mostly Gerrit change references), but they share 
most of the common objects from the mainline.  I have found 
that during a large fetch such as a clone, jgit spends a 
significant amount of extra time by having the extra large 
pack-files from the other repos visible to it, usually around 
an extra minute per instance of these (without them, the 
clone takes around 7 mins).  This adds up easily; with a few 
extra repos, it can almost double the time.

My investigations have shown that this is due to jgit 
searching each of these pack files to decide which version of 
each object to send.  I don't fully understand its selection 
criteria, however if I shortcut it to just pick the first 
copy of an object that it finds, I regain my lost time.  I 
don't know if git suffers from a similar problem?  If git 
doesn't suffer from this then it likely just uses the first 
copy of an object it finds (which may not be the best object 
to send?)

It would be nice if this use case could be improved with 
MIDX.  To do so, it seems that it would either require that 
MIDX either only put "the best" version of an object (i.e. 
pre-select which one to use), or include the extra 
information to help make the selection process of which copy 
to use (perhaps based on the operation being performed) 
fast.

This also leads me to ask, what other additional information 
(bitmaps?) for other operations, besides object location, 
might suddenly be valuable in an index that potentially 
points to multiple copies of objects?  Would such 
information be appropriate in MIDX, or would it be better in 
another index?

Thanks,

-Martin

-- 
The Qualcomm Innovation Center, Inc. is a member of Code 
Aurora Forum, hosted by The Linux Foundation



Re: [RFC PATCH 00/18] Multi-pack index (MIDX)

2018-01-10 Thread Johannes Schindelin
Hi Stefan,

On Tue, 9 Jan 2018, Stefan Beller wrote:

> On Tue, Jan 9, 2018 at 5:05 AM, Johannes Schindelin
>  wrote:
>
> > Just to throw this out there: --abbrev=8! would be one possible
> > convention to state "I want exactly 8 hex digits, don't bother
> > checking for uniqueness".
> >
> > Not sure how urgent such a feature is.
> 
> You seem to have spent some time on rebase, including lamenting on the
> in-expressiveness of the instruction sheet (c.f. "rebase a mergy history
> sucks")

Let's call it "todo list" again?

And yes, I spend a *lot* of time in rebase. It's part of my duties as Git
for Windows maintainer.

> And in that light, I'd like to propose a new naming scheme:
> 
> (a) assume that we "tag" HEAD at the start of the rebase
> (b) any abbreviation must be given as committish anchored to said ref:
> 
> pick REBASE_HEAD~1 commit subject
> pick REBASE_HEAD~2 distak the gostim
> pick REBASE_HEAD~3 Document foo
> pick REBASE_HEAD~4 Testing the star-gazer

I do not necessarily agree that this is better because it is valid only
locally, only in the current rebase (and you enter new problems because
you implicitly require REBASE_HEAD to be worktree-local lest you prevent
simultaneous rebases in related worktrees).

That prevents you from, say, chatting to your colleague and mentioning
that commit that you are uncertain about. Imagine asking "Hey Todd, do you
know what REBASE_HEAD~2 is all about?". Granted, the same could go wrong
if you ask about "deadbeef" and that shortened name just so happens to be
ambiguous in Todd's repository. But that is pretty unlikely, isn't it?

> And as we have the full descriptiveness of the committishs there, each
> commit can be described in a unique way using the graph relationship.  I
> am just throwing the name REBASE_HEAD out there to trigger some
> associations ("similar to FETCH_HEAD"), but I dislike the name.

Not only that. It would also be confusing to read after reordering...

> (c) this would not solve the problem of mergy history, yet. For that
> we'd need to introduce more keywords, that allow us to move around in
> the DAG, [...]

You probably missed my hint that I have a working solution for that.
Please have a look at

https://github.com/git/git/pull/447

Short version: for a commit topology like this:

A - B - C
  \   /
    D

the generated list would look like this:

# branch D
pick 0123 A
label branch-point
pick 1234 D
label D

reset branch-point
pick 2345 B
merge 3456 D C

Ciao,
Dscho


Re: [RFC PATCH 00/18] Multi-pack index (MIDX)

2018-01-10 Thread Jeff King
On Tue, Jan 09, 2018 at 02:05:09PM +0100, Johannes Schindelin wrote:

> > I think that could be easily worked around for rebase by asking git to
> > check ambiguity during the conversion.
> 
> Sure.
> 
> It also points to a flaw in your reasoning, and you should take my example
> further: previously, we guaranteed unique abbreviations, and who is to say
> that there is no script out there relying on that guarantee? You?

I'm not sure what you think my reasoning was, since I made the same
point directly after. ;)

> > I am a bit curious if there's a bounded probability that people would
> > find acceptable for Git to give an ambiguous abbreviation. We already
> > accept 1 in 2^160, of course. But would, e.g., one in a million be OK?
> 
> What is that going to solve?

I'm not sure I understand your question. We can bound the probability of
a collision by increasing the size of the abbreviation. My question was
whether there is a size where that tradeoff makes sense. So it "solves"
the problem of worrying about collisions.

> I think a better alternative would be to introduce a new abbreviation mode
> that is *intended* to stop caring about unique abbreviations.
> 
> In web interfaces, for example, it makes tons of sense to show, say, 8
> digits in link texts and have the full name in the actual link URL.

I'm not sure if that would be useful option or not. I certainly agree
that static abbreviations are a useful thing to want. But for scripted
callers, it's usually easy to cut down the abbreviation yourself. I
think your web interface example is a good one. The caller will ask Git
for the full 40-hex hash, and then cut it down itself when generating
the link (and I just so happen to have worked on a web interface that
does this[1]).

So would it be useful for humans to use? That's what I'm not sure about.
It seems cumbersome to have to add "--abbrev=8!" on the command line to
each invocation. But if it were triggered by config, it seems like we
would risk breaking scripts.

> I am just hesitant to change things that would break existing setups.

Me too. I'm not sure if I'm seriously advocating the "bound the
probability" thing. It's just something I think is worth pondering a
little.

-Peff

[1] Actually, there _is_ something that it's useful for Git to tell the
caller, which is the approximate number of objects in the
repository. Eight digits is not enough for the kernel, for example,
and these days we use 12 digits by default. I'm not sure if such a
web interface would rather ask for "%H %h" and generate the link
text from the second half, or it would rather just get a ballpark on
the number of objects, and do the simple abbreviation math itself
(right now GitHub does not do either; I don't know about other
interfaces).
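The "simple abbreviation math" mentioned above can be sketched as a
birthday-style bound: pick the smallest hex length k with 16^k >= n^2, so
the expected number of colliding prefixes among n objects stays below ~1.
This is an illustrative heuristic, not Git's exact internal formula:

```python
def suggested_abbrev(num_objects: int) -> int:
    """Smallest hex length k with 16**k >= num_objects**2, i.e. expected
    prefix collisions below ~1 (birthday bound). Illustrative only; not
    Git's actual default-abbreviation computation."""
    k = 1
    while 16 ** k < num_objects ** 2:
        k += 1
    return k

# Kernel-scale repo (~6M objects, an assumed round number):
print(suggested_abbrev(6_000_000))  # -> 12, matching the "12 digits" above
```

This is the ballpark a web interface could compute itself from an object
count, as suggested in the footnote.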


Re: [RFC PATCH 00/18] Multi-pack index (MIDX)

2018-01-09 Thread Junio C Hamano
Stefan Beller  writes:

> Johannes wrote:
>> I think a better alternative would be to introduce a new abbreviation mode
>> that is *intended* to stop caring about unique abbreviations.
>>
>> In web interfaces, for example, it makes tons of sense to show, say, 8
>> digits in link texts and have the full name in the actual link URL.
>
> And that is what (b) would solve, as it is shorter than the full hash and
> yet exact.

I still do not get it, even though I fully agree that in Web UI what
Dscho envisions makes tons of sense.  Use some short handle that
does not need to be unique inside repository to display, but have a
full information that can be used by machines.  The shortened ones
need to be unique _within_ a given todo list, to be displayed as
text to be "clicked", where the A element's href attribute that
surrounds that "clickable" text has fully unambiguous information.

And that fully unambiguous information, because it is for machine
consumption, can be a full object name without any shortening.

I do not see a need for REBASE_HEAD~$n to make it less robust
(i.e. we now need to worry about making sure it is not lost or moved
while we need it and clean it up when we are done, etc.)


Re: [RFC PATCH 00/18] Multi-pack index (MIDX)

2018-01-09 Thread Stefan Beller
On Tue, Jan 9, 2018 at 12:12 PM, Junio C Hamano  wrote:
> Stefan Beller  writes:
>
>> And in that light, I'd like to propose a new naming scheme:
>>
>> (a) assume that we "tag" HEAD at the start of the rebase
>> (b) any abbreviation must be given as committish anchored to said ref:
>>
>> pick REBASE_HEAD~1 commit subject
>> pick REBASE_HEAD~2 distak the gostim
>> pick REBASE_HEAD~3 Document foo
>> pick REBASE_HEAD~4 Testing the star-gazer
>>
>> And as we have the full descriptiveness of the committishs there,
>> each commit can be described in a unique way using the graph relationship.
>> I am just throwing the name REBASE_HEAD out there to trigger some 
>> associations
>> ("similar to FETCH_HEAD"), but I dislike the name.
>>
>> (c) this would not solve the problem of mergy history, yet. For that we'd 
>> need
>> to introduce more keywords, that allow us to move around in the DAG,
>> such that we can reset to a specific revision or name revisions on the 
>> fly.
>> So maybe all we need is "reset", "name" (== short lived tag),
>> "merge" (rewrite parents of HEAD) to be expressive enough, but still keep
>> the line oriented interface?
>
> It is correct to say that (c) is an issue that is not solved by (b),
> but with the current scheme, the commits are internally referenced
> by full object names, and just before it is presented to the end
> users, these names are abbreviated down to unique prefix.  The
> machinery expands these abbreviated names back to the full names
> before going forward, so it is not an issue that we are creating new
> commits during the rebase.
>
> Does it make it easier to read if we used ~$n notation based on a
> fixed point, instead of shortened unique object names?  What
> improvement is (b) trying to achieve?
>

Johannes wrote:
> I think a better alternative would be to introduce a new abbreviation mode
> that is *intended* to stop caring about unique abbreviations.
>
> In web interfaces, for example, it makes tons of sense to show, say, 8
> digits in link texts and have the full name in the actual link URL.

And that is what (b) would solve, as it is shorter than the full hash and
yet exact.

(c) was mostly speculation on top of (b) if we can take it any further.


Re: [RFC PATCH 00/18] Multi-pack index (MIDX)

2018-01-09 Thread Junio C Hamano
Stefan Beller  writes:

> And in that light, I'd like to propose a new naming scheme:
>
> (a) assume that we "tag" HEAD at the start of the rebase
> (b) any abbreviation must be given as committish anchored to said ref:
>
> pick REBASE_HEAD~1 commit subject
> pick REBASE_HEAD~2 distak the gostim
> pick REBASE_HEAD~3 Document foo
> pick REBASE_HEAD~4 Testing the star-gazer
>
> And as we have the full descriptiveness of the committishs there,
> each commit can be described in a unique way using the graph relationship.
> I am just throwing the name REBASE_HEAD out there to trigger some associations
> ("similar to FETCH_HEAD"), but I dislike the name.
>
> (c) this would not solve the problem of mergy history, yet. For that we'd need
> to introduce more keywords, that allow us to move around in the DAG,
> such that we can reset to a specific revision or name revisions on the 
> fly.
> So maybe all we need is "reset", "name" (== short lived tag),
> "merge" (rewrite parents of HEAD) to be expressive enough, but still keep
> the line oriented interface?

It is correct to say that (c) is an issue that is not solved by (b),
but with the current scheme, the commits are internally referenced
by full object names, and just before it is presented to the end
users, these names are abbreviated down to unique prefix.  The
machinery expands these abbreviated names back to the full names
before going forward, so it is not an issue that we are creating new
commits during the rebase.

Does it make it easier to read if we used ~$n notation based on a
fixed point, instead of shortened unique object names?  What
improvement is (b) trying to achieve?





Re: [RFC PATCH 00/18] Multi-pack index (MIDX)

2018-01-09 Thread Stefan Beller
On Tue, Jan 9, 2018 at 5:05 AM, Johannes Schindelin
 wrote:
> Hi Peff,
>
> On Tue, 9 Jan 2018, Jeff King wrote:
>
>> On Mon, Jan 08, 2018 at 02:43:00PM +0100, Johannes Schindelin wrote:
>>
>> > Take the interactive rebase for example. It generates todo lists with
>> > abbreviated commit names, for readability (and it is *really* important to
>> > keep this readable). As we expect new objects to be introduced by the
>> > interactive rebase, we convert that todo list to unabbreviated commit
>> > names before executing the interactive rebase.
>> >
>> > Your idea (to not care about unambiguous abbreviations) would break that.
>>
>> I think that could be easily worked around for rebase by asking git to
>> check ambiguity during the conversion.
>
> Sure.
>
> It also points to a flaw in your reasoning, and you should take my example
> further: previously, we guaranteed unique abbreviations, and who is to say
> that there is no script out there relying on that guarantee? You?
>
>> But I agree it's a potential problem for other scripts that we might not
>> have control over. I hadn't really intended this to be the default
>> behavior (my patch was just trying to show the direction). But it does
>> make for a pretty awful interface if callers have to opt into it
>> manually ("git log --oneline --no-really-go-fast").
>
> Exactly.
>
>> I am a bit curious if there's a bounded probability that people would
>> find acceptable for Git to give an ambiguous abbreviation. We already
>> accept 1 in 2^160, of course. But would, e.g., one in a million be OK?
>
> What is that going to solve?
>
> I think a better alternative would be to introduce a new abbreviation mode
> that is *intended* to stop caring about unique abbreviations.
>
> In web interfaces, for example, it makes tons of sense to show, say, 8
> digits in link texts and have the full name in the actual link URL.
>
> Currently, we have no way to tell Git: look, I want to have a short label
> for this, but I do not really care that this is unambiguous, I just want
> something to talk about.
>
> (And you are correct, of course, that the uniqueness of abbreviations will
> no longer be guaranteed for partial clones, and it is already not
> guaranteed for shallow clones, and it will only be possible to guarantee
> this, ever, for one particular repository at a particular time.)
>
> I am just hesitant to change things that would break existing setups.
>
> Just to throw this out there: --abbrev=8! would be one possible convention
> to state "I want exactly 8 hex digits, don't bother checking for
> uniqueness".
>
> Not sure how urgent such a feature is.

You seem to have spent some time on rebase, including lamenting the
inexpressiveness of the instruction sheet (cf. "rebase a mergy history sucks").

And in that light, I'd like to propose a new naming scheme:

(a) assume that we "tag" HEAD at the start of the rebase
(b) any abbreviation must be given as committish anchored to said ref:

pick REBASE_HEAD~1 commit subject
pick REBASE_HEAD~2 distak the gostim
pick REBASE_HEAD~3 Document foo
pick REBASE_HEAD~4 Testing the star-gazer

And as we have the full descriptiveness of the committishes there,
each commit can be described in a unique way using the graph relationship.
I am just throwing the name REBASE_HEAD out there to trigger some associations
("similar to FETCH_HEAD"), but I dislike the name.

(c) this would not solve the problem of mergy history, yet. For that we'd need
to introduce more keywords that allow us to move around in the DAG,
such that we can reset to a specific revision or name revisions on the fly.
So maybe all we need is "reset", "name" (== short lived tag),
"merge" (rewrite parents of HEAD) to be expressive enough, but still keep
the line oriented interface?

Stefan


Re: [RFC PATCH 00/18] Multi-pack index (MIDX)

2018-01-09 Thread Johannes Schindelin
Hi Peff,

On Tue, 9 Jan 2018, Jeff King wrote:

> On Mon, Jan 08, 2018 at 02:43:00PM +0100, Johannes Schindelin wrote:
> 
> > Take the interactive rebase for example. It generates todo lists with
> > abbreviated commit names, for readability (and it is *really* important to
> > keep this readable). As we expect new objects to be introduced by the
> > interactive rebase, we convert that todo list to unabbreviated commit
> > names before executing the interactive rebase.
> > 
> > Your idea (to not care about unambiguous abbreviations) would break that.
> 
> I think that could be easily worked around for rebase by asking git to
> check ambiguity during the conversion.

Sure.

It also points to a flaw in your reasoning, and you should take my example
further: previously, we guaranteed unique abbreviations, and who is to say
that there is no script out there relying on that guarantee? You?

> But I agree it's a potential problem for other scripts that we might not
> have control over. I hadn't really intended this to be the default
> behavior (my patch was just trying to show the direction). But it does
> make for a pretty awful interface if callers have to opt into it
> manually ("git log --oneline --no-really-go-fast").

Exactly.

> I am a bit curious if there's a bounded probability that people would
> find acceptable for Git to give an ambiguous abbreviation. We already
> accept 1 in 2^160, of course. But would, e.g., one in a million be OK?

What is that going to solve?

I think a better alternative would be to introduce a new abbreviation mode
that is *intended* to stop caring about unique abbreviations.

In web interfaces, for example, it makes tons of sense to show, say, 8
digits in link texts and have the full name in the actual link URL.

Currently, we have no way to tell Git: look, I want to have a short label
for this, but I do not really care that this is unambiguous, I just want
something to talk about.

(And you are correct, of course, that the uniqueness of abbreviations will
no longer be guaranteed for partial clones, and it is already not
guaranteed for shallow clones, and it will only be possible to guarantee
this, ever, for one particular repository at a particular time.)

I am just hesitant to change things that would break existing setups.

Just to throw this out there: --abbrev=8! would be one possible convention
to state "I want exactly 8 hex digits, don't bother checking for
uniqueness".

Not sure how urgent such a feature is.
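In outline, the suggested "--abbrev=8!" convention might parse like this. This is a hypothetical sketch following the mail, not git's code: `struct abbrev_opt` and `parse_abbrev` are made-up names, and no such flag exists in git.

```c
#include <stdlib.h>

/* Hypothetical parser for the suggested "--abbrev=8!" convention: a
 * trailing '!' means "exactly this many hex digits, skip the uniqueness
 * check". Pure speculation; git has no such flag. */
struct abbrev_opt {
	int len;
	int check_unique;	/* 0 when the '!' suffix was given */
};

static int parse_abbrev(const char *arg, struct abbrev_opt *opt)
{
	char *end;
	long len = strtol(arg, &end, 10);

	if (end == arg || len < 0)
		return -1;	/* not a non-negative number */
	opt->check_unique = 1;
	if (*end == '!') {
		opt->check_unique = 0;
		end++;
	}
	if (*end)
		return -1;	/* trailing garbage */
	opt->len = (int)len;
	return 0;
}
```

So "8!" would yield length 8 with the uniqueness check disabled, while a plain "8" keeps today's behavior.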

Ciao,
Dscho


Re: [RFC PATCH 00/18] Multi-pack index (MIDX)

2018-01-08 Thread Jeff King
On Mon, Jan 08, 2018 at 08:43:44AM -0500, Derrick Stolee wrote:

> > Just to make sure I'm parsing this correctly: normal lookups do get faster
> > when you have a single index, given the right setup?
> > 
> > I'm curious what that setup looked like. Is it just tons and tons of
> > packs? Is it ones where the packs do not follow the mru patterns very
> > well?
> 
> The way I repacked the Linux repo creates an artificially good set of packs
> for the MRU cache. When the packfiles are partitioned instead by the time
> the objects were pushed to a remote, the MRU cache performs poorly.
> Improving these object lookups is a primary reason for the MIDX feature,
> and almost all commands improve because of it. 'git log' is just the
> simplest to use for demonstration.

Interesting.

The idea of the pack mru (and the "last pack looked in" before it) is
that there would be good locality for time-segmented packs (like those
generated by pushes), since most operations also tend to visit history
in chronological-ish order (e.g., log). Tree-wide operations would be an
exception there, though, since files would have many ages across the
tree (though in practice, one would hope that pretty-ancient history
would eventually be replaced in a modern tree).

I've often wondered, though, if the mru (and "last pack") work mainly
because the "normal" distribution for a repository is to have one big
pack with most of history, and then a couple of smaller ones (what
hasn't been packed yet). Even something as naive as "look in the last
pack" works there, because it turns into "look in the biggest pack".  If
it has most of the objects, it's the most likely place for the previous
and the next objects to be.
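The pack-MRU idea described above can be sketched in miniature: probe the most recently successful pack first, and promote a pack to the front of the list on a hit. All names here (`struct pack`, `pack_contains`, `find_in_packs`) are illustrative stand-ins, not git's internals.

```c
#include <string.h>

/* Each pack stands in for a sorted .idx; for this sketch a pack "contains"
 * an object when its name matches exactly. */
struct pack {
	const char *name;
};

#define NPACKS 3
static struct pack *mru[NPACKS];

static int pack_contains(struct pack *p, const char *oid)
{
	/* stand-in for a binary search of the pack's index */
	return strcmp(p->name, oid) == 0;
}

static struct pack *find_in_packs(const char *oid)
{
	int i;

	for (i = 0; i < NPACKS; i++) {
		if (pack_contains(mru[i], oid)) {
			struct pack *hit = mru[i];
			/* shift earlier entries down, put the hit first */
			memmove(&mru[1], &mru[0], i * sizeof(*mru));
			mru[0] = hit;
			return hit;
		}
	}
	return NULL;
}
```

With one big pack holding most objects, even this simple promotion keeps the common case at one probe, which is the effect described above.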

But from my understanding of how you handle the Windows repository, you
have tons of equal-sized packs that are never coalesced. Which is quite
a different pattern.

I would expect your "--max-pack-size" thing to end up with something
roughly segmented like pushes, though, just because we do order the
write phase in reverse chronological order. Well, mostly. We do each
object type in its own chunks, and there's some reordering based on
deltas. So given the sizes, it's likely that most of your trees all end
up in a few packs.

> > There may be other reasons to want MIDX or something like it, but I just
> > wonder if we can do this much simpler thing to cover the abbreviation
> > case. I guess the question is whether somebody is going to be annoyed in
> > the off chance that they hit a collision.
> 
> Not only are users going to be annoyed when they hit collisions after
> copy-pasting an abbreviated hash, there are also a large number of tools
> that people build that use abbreviated hashes (either for presenting to
> users or because they didn't turn off abbreviations).
>
> Abbreviations cause performance issues in other commands, too (like
> 'fetch'!), so whatever short-circuit you put in, it would need to be global.
> A flag on one builtin would not suffice.

Yeah, I agree it's potentially problematic for a lot of reasons. I was
just hoping we could bound the probabilities in a way that is
acceptable. Maybe it's a fool's errand.

-Peff


Re: [RFC PATCH 00/18] Multi-pack index (MIDX)

2018-01-08 Thread Jeff King
On Mon, Jan 08, 2018 at 02:43:00PM +0100, Johannes Schindelin wrote:

> Take the interactive rebase for example. It generates todo lists with
> abbreviated commit names, for readability (and it is *really* important to
> keep this readable). As we expect new objects to be introduced by the
> interactive rebase, we convert that todo list to unabbreviated commit
> names before executing the interactive rebase.
> 
> Your idea (to not care about unambiguous abbreviations) would break that.

I think that could be easily worked around for rebase by asking git to
check ambiguity during the conversion. Speed is much less of a problem
there, because we're doing a relatively small number of abbreviations
(compared to "git log --oneline --raw", which is abbreviating almost
every object in the repository).

But I agree it's a potential problem for other scripts that we might not
have control over. I hadn't really intended this to be the default
behavior (my patch was just trying to show the direction). But it does
make for a pretty awful interface if callers have to opt into it
manually ("git log --oneline --no-really-go-fast").

I am a bit curious if there's a bounded probability that people would
find acceptable for Git to give an ambiguous abbreviation. We already
accept 1 in 2^160, of course. But would, e.g., one in a million be OK?

-Peff


Re: [RFC PATCH 00/18] Multi-pack index (MIDX)

2018-01-08 Thread Derrick Stolee

On 1/8/2018 5:20 AM, Jeff King wrote:

On Sun, Jan 07, 2018 at 07:08:54PM -0500, Derrick Stolee wrote:


(Not a critique of this, just a (stupid) question)

What's the practical use-case for this feature? Since it doesn't help
with --abbrev=40 the speedup is all in the part that ensures we don't
show an ambiguous SHA-1.

The point of including the --abbrev=40 is to point out that object lookups
do not get slower with the MIDX feature. Using these "git log" options is a
good way to balance object lookups and abbreviations with object parsing and
diff machinery. And while the public data shape I shared did not show a
difference, our private testing of the Windows repository did show a
valuable improvement when isolating to object lookups and ignoring
abbreviation calculations.

Just to make sure I'm parsing this correctly: normal lookups do get faster
when you have a single index, given the right setup?

I'm curious what that setup looked like. Is it just tons and tons of
packs? Is it ones where the packs do not follow the mru patterns very
well?


The way I repacked the Linux repo creates an artificially good set of 
packs for the MRU cache. When the packfiles are partitioned instead by 
the time the objects were pushed to a remote, the MRU cache performs 
poorly. Improving these object lookups is a primary reason for the MIDX 
feature, and almost all commands improve because of it. 'git log' is 
just the simplest to use for demonstration.
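The core of the MIDX idea can be sketched as a merge of the packs' sorted object lists into one sorted table, so a lookup is a single binary search instead of one search per pack. This is purely illustrative (the real MIDX file also stores fanout tables, pack offsets, and more); `struct midx_entry`, `build_midx`, and `midx_lookup` are made-up names.

```c
#include <stdlib.h>
#include <string.h>

struct midx_entry {
	const char *oid;	/* hex object name */
	int pack;		/* which pack holds it */
};

static int by_oid(const void *a, const void *b)
{
	return strcmp(((const struct midx_entry *)a)->oid,
		      ((const struct midx_entry *)b)->oid);
}

/* flatten every per-pack list into one array and sort it once */
static int build_midx(const char **packs[], const int *nr, int npacks,
		      struct midx_entry *out)
{
	int n = 0, p, i;

	for (p = 0; p < npacks; p++)
		for (i = 0; i < nr[p]; i++) {
			out[n].oid = packs[p][i];
			out[n].pack = p;
			n++;
		}
	qsort(out, n, sizeof(*out), by_oid);
	return n;
}

/* one binary search over all packs, instead of npacks searches */
static const struct midx_entry *midx_lookup(const struct midx_entry *tab,
					    int n, const char *oid)
{
	struct midx_entry key = { oid, 0 };

	return bsearch(&key, tab, n, sizeof(*tab), by_oid);
}
```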



I think it's worth thinking a bit about, because...


If something cares about both throughput and e.g. is saving the
abbreviated SHA-1s isn't it better off picking some arbitrary size
(e.g. --abbrev=20), after all the default abbreviation is going to show
something as small as possible, which may soon become ambiguous after the
next commit.

Unfortunately, with the way the abbreviation algorithms work, using
--abbrev=20 will have similar performance problems because you still need to
inspect all packfiles to ensure there isn't a collision in the first 20 hex
characters.
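The reason every packfile must be inspected can be shown in miniature: each pack has its own sorted index, so the shortest unique prefix of an object name is bounded by its longest shared prefix with entries in *every* index, and skipping a pack risks handing out an ambiguous name. This sketch scans linearly for clarity, where the real code binary-searches each index for the neighbors; the names are made up.

```c
#include <string.h>

static int common_hex_prefix(const char *a, const char *b)
{
	int n = 0;

	while (a[n] && a[n] == b[n])
		n++;
	return n;
}

/* digits needed for `oid` to be unambiguous within one pack's index */
static int needed_len_in_pack(const char *oid, const char **idx, int nr)
{
	int need = 1, i;

	for (i = 0; i < nr; i++) {
		int c;

		if (!strcmp(idx[i], oid))
			continue;	/* the object itself is not a conflict */
		c = common_hex_prefix(oid, idx[i]);
		if (c + 1 > need)
			need = c + 1;
	}
	return need;
}

/* across packs: take the maximum -- every index must be consulted */
static int needed_len(const char *oid, const char ***packs,
		      const int *nr, int npacks)
{
	int need = 1, p;

	for (p = 0; p < npacks; p++) {
		int n = needed_len_in_pack(oid, packs[p], nr[p]);

		if (n > need)
			need = n;
	}
	return need;
}
```

Note that the answer for a fixed `--abbrev=20` still requires the same per-pack neighbor checks; only the final clamping differs.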

...if what we primarily care about speeding up is abbreviations, is it
crazy to consider disabling the disambiguation step entirely?

The results of find_unique_abbrev are already a bit of a probability
game. They're guaranteed at the moment of generation, but as more
objects are added, ambiguities may be introduced. Likewise, what's
unambiguous for you may not be for somebody else you're communicating
with, if they have their own clone.

Since we now scale the default abbreviation with the size of the repo,
that gives us a bounded and pretty reasonable probability that we won't
hit a collision at all[1].

I.e., what if we did something like this:

diff --git a/sha1_name.c b/sha1_name.c
index 611c7d24dd..04c661ba85 100644
--- a/sha1_name.c
+++ b/sha1_name.c
@@ -600,6 +600,15 @@ int find_unique_abbrev_r(char *hex, const unsigned char *sha1, int len)
 	if (len == GIT_SHA1_HEXSZ || !len)
 		return GIT_SHA1_HEXSZ;
 
+	/*
+	 * A default length of 10 implies a repository big enough that it's
+	 * getting expensive to double check the ambiguity of each object,
+	 * and the chance that any particular object of interest has a
+	 * collision is low.
+	 */
+	if (len >= 10)
+		return len;
+
 	mad.init_len = len;
 	mad.cur_len = len;
 	mad.hex = hex;

If I repack my linux.git with "--max-pack-size=64m", I get 67 packs. The
patch above drops "git log --oneline --raw" on the resulting repo from
~150s to ~30s.

With a single pack, it goes from ~33s to ~29s. Less impressive, but there's
still some benefit.

There may be other reasons to want MIDX or something like it, but I just
wonder if we can do this much simpler thing to cover the abbreviation
case. I guess the question is whether somebody is going to be annoyed in
the off chance that they hit a collision.


Not only are users going to be annoyed when they hit collisions after 
copy-pasting an abbreviated hash, there are also a large number of tools 
that people build that use abbreviated hashes (either for presenting to 
users or because they didn't turn off abbreviations).


Abbreviations cause performance issues in other commands, too (like 
'fetch'!), so whatever short-circuit you put in, it would need to be 
global. A flag on one builtin would not suffice.



-Peff

[1] I'd have to check the numbers, but IIRC we've set the scaling so
    that the chance of having a _single_ collision in the repository is
    less than 50%, and rounding to the conservative side (since each hex
    char gives us 4 bits). And indeed, "git log --oneline --raw" on
    linux.git does not seem to have any collisions at its default of 12
    characters, at least in my clone.

    We could also consider switching core.disambiguate to "commit",
    which makes even a collision less likely to annoy the user.





Re: [RFC PATCH 00/18] Multi-pack index (MIDX)

2018-01-08 Thread Johannes Schindelin
Hi Peff,

On Mon, 8 Jan 2018, Jeff King wrote:

> On Sun, Jan 07, 2018 at 07:08:54PM -0500, Derrick Stolee wrote:
> 
> > > (Not a critique of this, just a (stupid) question)
> > > 
> > > What's the practical use-case for this feature? Since it doesn't
> > > help with --abbrev=40 the speedup is all in the part that ensures we
> > > don't show an ambiguous SHA-1.
> > 
> > The point of including the --abbrev=40 is to point out that object
> > lookups do not get slower with the MIDX feature. Using these "git log"
> > options is a good way to balance object lookups and abbreviations with
> > object parsing and diff machinery. And while the public data shape I
> > shared did not show a difference, our private testing of the Windows
> > repository did show a valuable improvement when isolating to object
> > lookups and ignoring abbreviation calculations.
> 
> Just to make sure I'm parsing this correctly: normal lookups do get
> faster when you have a single index, given the right setup?
> 
> I'm curious what that setup looked like. Is it just tons and tons of
> packs? Is it ones where the packs do not follow the mru patterns very
> well?
> 
> I think it's worth thinking a bit about, because...
> 
> > > If something cares about both throughput and e.g. is saving the
> > > abbreviated SHA-1s isn't it better off picking some arbitrary size
> > > (e.g. --abbrev=20), after all the default abbreviation is going to show
> > > something as small as possible, which may soon become ambiguous after the
> > > next commit.
> > 
> > Unfortunately, with the way the abbreviation algorithms work, using
> > --abbrev=20 will have similar performance problems because you still need to
> > inspect all packfiles to ensure there isn't a collision in the first 20 hex
> > characters.
> 
> ...if what we primarily care about speeding up is abbreviations, is it
> crazy to consider disabling the disambiguation step entirely?

Not crazy. But it would break stuff. Because...

> The results of find_unique_abbrev are already a bit of a probability
> game. They're guaranteed at the moment of generation, but as more
> objects are added, ambiguities may be introduced. Likewise, what's
> unambiguous for you may not be for somebody else you're communicating
> with, if they have their own clone.

... this is only a probability game in the long term, when you consider
new objects to enter from *somewhere*.

But in purely local settings, when we expect no new objects to be
introduced, we do use known-unambiguous abbreviations.

Take the interactive rebase for example. It generates todo lists with
abbreviated commit names, for readability (and it is *really* important to
keep this readable). As we expect new objects to be introduced by the
interactive rebase, we convert that todo list to unabbreviated commit
names before executing the interactive rebase.

Your idea (to not care about unambiguous abbreviations) would break that.

Ciao,
Dscho


Re: [RFC PATCH 00/18] Multi-pack index (MIDX)

2018-01-08 Thread Ævar Arnfjörð Bjarmason

On Mon, Jan 08 2018, Jeff King jotted:

> On Mon, Jan 08, 2018 at 05:20:29AM -0500, Jeff King wrote:
>
>> I.e., what if we did something like this:
>>
>> diff --git a/sha1_name.c b/sha1_name.c
>> index 611c7d24dd..04c661ba85 100644
>> --- a/sha1_name.c
>> +++ b/sha1_name.c
>> @@ -600,6 +600,15 @@ int find_unique_abbrev_r(char *hex, const unsigned char *sha1, int len)
>>  	if (len == GIT_SHA1_HEXSZ || !len)
>>  		return GIT_SHA1_HEXSZ;
>>  
>> +	/*
>> +	 * A default length of 10 implies a repository big enough that it's
>> +	 * getting expensive to double check the ambiguity of each object,
>> +	 * and the chance that any particular object of interest has a
>> +	 * collision is low.
>> +	 */
>> +	if (len >= 10)
>> +		return len;
>> +
>
> Oops, this really needs to terminate the string in addition to returning
> the length (so it was always printing 40 characters in most cases). The
> correct patch is below, but it performs the same.
>
> diff --git a/sha1_name.c b/sha1_name.c
> index 611c7d24dd..5921298a80 100644
> --- a/sha1_name.c
> +++ b/sha1_name.c
> @@ -600,6 +600,17 @@ int find_unique_abbrev_r(char *hex, const unsigned char *sha1, int len)
>  	if (len == GIT_SHA1_HEXSZ || !len)
>  		return GIT_SHA1_HEXSZ;
>  
> +	/*
> +	 * A default length of 10 implies a repository big enough that it's
> +	 * getting expensive to double check the ambiguity of each object,
> +	 * and the chance that any particular object of interest has a
> +	 * collision is low.
> +	 */
> +	if (len >= 10) {
> +		hex[len] = 0;
> +		return len;
> +	}
> +
>  	mad.init_len = len;
>  	mad.cur_len = len;
>  	mad.hex = hex;

That looks much more sensible, leaving aside other potential benefits of
MIDX.

Given the argument Linus made in e6c587c733 ("abbrev: auto size the
default abbreviation", 2016-09-30) maybe we should add a small integer
to the length for good measure, i.e. something like:

	if (len >= 10) {
		int extra = 2; /* or just 1? or maybe 0 ... */
		hex[len + extra] = 0;
		return len + extra;
	}

I tried running:

git log --pretty=format:%h --abbrev=7 | perl -nE 'chomp; say length' | sort | uniq -c | sort -nr

On several large repos, which forces something like the disambiguation
we had before Linus's patch, on e.g. David Turner's
2015-04-03-1M-git.git test repo it's:

 952858 7
  44541 8
   2861 9
    168 10
     17 11
      2 12

And the default abbreviation picks 12. I haven't yet found a case where
it's wrong, but if we wanted to be extra safe we could just add a byte
or two to the SHA-1.


Re: [RFC PATCH 00/18] Multi-pack index (MIDX)

2018-01-08 Thread Ævar Arnfjörð Bjarmason

On Mon, Jan 08 2018, Derrick Stolee jotted:

> On 1/7/2018 5:42 PM, Ævar Arnfjörð Bjarmason wrote:
>> If something cares about both throughput and e.g. is saving the
>> abbreviated SHA-1s isn't it better off picking some arbitrary size
>> (e.g. --abbrev=20), after all the default abbreviation is going to show
>> something as small as possible, which may soon become ambiguous after the
>> next commit.
>
> Unfortunately, with the way the abbreviation algorithms work, using
> --abbrev=20 will have similar performance problems because you still
> need to inspect all packfiles to ensure there isn't a collision in the
> first 20 hex characters.

I meant (but forgot to write) that this would be some new mode,
e.g. --abbrev=20 --no-abbrev-check which would just perform a substr()
of the 40 character SHA-1. It might be interesting to add that for
reasons completely unrelated to your series.

Thanks for answering the rest.


Re: [RFC PATCH 00/18] Multi-pack index (MIDX)

2018-01-08 Thread Jeff King
On Mon, Jan 08, 2018 at 05:20:29AM -0500, Jeff King wrote:

> I.e., what if we did something like this:
> 
> diff --git a/sha1_name.c b/sha1_name.c
> index 611c7d24dd..04c661ba85 100644
> --- a/sha1_name.c
> +++ b/sha1_name.c
> @@ -600,6 +600,15 @@ int find_unique_abbrev_r(char *hex, const unsigned char *sha1, int len)
>  	if (len == GIT_SHA1_HEXSZ || !len)
>  		return GIT_SHA1_HEXSZ;
>  
> +	/*
> +	 * A default length of 10 implies a repository big enough that it's
> +	 * getting expensive to double check the ambiguity of each object,
> +	 * and the chance that any particular object of interest has a
> +	 * collision is low.
> +	 */
> +	if (len >= 10)
> +		return len;
> +

Oops, this really needs to terminate the string in addition to returning
the length (so it was always printing 40 characters in most cases). The
correct patch is below, but it performs the same.

diff --git a/sha1_name.c b/sha1_name.c
index 611c7d24dd..5921298a80 100644
--- a/sha1_name.c
+++ b/sha1_name.c
@@ -600,6 +600,17 @@ int find_unique_abbrev_r(char *hex, const unsigned char *sha1, int len)
 	if (len == GIT_SHA1_HEXSZ || !len)
 		return GIT_SHA1_HEXSZ;
 
+	/*
+	 * A default length of 10 implies a repository big enough that it's
+	 * getting expensive to double check the ambiguity of each object,
+	 * and the chance that any particular object of interest has a
+	 * collision is low.
+	 */
+	if (len >= 10) {
+		hex[len] = 0;
+		return len;
+	}
+
 	mad.init_len = len;
 	mad.cur_len = len;
 	mad.hex = hex;
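The termination bug being fixed here can be demonstrated in miniature: `hex[]` arrives holding the full 40-character name, so returning early without writing a NUL leaves the caller printing all 40 characters. This is an illustrative reduction, not git's code; the function names are made up.

```c
#include <string.h>

#define HEXSZ 40

/* bug: the buffer is left untouched, so the caller still sees 40 chars */
static int abbrev_return_only(char *hex, int len)
{
	(void)hex;
	if (len >= 10)
		return len;
	return HEXSZ;
}

/* fix: cut the string at the requested length before returning */
static int abbrev_terminated(char *hex, int len)
{
	if (len >= 10) {
		hex[len] = 0;
		return len;
	}
	return HEXSZ;
}
```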


Re: [RFC PATCH 00/18] Multi-pack index (MIDX)

2018-01-08 Thread Jeff King
On Sun, Jan 07, 2018 at 07:08:54PM -0500, Derrick Stolee wrote:

> > (Not a critique of this, just a (stupid) question)
> > 
> > What's the practical use-case for this feature? Since it doesn't help
> > with --abbrev=40 the speedup is all in the part that ensures we don't
> > show an ambiguous SHA-1.
> 
> The point of including the --abbrev=40 is to point out that object lookups
> do not get slower with the MIDX feature. Using these "git log" options is a
> good way to balance object lookups and abbreviations with object parsing and
> diff machinery. And while the public data shape I shared did not show a
> difference, our private testing of the Windows repository did show a
> valuable improvement when isolating to object lookups and ignoring
> abbreviation calculations.

Just to make sure I'm parsing this correctly: normal lookups do get faster
when you have a single index, given the right setup?

I'm curious what that setup looked like. Is it just tons and tons of
packs? Is it ones where the packs do not follow the mru patterns very
well?

I think it's worth thinking a bit about, because...

> > If something cares about both throughput and e.g. is saving the
> > abbreviated SHA-1s isn't it better off picking some arbitrary size
> > (e.g. --abbrev=20), after all the default abbreviation is going to show
> > something as small as possible, which may soon become ambiguous after the
> > next commit.
> 
> Unfortunately, with the way the abbreviation algorithms work, using
> --abbrev=20 will have similar performance problems because you still need to
> inspect all packfiles to ensure there isn't a collision in the first 20 hex
> characters.

...if what we primarily care about speeding up is abbreviations, is it
crazy to consider disabling the disambiguation step entirely?

The results of find_unique_abbrev are already a bit of a probability
game. They're guaranteed at the moment of generation, but as more
objects are added, ambiguities may be introduced. Likewise, what's
unambiguous for you may not be for somebody else you're communicating
with, if they have their own clone.

Since we now scale the default abbreviation with the size of the repo,
that gives us a bounded and pretty reasonable probability that we won't
hit a collision at all[1].

I.e., what if we did something like this:

diff --git a/sha1_name.c b/sha1_name.c
index 611c7d24dd..04c661ba85 100644
--- a/sha1_name.c
+++ b/sha1_name.c
@@ -600,6 +600,15 @@ int find_unique_abbrev_r(char *hex, const unsigned char *sha1, int len)
 	if (len == GIT_SHA1_HEXSZ || !len)
 		return GIT_SHA1_HEXSZ;
 
+	/*
+	 * A default length of 10 implies a repository big enough that it's
+	 * getting expensive to double check the ambiguity of each object,
+	 * and the chance that any particular object of interest has a
+	 * collision is low.
+	 */
+	if (len >= 10)
+		return len;
+
 	mad.init_len = len;
 	mad.cur_len = len;
 	mad.hex = hex;

If I repack my linux.git with "--max-pack-size=64m", I get 67 packs. The
patch above drops "git log --oneline --raw" on the resulting repo from
~150s to ~30s.

With a single pack, it goes from ~33s to ~29s. Less impressive, but there's
still some benefit.

There may be other reasons to want MIDX or something like it, but I just
wonder if we can do this much simpler thing to cover the abbreviation
case. I guess the question is whether somebody is going to be annoyed in
the off chance that they hit a collision.

-Peff

[1] I'd have to check the numbers, but IIRC we've set the scaling so
that the chance of having a _single_ collision in the repository is
less than 50%, and rounding to the conservative side (since each hex
char gives us 4 bits). And indeed, "git log --oneline --raw" on
linux.git does not seem to have any collisions at its default of 12
characters, at least in my clone.

We could also consider switching core.disambiguate to "commit",
which makes even a collision less likely to annoy the user.
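The back-of-the-envelope behind footnote [1] is a birthday bound: with n objects and L hex digits (4 bits each, so 16^L buckets), the expected number of colliding pairs is n*(n-1)/2 divided by 16^L, and when that is well below one, a collision anywhere in the repository is unlikely. A rough model for intuition only, not git's actual scaling heuristic:

```c
/* Expected number of abbreviated-name collisions among n objects when
 * truncating to `hexlen` hex digits, per the birthday bound. */
static double expected_collisions(double n, int hexlen)
{
	double buckets = 1.0;
	int i;

	for (i = 0; i < hexlen; i++)
		buckets *= 16.0;
	return n * (n - 1.0) / 2.0 / buckets;
}
```

At roughly linux.git scale (a few million objects), 7 digits makes some collision all but certain, while 12 digits keeps the expected count comfortably below one, consistent with the observation above.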


Re: [RFC PATCH 00/18] Multi-pack index (MIDX)

2018-01-07 Thread Derrick Stolee

On 1/7/2018 5:42 PM, Ævar Arnfjörð Bjarmason wrote:


On Sun, Jan 07 2018, Derrick Stolee jotted:


 git log --oneline --raw --parents

Num Packs | Before MIDX | After MIDX |  Rel % | 1 pack %
----------+-------------+------------+--------+---------
        1 |     35.64 s |    35.28 s |  -1.0% |   -1.0%
       24 |     90.81 s |    40.06 s | -55.9% |  +12.4%
      127 |    257.97 s |    42.25 s | -83.6% |  +18.6%

The last column is the relative difference between the MIDX-enabled repo
and the single-pack repo. The goal of the MIDX feature is to present the
ODB as if it was fully repacked, so there is still room for improvement.

Changing the command to

 git log --oneline --raw --parents --abbrev=40

has no observable difference (sub 1% change in all cases). This is likely
due to the repack I used putting commits and trees in a small number of
packfiles so the MRU cache works very well. On more naturally-created
lists of packfiles, there can be up to 20% improvement on this command.

We are using a version of this patch with an upcoming release of GVFS.
This feature is particularly important in that space since GVFS performs
a "prefetch" step that downloads a pack of commits and trees on a daily
basis. These packfiles are placed in an alternate that is shared by all
enlistments. Some users have 150+ packfiles and the MRU misses and
abbreviation computations are significant. Now, GVFS manages the MIDX file
after adding new prefetch packfiles using the following command:

 git midx --write --update-head --delete-expired --pack-dir=


(Not a critique of this, just a (stupid) question)

What's the practical use-case for this feature? Since it doesn't help
with --abbrev=40 the speedup is all in the part that ensures we don't
show an ambiguous SHA-1.


The point of including the --abbrev=40 is to point out that object 
lookups do not get slower with the MIDX feature. Using these "git log" 
options is a good way to balance object lookups and abbreviations with 
object parsing and diff machinery. And while the public data shape I 
shared did not show a difference, our private testing of the Windows 
repository did show a valuable improvement when isolating to object 
lookups and ignoring abbreviation calculations.



The reason we do that at all is because it makes for a prettier UI.


We tried setting core.abbrev=40 on GVFS enlistments to speed up 
performance and the users rebelled against the hideous output. They 
would rather have slower speeds than long hashes.



Are there things that both want the pretty SHA-1 and also care about the
throughput? I'd have expected machine parsing to just use
--no-abbrev-commit.


The --raw flag outputs blob hashes, so the --abbrev=40 covers all hashes.


If something cares about both throughput and e.g. is saving the
abbreviated SHA-1s isn't it better off picking some arbitrary size
(e.g. --abbrev=20), after all the default abbreviation is going to show
something as small as possible, which may soon become ambiguous after the
next commit.


Unfortunately, with the way the abbreviation algorithms work, using 
--abbrev=20 will have similar performance problems because you still 
need to inspect all packfiles to ensure there isn't a collision in the 
first 20 hex characters.


Thanks,
-Stolee




Re: [RFC PATCH 00/18] Multi-pack index (MIDX)

2018-01-07 Thread Ævar Arnfjörð Bjarmason

On Sun, Jan 07 2018, Derrick Stolee jotted:

> git log --oneline --raw --parents
>
> Num Packs | Before MIDX | After MIDX |  Rel % | 1 pack %
> ----------+-------------+------------+--------+----------
>         1 |     35.64 s |    35.28 s |  -1.0% |    -1.0%
>        24 |     90.81 s |    40.06 s | -55.9% |   +12.4%
>       127 |    257.97 s |    42.25 s | -83.6% |   +18.6%
>
> The last column is the relative difference between the MIDX-enabled repo
> and the single-pack repo. The goal of the MIDX feature is to present the
> ODB as if it was fully repacked, so there is still room for improvement.
>
> Changing the command to
>
> git log --oneline --raw --parents --abbrev=40
>
> has no observable difference (sub 1% change in all cases). This is likely
> due to the repack I used putting commits and trees in a small number of
> packfiles so the MRU cache works very well. On more naturally-created
> lists of packfiles, there can be up to 20% improvement on this command.
>
> We are using a version of this patch with an upcoming release of GVFS.
> This feature is particularly important in that space since GVFS performs
> a "prefetch" step that downloads a pack of commits and trees on a daily
> basis. These packfiles are placed in an alternate that is shared by all
> enlistments. Some users have 150+ packfiles and the MRU misses and
> abbreviation computations are significant. Now, GVFS manages the MIDX file
> after adding new prefetch packfiles using the following command:
>
> git midx --write --update-head --delete-expired --pack-dir=

(Not a critique of this, just a (stupid) question)

What's the practical use-case for this feature? Since it doesn't help
with --abbrev=40 the speedup is all in the part that ensures we don't
show an ambiguous SHA-1.

The reason we do that at all is because it makes for a prettier UI.

Are there things that both want the pretty SHA-1 and also care about the
throughput? I'd have expected machine parsing to just use
--no-abbrev-commit.

If something cares about both throughput and e.g. is saving the
abbreviated SHA-1s isn't it better off picking some arbitrary size
(e.g. --abbrev=20), after all the default abbreviation is going to show
something as small as possible, which may soon become ambiguous after the
next commit.


[RFC PATCH 00/18] Multi-pack index (MIDX)

2018-01-07 Thread Derrick Stolee
This RFC includes a new way to index the objects in multiple packs
using one file, called the multi-pack index (MIDX).
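
Conceptually (a hypothetical Python sketch under simplified assumptions,
not the MIDX file format — packs are modeled as dicts mapping hex object
IDs to offsets): a multi-pack index merges the sorted object lists of
all packs into one table of (oid, pack, offset) entries, so a single
binary search replaces one search per pack.

```python
import bisect

def build_midx(packs):
    """packs: list of dicts mapping hex object ID -> pack offset.
    Returns one merged, sorted table of (oid, pack_index, offset)."""
    table = []
    for pack_id, pack in enumerate(packs):
        for oid, offset in pack.items():
            table.append((oid, pack_id, offset))
    table.sort()
    return table

def midx_lookup(table, oid):
    """One binary search over all packs, instead of one per pack."""
    pos = bisect.bisect_left(table, (oid,))
    if pos < len(table) and table[pos][0] == oid:
        _, pack_id, offset = table[pos]
        return pack_id, offset
    return None

packs = [
    {"1f3a9c": 512, "9e2d11": 1024},
    {"2b4c77": 12, "77aa00": 4096},
]
midx = build_midx(packs)
print(midx_lookup(midx, "77aa00"))  # prints (1, 4096)
```

The point of the single sorted table is that lookup cost stops scaling
with the number of packs, which is the "present the ODB as if it was
fully repacked" goal described below.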

The commits are split into parts as follows:

[01] - A full design document.

[02] - The full file format for MIDX files.

[03] - Creation of core.midx config setting.

[04-12] - Creation of "midx" builtin that writes, reads, and deletes
  MIDX files.

[13-18] - Consume MIDX files for abbreviations and object loads.

The main goals of this RFC are:

* Determine interest in this feature.

* Find other use cases for the MIDX feature.

* Design a proper command-line interface for constructing and checking
  MIDX files. The current "midx" builtin is probably inadequate.

* Determine what additional changes are needed before the feature can
  be merged. Specifically, I'm interested in the interactions with
  repack and fsck. The current patch also does not update the MIDX on
  a fetch (which adds a packfile) but would be valuable. Whenever
  possible, I tried to leave out features that could be added in a
  later patch.

* Consider splitting this patch into multiple patches, such as:

    i. The MIDX design document.
   ii. The command-line interface for building and reading MIDX files.
  iii. Integrations with abbreviations and object lookups.

Please do not send any style nits to this patch, as I expect the code to
change dramatically before we consider merging.

I created three copies of the Linux repo with 1, 24, and 127 packfiles
each using 'git repack -adfF --max-pack-size=[64m|16m]'. These copies
gave significant performance improvements on the following command:

git log --oneline --raw --parents

Num Packs | Before MIDX | After MIDX |  Rel % | 1 pack %
----------+-------------+------------+--------+----------
        1 |     35.64 s |    35.28 s |  -1.0% |    -1.0%
       24 |     90.81 s |    40.06 s | -55.9% |   +12.4%
      127 |    257.97 s |    42.25 s | -83.6% |   +18.6%

The last column is the relative difference between the MIDX-enabled repo
and the single-pack repo. The goal of the MIDX feature is to present the
ODB as if it was fully repacked, so there is still room for improvement.

Changing the command to

git log --oneline --raw --parents --abbrev=40

has no observable difference (sub 1% change in all cases). This is likely
due to the repack I used putting commits and trees in a small number of
packfiles so the MRU cache works very well. On more naturally-created
lists of packfiles, there can be up to 20% improvement on this command.

We are using a version of this patch with an upcoming release of GVFS.
This feature is particularly important in that space since GVFS performs
a "prefetch" step that downloads a pack of commits and trees on a daily
basis. These packfiles are placed in an alternate that is shared by all
enlistments. Some users have 150+ packfiles and the MRU misses and
abbreviation computations are significant. Now, GVFS manages the MIDX file
after adding new prefetch packfiles using the following command:

git midx --write --update-head --delete-expired --pack-dir=

As that release deploys we will gather more specific numbers on the
performance improvements and report them in this thread.

Derrick Stolee (18):
  docs: Multi-Pack Index (MIDX) Design Notes
  midx: specify midx file format
  midx: create core.midx config setting
  midx: write multi-pack indexes for an object list
  midx: create midx builtin with --write mode
  midx: add t5318-midx.sh test script
  midx: teach midx --write to update midx-head
  midx: teach git-midx to read midx file details
  midx: find details of nth object in midx
  midx: use existing midx when writing
  midx: teach git-midx to clear midx files
  midx: teach git-midx to delete expired files
  t5318-midx.h: confirm git actions are stable
  midx: load midx files when loading packs
  midx: use midx for approximate object count
  midx: nth_midxed_object_oid() and bsearch_midx()
  sha1_name: use midx for abbreviations
  packfile: use midx for object loads

 .gitignore   |   1 +
 Documentation/config.txt |   3 +
 Documentation/git-midx.txt   | 106 
 Documentation/technical/multi-pack-index.txt | 149 +
 Documentation/technical/pack-format.txt  |  85 +++
 Makefile |   2 +
 builtin.h|   1 +
 builtin/midx.c   | 352 +++
 cache.h  |   1 +
 command-list.txt |   1 +
 config.c |   5 +
 environment.c|   2 +
 git.c|   1 +
 midx.c   | 850 +++
 midx.h   | 136 +
 packfile.c   |  79 ++-
 packfile.h   |   2 +
 sha1_n