Re: We should add a "git gc --auto" after "git clone" due to commit graph

Derrick Stolee Mon, 08 Oct 2018 11:29:54 -0700

On 10/8/2018 2:10 PM, SZEDER Gábor wrote:

On Mon, Oct 08, 2018 at 12:57:34PM -0400, Derrick Stolee wrote:

Nice! These numbers make sense to me, in terms of how many TREESAME queries
we actually need to perform for such a query.

Yeah...  because you didn't notice that I deliberately cheated :)


As it turned out, it's not just about the number of diff queries that
we can spare, but, for the speedup _ratio_, it's more about how
expensive those diff queries are.

git.git has a rather flat hierarchy, and 't/' is the 372th entry in
the current root tree object, while 'valgrind/' is the 923th entry,
and the diff machinery spends considerable time wading through the
previous entries.  Notice the "carefully chosen path" remark in my
previous email; I think this particular path has the highest number of
preceeding tree entries, and, in addition, 't/' changes rather
frequently, so the diff machinery often has to scan two relatively big
tree objects.  Had I chosen 'Documentation/RelNotes/1.5.0.1.txt'
instead, i.e. another path two directories deep, but whose leading
path components are both near the beginning of the tree objects, the
speedup would be much less impressive: 0.282s vs. 0.049s, i.e. "only"
~5.7x instead of ~24.8x.

This is expected. The performance ratio is better when the path is anyof the following:


1. A very deep path (need to walk multiple trees to answer TREESAME)

2. An entry is late in a very wide tree (need to spend extra timeparsing tree object)

3. The path doesn't change very often (need to inspect many TREESAMEpairs before finding enough interesting commits)

4. Some sub-path changes often (so the TREESAME comparison needs toparse beyond that sub-path often)

Our standard examples (Git and Linux repos) don't have many paths thathave these properties. But: they do exist. In other projects, this isactually typical. Think about Java projects that frequently have ~5levels of folders before actually touching a code file.

When I was implementing the Bloom filter feature for Azure Repos, I ranperformance tests on the Linux repo using a random sampling of paths.The typical speedup was 5x while some outliers were in the 25x range.

But I'm afraid it will take a while until I get around to turn it into
something presentable...

Do you have the code pushed somewhere public where one could take a look? I
Do you have the code pushed somewhere public where one could take a
look? I could provide some early feedback.

Nah, definitely not...  I know full well how embarassingly broken this
implementation is, I don't need others to tell me that ;)

There are two questions that I was hoping to answer by looking at your code:

1. How do you store your Bloom filter? Is it connected to thecommit-graph and split on a commit-by-commit basis (storing "$path" as akey), or is it one huge Bloom filter (storing "$commitid:$path" as key)?

2. Where does your Bloom filter check plug into the TREESAME logic? Ihaven't investigated this part, but hopefully it isn't too complicated.


Thanks,
-Stolee

Re: We should add a "git gc --auto" after "git clone" due to commit graph

Reply via email to