Derrick Stolee <> writes:

> On 4/7/2018 2:40 PM, Jakub Narebski wrote:
>> Derrick Stolee <> writes:
>> [...]
>>> On the Linux repository, performance tests were run for the following
>>> command:
>>>      git log --graph --oneline -1000
>>>      Before: 0.92s
>>>      After:  0.66s
>>>      Rel %: -28.3%
>>> Adding '-- kernel/' to the command requires loading the root tree
>>> for every commit that is walked. There was no measureable performance
>>> change as a result of this patch.
>> In the "Git Merge contributor summit notes" [1] one can read that:
>>> - VSTS adds bloom filters to know which paths have changed on the commit
>>> - tree-same check in the bloom filter is fast; speeds up file history checks
>>> - if the file history is _very_ sparse, then bloom filter is useful
>> Could this method speed up also the second case mentioned here?  Can
>> anyone explain how this "path-changed bloom filter" works in VSTS?
> The idea is simple: for every commit, store a Bloom filter containing
> the list of paths that are not TREESAME against the first parent. (A
> slight detail: have a max cap on the number of paths, and store simply
> "TOO_BIG" for commits with too many diffs.)

Isn't Bloom filter with all bits set to 1 equivalent of "too big"? It
would answer each query with "possibly in set, check it".

> When performing 'git log -- path' queries, the most important detail
> for considering how to advance the walk is whether the commit is
> TREESAME to its first parent. For a deep path in a large repo, this is
> almost always true. When a Bloom filter says "TREESAME" (i.e. "this
> path is not in my set") it is always correct, so we can set the
> treesame bit and continue without walking any trees. When a Bloom
> filter says "MAYBE NOT TREESAME" (i.e. "this path is probably in my
> set") you only need to do the same work as before: walk the trees to
> compare against your first parent.
> If a Bloom filter has a false-positive rate of X%, then you can
> possibly drop your number of tree comparisons by (100-X)%. This is
> very important for large repos where some paths were changed only ten
> times or so, the full graph needs to be walked and it is helpful to
> avoid parsing too many trees.

So this works only for exact file pathnames, or does checking for
subdirectory also work?  I guess it cannot work for globbing patterns
(like "git log -- '*.c'").

I guess that it speeds up "git log -- <pathname>", but also "git blame

Sidenote: I have stumbled upon project
(fast-forward Bllom filters - for exact matching of large number of
patterns), where they invent cache-efficient version of Bloom filter
algorithm.  The paper that the project is based on also might be
interesting, if only because it points to other improvements of classic
Bloom filters.

>> Could we add something like this to the commit-graph file? 
> I'm not sure if it is necessary for client-side operations, but it is
> one of the reasons the commit-graph file has the idea of an "optional
> chunk". It could be added to the file format (without changing version
> numbers) and be ignored by clients that don't understand it. I could
> also be gated by a config setting for computing them. My guess is that
> only server-side operations will need the added response time, and can
> bear the cost of computing them when writing the commit-graph
> file. Clients are less likely to be patient waiting for a lot of diff
> calculations.

So the problem is performing large amount of diff calculations.  I
wonder if it would be possible to add Bloom filters incrementally, when
for some reason we need to calculate diff anyway.

> If we add commit-graph file downloads to the protocol, then the server
> could do this computation and send the data to all clients. But that
> would be "secondary" information that maybe clients want to verify,
> which is as difficult as computing it themselves.

Can you tell us how much work (how much time) must spend the server to
calculate those per-commit "files changed" Bloom fillters?

Jakub Narębski

Reply via email to