Am 11.08.2018 um 19:23 schrieb Jeff King:
On Sat, Aug 11, 2018 at 01:02:48PM -0400, Jeff King wrote:
- we could probably improve the speed of oidset. Two things I notice
about its implementation:
Before any optimizations, my best-of-five timing for:
git cat-file --batch-all-objects --unordered --buffer \
in git.git was:
That's enumerating every object in the repo but not doing much more than
de-duping the names and printing them.
So the cost of a single hash lookup per object does not seem to be
measurable, and thus teaching oidset_insert() to do a single hash
lookup for check-and-insert is unlikely to help us.
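For reference, a check-and-insert that reports "already present" from the
same probe that performs the insertion might look like the sketch below.
This is a standalone toy (a chained set over plain integers standing in
for object IDs); the names are hypothetical and none of it is Git's
actual oidset code:

```c
#include <assert.h>
#include <stdlib.h>

#define NBUCKET 64

struct entry { unsigned key; struct entry *next; };
struct set { struct entry *bucket[NBUCKET]; };

/* Returns 1 if key was already in the set, 0 if it was inserted now.
 * One walk of the chain serves both the membership test and the
 * insertion, where a contains()-then-insert() pair would walk twice. */
static int set_check_and_insert(struct set *s, unsigned key)
{
	struct entry **pos = &s->bucket[key % NBUCKET];
	for (; *pos; pos = &(*pos)->next)
		if ((*pos)->key == key)
			return 1;
	*pos = calloc(1, sizeof(**pos));
	(*pos)->key = key;
	return 0;
}
```

The point of the measurement above is that even this saved probe is
lost in the noise for the cat-file workload.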
On top of that, I tried using a pool to store the set entries:
That drops my best-of-five to:
which is over a 25% speedup. So that does seem worth pursuing.
For reference, the oid_array code path for cat-file is still:
but that's not completely apples to apples. The oidset code is also
iterating the packfiles in a different order and generating a revidx
(which I know is about 25ms in this repo). So a better test would
actually swap one data structure out for the other with no other changes
(I just happened to have this test handy, and really wanted to know
whether the mem_pool stuff would help).
If the current oidset implementation is so bad, why not replace it with
one based on oid_array? ;-)
Intuitively I'd try a hashmap with no payload and open addressing via
sha1hash(), which should reduce memory allocations quite a bit -- no
need to store hash codes and next pointers, only an array of object IDs
with a fill rate of 50% or so. Deletions are a bit awkward with that
scheme, though; they could perhaps be implemented as insertions into a
second set that records removed entries.
Balancing might become a problem. Your cat-file example should be fine,
but someone could decide to add only hashes with a certain prefix to
their skiplist, and lookups would become lopsided.
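The open-addressed scheme described above might be sketched like this.
It is illustrative only: the table size is fixed, an all-zero oid marks
an empty slot, the hash is simply the first oid bytes (as Git's
sha1hash() effectively uses), and a real version would resize to keep
the fill rate near 50% instead of asserting on it:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define OID_RAWSZ 20
#define TABLE_SIZE 64 /* power of two */

struct object_id { unsigned char hash[OID_RAWSZ]; };

struct oa_oidset {
	struct object_id slot[TABLE_SIZE]; /* all-zero == empty */
	unsigned nr;
};

/* First four bytes of the oid, as sha1hash() effectively does. */
static uint32_t oid_hash(const struct object_id *oid)
{
	uint32_t h;
	memcpy(&h, oid->hash, sizeof(h));
	return h;
}

static int oid_is_null(const struct object_id *oid)
{
	static const struct object_id null_oid;
	return !memcmp(oid, &null_oid, sizeof(*oid));
}

/* Returns 1 if oid was already present, 0 if it was inserted.  Only
 * the object IDs themselves are stored -- no hash codes, no next
 * pointers -- so memory per entry is just sizeof(struct object_id). */
static int oa_insert(struct oa_oidset *set, const struct object_id *oid)
{
	uint32_t i = oid_hash(oid) & (TABLE_SIZE - 1);

	assert(set->nr < TABLE_SIZE / 2); /* keep fill rate below 50% */
	while (!oid_is_null(&set->slot[i])) {
		if (!memcmp(&set->slot[i], oid, sizeof(*oid)))
			return 1;
		i = (i + 1) & (TABLE_SIZE - 1); /* linear probing */
	}
	set->slot[i] = *oid;
	set->nr++;
	return 0;
}
```

The lopsidedness worry above applies here: if inserted oids cluster
under one prefix, the probe chains in that region of the table grow
long while the rest stays empty.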
But first we'd need something like test-sha1-array for oidset and
some kind of performance tests based on these helpers, right?