Re: Questions about the hash function transition

Derrick Stolee Wed, 29 Aug 2018 06:27:50 -0700

On 8/29/2018 9:09 AM, Johannes Schindelin wrote:

Hi Jonathan,


On Tue, 28 Aug 2018, Jonathan Nieder wrote:

Johannes Schindelin wrote:

On Thu, 23 Aug 2018, Jonathan Nieder wrote:

Ævar Arnfjörð Bjarmason wrote:

Are we going to need a midx version of these mapping files? How does
midx fit into this picture? Perhaps it's too obscure to worry
about...

That's a great question!  I think the simplest answer is to have a
midx only for the primary object format and fall back to using
ordinary idx files for the others.

The midx format already has a field for hash function (thanks,
Derrick!).

Related: I wondered whether we could simply leverage the midx code for
the bidirectional SHA-1 <-> SHA-256 mapping, as it strikes me as very
similar in concept and challenges.

Interesting: tell me more.

My first instinct is to prefer the idx-based design that is already
described in the design doc.  If we want to change that, we should
have a motivating reason.

Midx is designed to be optional and to not necessarily cover all
objects, so it doesn't seem like a good fit.

It is optional, but shouldn't this mode where a Git repo that needs toknow about two different versions of all files be optional? Or at leasttemporary?

The multi-pack-index is intended to cover all packed objects, so coversthe same number of objects as an IDX-based strategy. If we arerebuilding the repo from scratch by translating the hashes, then "beingtoo big to repack" is probably not a problem, so we would expect asingle IDX file anyway.

In my opinion, whatever we do for the IDX-based approach will need to beduplicated in the multi-pack-index. The multi-pack-index does have anatural mechanism (optional chunks) for inserting this data withoutincrementing the version number.

Right.

What I meant was to leverage the midx code, not the .midx files.

My comment was motivated by my realizing that both the SHA-1 <-> SHA-256
mapping and the MIDX code have to look up (in a *fast* way) information
with hash values as keys. *And* this information is immutable. *And* the
amount of information should grow with new objects being added to the
database.

I'm unsure what this means, as the multi-pack-index simply usesbsearch_hash() to find hashes in the list. The same method is used forIDX lookups.

I know that Stolee performed a bit of performance testing regarding
different data structures to use in MIDX. We could benefit from that
testing by using not only the results from those tests, but also the code.

I did test ways to use something other than bsearch_hash(), such asusing a 65,536-entry fanout table for lookups using the first two bytesof a hash (tl;dr: it speeds things up a bit, but the super-smallimprovement is probably not worth the space and complexity). I've alsotoyed with the idea of using interpolation search inside bsearch_hash(),but I haven't had time to do that.

IIRC one of the insights was that packs are a natural structure that
can be used for the MIDX mapping, too (you could, for example, store the
SHA-1 <-> SHA-256 mapping *only* for objects inside packs, and re-generate
them on the fly for loose objects all the time).

Stolee can speak with much more competence and confidence about this,
though, whereas all of what I said above is me waving my hands quite
frantically.

I understand the hesitation to pair such an important feature (hashtransition) to a feature that hasn't even shipped. We will need to seehow things progress on both fronts to see how mature themulti-pack-index is when we need this transition table.


Thanks,

-Stolee

Re: Questions about the hash function transition

Reply via email to