On 8/29/2018 9:09 AM, Johannes Schindelin wrote:
Hi Jonathan,

On Tue, 28 Aug 2018, Jonathan Nieder wrote:

Johannes Schindelin wrote:
On Thu, 23 Aug 2018, Jonathan Nieder wrote:
Ævar Arnfjörð Bjarmason wrote:
Are we going to need a midx version of these mapping files? How does
midx fit into this picture? Perhaps it's too obscure to worry
about...
That's a great question!  I think the simplest answer is to have a
midx only for the primary object format and fall back to using
ordinary idx files for the others.

The midx format already has a field for hash function (thanks,
Derrick!).
Related: I wondered whether we could simply leverage the midx code for
the bidirectional SHA-1 <-> SHA-256 mapping, as it strikes me as very
similar in concept and challenges.
Interesting: tell me more.

My first instinct is to prefer the idx-based design that is already
described in the design doc.  If we want to change that, we should
have a motivating reason.

Midx is designed to be optional and to not necessarily cover all
objects, so it doesn't seem like a good fit.

It is optional, but shouldn't this mode where a Git repo that needs to know about two different versions of all files be optional? Or at least temporary?

The multi-pack-index is intended to cover all packed objects, so covers the same number of objects as an IDX-based strategy. If we are rebuilding the repo from scratch by translating the hashes, then "being too big to repack" is probably not a problem, so we would expect a single IDX file anyway.

In my opinion, whatever we do for the IDX-based approach will need to be duplicated in the multi-pack-index. The multi-pack-index does have a natural mechanism (optional chunks) for inserting this data without incrementing the version number.

Right.

What I meant was to leverage the midx code, not the .midx files.

My comment was motivated by my realizing that both the SHA-1 <-> SHA-256
mapping and the MIDX code have to look up (in a *fast* way) information
with hash values as keys. *And* this information is immutable. *And* the
amount of information should grow with new objects being added to the
database.

I'm unsure what this means, as the multi-pack-index simply uses bsearch_hash() to find hashes in the list. The same method is used for IDX lookups.

I know that Stolee performed a bit of performance testing regarding
different data structures to use in MIDX. We could benefit from that
testing by using not only the results from those tests, but also the code.

I did test ways to use something other than bsearch_hash(), such as using a 65,536-entry fanout table for lookups using the first two bytes of a hash (tl;dr: it speeds things up a bit, but the super-small improvement is probably not worth the space and complexity). I've also toyed with the idea of using interpolation search inside bsearch_hash(), but I haven't had time to do that.

IIRC one of the insights was that packs are a natural structure that
can be used for the MIDX mapping, too (you could, for example, store the
SHA-1 <-> SHA-256 mapping *only* for objects inside packs, and re-generate
them on the fly for loose objects all the time).

Stolee can speak with much more competence and confidence about this,
though, whereas all of what I said above is me waving my hands quite
frantically.

I understand the hesitation to pair such an important feature (hash transition) to a feature that hasn't even shipped. We will need to see how things progress on both fronts to see how mature the multi-pack-index is when we need this transition table.

Thanks,

-Stolee

Reply via email to