On Sat, Apr 08, 2017 at 04:06:41PM +0200, René Scharfe wrote:

> > +           } else if (i > 1 && are_same_oid(&names[i], &names[i-2])) {
> > +                   /* implicitly borrow buf[i-2] inside tree_desc[i] */
> > +                   memcpy(&t[i], &t[i-2], sizeof(struct tree_desc));
> 
> Similar case.
> 
> > +                   buf[i] = NULL;
> 
> What makes the previous two entries special, or differently put: Why not
> just check *all* previous entries?  MAX_UNPACK_TREES is 8; the number of
> comparisons would just about double in the worst case and stay the same for
> three trees or less.  The order of four trees or more wouldn't matter
> anymore.

If I understand correctly, we've got N (up to 8) trees, and we want to
find sets of duplicates. Adjacency doesn't matter. Aren't our options
basically:

  - look through all previous entries naively, O(N^2)

  - sort the list, O(N log N); but we need the original order, so we'd
    have to unsort at the end, too

  - use a hash table, O(N) but with O(N) extra storage

I know N=8 isn't very big algorithmically speaking, but it would feel
ugly to introduce something quadratic. Especially the limit of 8 seems
rather arbitrary. In normal cases we see at most a 3-way merge. Beyond
that we're in octopus territory, and 8 is way too low there (I think the
octopus strategy just unpacks one at a time and barfs if there are any
conflicts).

I assume the rationale behind "check the last 2" is just "this
optimization will kick in reliably for all sane cases", weird octopus
unpack-trees be damned (though reading ca885a4fe, it sounds like 4-trees
can actually happen, but I'm not clear on how).

-Peff

Reply via email to