On Tue, Sep 6, 2016 at 5:44 AM, Junio C Hamano <gits...@pobox.com> wrote:
> By the way, not running xdiff twice would also remove another worry
> I have about correctness, in that the approach depends on xdiff
> machinery to produce byte-for-byte identical result given the same
> pair of input.
As we use different parameters to the xdiff machinery (e.g. context = 1 line)
the output is not byte-for-byte identical.
> The output may currently be reproducible, but that
> is an unnecessary and an expensive thing to rely on.
My original design was to not store the lines in the hashmap but
only pointers to them, such that the additional memory pressure was
assumed less than storing the whole output of the xdiff machinery.
That point is moot though in the current implementation, so it
would be better indeed if we run the xdiff machinery once and store
all its output and then operate on that, even from a memory perspective.
> You may be able to save tons of memory if you do not store the line
> contents duplicated. The first pass callback can tell the line
> numbers in preimage and postimage [*1*], so your record for a
> removed line could be a pair <struct diff_filespec *, long lineno>
> with whatever hash value you need to throw it into the hash bucket.
Yeah I guess I'll go that way in the next patch then.
> I know we use a hash function and a line comparison function that
> are aware of -b/-w comparison in xdiff machinery, but I didn't see
> you use them in your hashtable. Can you match moved lines when
> operating under various whitespace-change-ignoring modes?
> *1* You can learn all sort of things from emit_callback structure;
> if you need to pass more data from the caller of xdi_diff_outf()
> to the callback, you can even add new fields to it.