Stefan Beller <sbel...@google.com> writes:
> This new coloring is linear to the size of the patch, i.e. O(number of
> added/removed lines) in memory and for computational efforts I'd
> think it is O(n log n) as inserting into the hashmap is an amortized
> log n.
In addition to that O(n log n) overhead for book-keeping, you are
doing at least twice the amount of work compared to the original, if
you are still running the same xdiff twice to implement the two pass
approach. That is why I thought this "twice as slow, at least"
needs to be off by default at least in the beginning.
Also there is the additional memory pressure coming from the fact
that the first pass will need to keep all the removed and added
lines in-core for all filepairs. If you keep the entire diff output
in-core from the first pass, I do not think it would be that much
more memory overhead compared to what you are already doing, so the
cost of running the same xdiff twice is relatively easy to reduce, I
would imagine? Instead of running the second xdi_diff_outf(), you
can drive your callback function out of what has been queued in the
first pass yourself. But that is the next step "optimization" once
we got the basics right.
By the way, not running xdiff twice would also remove another worry
I have about correctness, in that the approach depends on xdiff
machinery to produce byte-for-byte identical result given the same
pair of input. The output may currently be reproducible, but that
is an unnecessary and an expensive thing to rely on.
You may be able to save tons of memory if you do not store the line
contents duplicated. The first pass callback can tell the line
numbers in preimage and postimage [*1*], so your record for a
removed line could be a pair <struct diff_filespec *, long lineno>
with whatever hash value you need to throw it into the hash bucket.
I know we use a hash function and a line comparison function that
are aware of -b/-w comparison in xdiff machinery, but I didn't see
you use them in your hashtable. Can you match moved lines when
operating under various whitespace-change-ignoring modes?
*1* You can learn all sort of things from emit_callback structure;
if you need to pass more data from the caller of xdi_diff_outf()
to the callback, you can even add new fields to it.