> Another research problem: Currently, Fossil only delta-compresses
> versions of the same file. Enhance --compress so that it detects
> different files that happen to share a lot of content and
> delta-encodes them against one another.
Here is an idea for a multi-phase process which hopefully avoids the
inherent O(n**2) nature of the problem as much as possible. A rough
Python sketch of all four phases follows the list.
(1) For each blob (*), scan it with a rolling hash (**) and record the
results in a dictionary H keyed by the generated hash values.
The dict value per hash key is the list of blobs which generate
the hash key in question.
(2) Create a matrix C of counters indexed by pairs of blobs. Hopefully
sparse, so use a dictionary or other sparse data structure again,
and initialize the counters lazily.
Go over the dictionary H of (1). Whenever two blobs A, B occur
together in the list for a hash key, these blobs very likely share
the piece of content which generated that hash. If they occur
together in the lists of multiple hash keys, then they share more.
So let us increment the counter in C for each pair A, B found
together in the list for a hash key, and do that for all hash
keys. This is also O(n**2), however with a (much) smaller n.
At the end, each counter for a pair A, B tells us how much content
the two blobs were sharing across the board.
(3) Now, the counters are NOT directly comparable, as the attainable
numbers are also driven by the sizes of the blobs in question. Two
small blobs generate fewer hash keys, so their maximum counter
values are limited by that.
So we have to normalize the counters, for example by the number of
hashes each blob generated. This introduces an asymmetry for
blobs A, B of unequal size: a smaller A may be very similar to the
larger B (i.e. nearly/fully contained in B), but not the other way
around. So while the raw counts would let us get away with only
n**2/2 counters, after normalization we have to keep both scores,
A vs B, and B vs A.
(4) Last, treat the set of blobs and scores as a directed graph, with
the blobs as nodes and the two arcs between each pair of blobs
weighted by the scores from (3).
Then use some spanning tree algorithm for directed graphs (a
maximum-weight branching) to select the actual edges, i.e. which
blobs get delta-encoded against which others, starting, of course,
with the edges of highest weight, i.e. the most similar blobs.
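
To make the data flow concrete, here is a rough sketch of the four
phases in Python (Fossil itself is C, of course; this is just
pseudo-code that happens to run). The window size, sampling mask,
similarity cutoff, and the greedy edge selection standing in for a
real branching algorithm are all made-up placeholders, not anything
Fossil currently does:

from collections import defaultdict

WINDOW = 16          # rolling-hash window size -- made up for this sketch
SAMPLE_MASK = 0x3F   # keep only ~1/64 of the hashes, to keep H small

def sampled_hashes(data):
    """Phase (1) helper: rolling hash over `data`, subsampled.

    A toy polynomial (Rabin-Karp style) hash; the real thing would
    reuse the delta-encoder's rolling hash (see Ad **)."""
    if len(data) < WINDOW:
        return set()
    B, MOD = 257, (1 << 31) - 1
    shift = pow(B, WINDOW - 1, MOD)
    h = 0
    for c in data[:WINDOW]:
        h = (h * B + c) % MOD
    out = set()
    if (h & SAMPLE_MASK) == 0:
        out.add(h)
    for i in range(WINDOW, len(data)):
        h = ((h - data[i - WINDOW] * shift) * B + data[i]) % MOD
        if (h & SAMPLE_MASK) == 0:
            out.add(h)
    return out

def plan_deltas(blobs):
    """blobs: dict blob_id -> bytes (content).  Returns a dict
    child -> parent: which blob to delta-encode against which other."""
    # Phase (1): H maps hash value -> set of blobs producing it.
    H = defaultdict(set)
    nhashes = {}                          # per-blob hash count, for (3)
    for bid, data in blobs.items():
        hs = sampled_hashes(data)
        nhashes[bid] = len(hs)
        for h in hs:
            H[h].add(bid)

    # Phase (2): sparse "matrix" C of counters, indexed by blob pairs.
    C = defaultdict(int)
    for bids in H.values():
        bids = sorted(bids)
        for i, a in enumerate(bids):
            for b in bids[i + 1:]:
                C[(a, b)] += 1

    # Phase (3): normalize into directed scores.  score[(A, B)] is the
    # fraction of A's sampled hashes also seen in B, so a small A that
    # is nearly contained in a big B scores high, while B vs A does not.
    score = {}
    for (a, b), n in C.items():
        if nhashes[a]:
            score[(a, b)] = n / nhashes[a]
        if nhashes[b]:
            score[(b, a)] = n / nhashes[b]

    # Phase (4): pick delta parents, highest scores first.  This greedy
    # loop is only a stand-in for a proper maximum-weight branching
    # algorithm (e.g. Chu-Liu/Edmonds).  Each blob gets at most one
    # parent, and edges closing a cycle are rejected, so the result
    # stays a forest.
    parent = {}
    def root(x):
        while x in parent:
            x = parent[x]
        return x
    edges = sorted(score.items(), key=lambda kv: kv[1], reverse=True)
    for (child, par), s in edges:
        if s < 0.5:                       # similarity cutoff -- made up
            break
        if child in parent or root(par) == child:
            continue
        parent[child] = par
    return parent

Keeping only the hashes whose low bits are zero is what keeps H and
the pair counting in (2) manageable; the mask trades accuracy for
speed.
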
Even with all that I expect that this will take quite some time,
especially for large repositories.
It might be possible to optimize a bit more by flagging the blobs
which took part in such an operation; on the next run we would then
compare only the not-yet-flagged blobs against all the flagged ones.
That way we would incrementally extend the spanning tree without
touching the existing edges. A sketch of that restricted pair
counting follows below.
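
A minimal sketch of that restriction, reading the above literally
(only new-vs-flagged pairs are counted; H would be built from the
previous run plus the new blobs, and `flagged` is the set of blob ids
handled earlier -- both names are just placeholders):

from collections import defaultdict

def count_pairs_incremental(H, flagged):
    """Phase (2) variant: count only pairs made of one not-yet-flagged
    blob and one already flagged blob.  Pairs handled by earlier runs
    (and, per the description above, new-vs-new pairs) are skipped."""
    C = defaultdict(int)
    for bids in H.values():
        bids = sorted(bids)
        for i, a in enumerate(bids):
            for b in bids[i + 1:]:
                if (a in flagged) == (b in flagged):
                    continue
                C[(a, b)] += 1
    return C
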
(Ad *) The relevant things to compare are the blobs, not files. Many
revisions of different files can refer to the same blob, if
they contain identical content.
(Ad **) Re-use the rolling hash used by the delta-encoder? There it is
used to quickly find the parts of the two files which are
similar to each other.
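
To make the (**) point a bit more concrete: the reason a rolling hash
fits phase (1) is that sliding the window by one byte is an O(1)
update instead of a full rehash, so the scan stays linear in the blob
size. A generic Rabin-Karp style demonstration (a toy hash, not the
one in Fossil's delta encoder):

# The assert checks the O(1) sliding update against a full
# recomputation of the window hash.
B, MOD, W = 257, (1 << 31) - 1, 16
shift = pow(B, W - 1, MOD)

def full_hash(window):
    h = 0
    for c in window:
        h = (h * B + c) % MOD
    return h

data = bytes(range(200)) * 3
h = full_hash(data[:W])
for i in range(W, len(data)):
    h = ((h - data[i - W] * shift) * B + data[i]) % MOD   # O(1) slide
    assert h == full_hash(data[i - W + 1:i + 1])
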
--
So long,
Andreas Kupries <[email protected]>
<http://core.tcl.tk/akupries/>
Developer @ Hewlett Packard Enterprise