> Another research problem:  Currently, Fossil only delta-compresses
> versions of the same file.  Enhance --compress so that it detects
> different files that happen to share a lot of content and
> delta-encodes them against one another.

Here is an idea for a multi-phase process meant to avoid as much of
the problem's inherent O(n**2) cost as possible.

(1) For each blob (*), scan it with a rolling hash (**) and record
    the generated hash values in a dictionary H keyed by those
    values.

    The value stored per hash key is the list of blobs which
    generate the hash key in question. (Rough Python sketches of all
    four phases follow after the list.)

(2) Create a matrix C of counters indexed by pairs of blobs. It is
    hopefully sparse, so again use a dictionary or another sparse
    data structure, and initialize the counters lazily.

    Go over the dictionary H of (1). Whenever two blobs A, B occur
    together in the list for a hash key, these blobs very likely
    share the piece of content which generated that hash. If they
    occur together in the lists of multiple hash keys, they share
    even more.

    So we increment the counter in C for each pair A, B found
    together in the list for a hash key, and do that for all hash
    keys. This is also O(n**2), but here n is the length of an
    individual hash key's list, which is (much) smaller than the
    total number of blobs.

    At the end of doing that, the counter for a pair A, B tells us
    how much content the two blobs share across the board.

(3) Now, the counters are NOT directly comparable, as the attainable
    count is also driven by the sizes of the blobs in question. Two
    small blobs generate fewer hash keys, so their maximum counter
    value is limited by that.

    So we have to normalize the counters. There is also an asymmetry
    now, for blobs A, B of unequal size: a smaller A may be very
    similar to the larger B (i.e. nearly or fully contained in B),
    but not the other way around. So while the raw counters would
    need only n**2/2 entries, after normalization we have to keep
    both scores, A vs B and B vs A.

(4) Last, treat the set of blobs and scores as a directed graph,
    with the blobs as nodes and the two arcs between each pair of
    blobs weighted by the scores from (3).

    Then use some spanning-tree algorithm for directed graphs (i.e.
    a maximum-weight branching) to select the actual edges, i.e.
    which blobs get delta-encoded against which others, starting, of
    course, with the edges of highest weight, i.e. the most similar
    blobs.
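
Below are rough, untested Python sketches of the four phases, just to
make the data flow concrete. All names and parameters (WINDOW, BASE,
integer blob ids, ...) are made up for illustration; a real
implementation inside Fossil would be C and would probably reuse the
delta-encoder's rolling hash instead of the toy Rabin-Karp hash used
here.

Phase (1), building the hash dictionary H:

from collections import defaultdict

WINDOW = 16          # rolling hash window size (arbitrary choice)
BASE   = 257         # polynomial base (arbitrary choice)
MOD    = 2**31 - 1   # hash modulus (arbitrary choice)

def rolling_hashes(data):
    """Yield the hash of every WINDOW-sized slice of the byte string."""
    if len(data) < WINDOW:
        return
    # Hash of the first window.
    h = 0
    for byte in data[:WINDOW]:
        h = (h * BASE + byte) % MOD
    yield h
    # Slide the window one byte at a time.
    top = pow(BASE, WINDOW - 1, MOD)
    for i in range(WINDOW, len(data)):
        h = (h - data[i - WINDOW] * top) % MOD   # drop the old byte
        h = (h * BASE + data[i]) % MOD           # add the new byte
        yield h

def build_hash_index(blobs):
    """blobs: blob id -> bytes. Returns H: hash value -> set of blob
    ids. A real implementation would likely sample the hashes (e.g.
    keep only values whose low k bits are zero) to keep H small."""
    H = defaultdict(set)
    for blob_id, data in blobs.items():
        for h in rolling_hashes(data):
            H[h].add(blob_id)
    return H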
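
Phase (2), the sparse pair counters. For each hash key, every
unordered pair of blobs in its list gets its counter bumped; keying
the dictionary by (smaller id, larger id) keeps the matrix sparse and
creates counters lazily:

from collections import defaultdict
from itertools import combinations

def build_pair_counters(H):
    """H: hash value -> set of blob ids (from phase (1)).
    Returns C: (blob id a, blob id b), a < b -> shared hash count."""
    C = defaultdict(int)
    for blob_ids in H.values():
        # Quadratic only in the length of this one list.
        for a, b in combinations(sorted(blob_ids), 2):
            C[(a, b)] += 1
    return C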
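
Phase (3), normalization. The shared-hash count is divided by the
number of hashes each blob can generate at all, approximated from its
size, giving the two directed scores per pair. Dividing by this
window-based capacity is my assumption of what "normalize" means
here:

WINDOW = 16   # must match the window size used in phase (1)

def hash_capacity(size):
    """Maximum number of window hashes a blob of `size` bytes yields."""
    return max(size - WINDOW + 1, 1)

def normalize_scores(C, blob_sizes):
    """C: (a, b) -> shared hash count; blob_sizes: blob id -> size.
    Returns scores: (src, dst) -> value in [0, 1], read as 'how well
    src is covered by dst'."""
    scores = {}
    for (a, b), count in C.items():
        scores[(a, b)] = count / hash_capacity(blob_sizes[a])   # A vs B
        scores[(b, a)] = count / hash_capacity(blob_sizes[b])   # B vs A
    return scores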
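
Phase (4), edge selection. A proper solution would be a
maximum-weight branching algorithm (Edmonds'); the sketch below is
only a greedy stand-in: walk the directed edges by descending score,
give each blob at most one delta parent, and skip any edge that would
close a cycle, since a delta chain has to bottom out at a blob stored
in full:

def choose_delta_parents(scores, min_score=0.5):
    """scores: (src, dst) -> similarity of src to dst, from phase (3).
    Returns parent: src -> dst, read as 'store src as a delta against
    dst'. The min_score cut-off is arbitrary."""
    parent = {}

    def creates_cycle(src, dst):
        # Follow the existing parent chain from dst; if it reaches
        # src, the new edge src -> dst would close a loop.
        node = dst
        while node in parent:
            node = parent[node]
            if node == src:
                return True
        return False

    ranked = sorted(scores.items(), key=lambda item: item[1],
                    reverse=True)
    for (src, dst), score in ranked:
        if score < min_score:
            break
        if src in parent or creates_cycle(src, dst):
            continue
        parent[src] = dst
    return parent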

Even with all that I expect that this will take quite some time,
especially for large repositories.

It might be possible to optimize a bit more by flagging the blobs
which took part in such an operation; on the next run we then
compare only the not-yet-flagged blobs against the flagged ones.
That way we would incrementally extend the spanning tree without
touching the existing edges. (A sketch of this follows below as
well.)
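
To illustrate the incremental idea, here is a variant of the phase
(2) counting that only looks at pairs involving at least one new
(not-yet-flagged) blob; again untested and with made-up names:

from collections import defaultdict

def build_incremental_counters(H, flagged):
    """H: hash value -> set of blob ids; flagged: set of blob ids
    that already took part in an earlier run. Only new-vs-flagged
    pairs are counted; pairs among the new blobs could be added the
    same way if new blobs should also delta against each other."""
    C = defaultdict(int)
    for blob_ids in H.values():
        new = [b for b in blob_ids if b not in flagged]
        old = [b for b in blob_ids if b in flagged]
        for a in new:
            for b in old:
                key = (a, b) if a < b else (b, a)
                C[key] += 1
    return C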

(Ad *) The relevant things to compare are the blobs, not files. Many
       revisions of different files can refer to the same blob, if
       they contain identical content.

(Ad **) Could we re-use the rolling hash used by the delta-encoder?
        There it is used to quickly find the parts of two files that
        are similar to each other.

-- 
So long,
        Andreas Kupries <[email protected]>
                        <http://core.tcl.tk/akupries/>
        Developer @     Hewlett Packard Enterprise
-------------------------------------------------------------------------------



