>  From: Joerg Sonnenberger <jo...@bec.de>
>  To: fossil-users@lists.fossil-scm.org
>  Sent: Sunday, 4 December 2016, 20:55
>  Subject: Re: [fossil-users] Bug report: Terrible Performance, when
Checking in LLVM Source
> ...
> No. What repo checksum does is compute a separate checksum over the
> concatenation of all files.
>
> Joerg
> ...

Thank You all for the answers to my previous
bug report emails. I have not yet made up my mind
how to proceed with the large-repo case, but
two things I know for certain:

certain_thing_1)
    Hash algorithms will evolve, and whatever
    hash Fossil uses in 2016_12 for the checksum
    will eventually become obsolete.

certain_thing_2)
    Regardless of what the hash algorithm is,
    there exists at least one scheme that
    allows a checksum of a concatenation of a
    large set of files to be calculated without
    re-calculating the hashes of the files that
    did not change.


The naive and slow version:

    array_of_relative_file_paths=[ file_1, file_2, ..., file_N ]
    blob_1=null
    i_len=array_of_relative_file_paths.length
    if 0<i_len{
        for 0 .. (i_len-1) |ix|{
            s_fp=array_of_relative_file_paths[ix]
            x_file_bytestream=read_from_disk(s_fp)

            blob_1=concat(blob_1, x_file_bytestream)
        } // for
    } // if
    checksum_of_all_files=hash(blob_1)

    // The moment there is a new file, file_<N+1>,
    // blob_1 needs to be, if not re-allocated like
    // in the above pseudocode, then at least fed into hash(...)
    // in some streaming fashion, and hash(...) has to
    // re-process file_1, file_2, ... from scratch.
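
For reference, here is the naive version as runnable Python
(a sketch of mine, not Fossil's actual code; the function name
is made up, and hashlib.sha256 merely stands in for whatever
hash Fossil ends up using):

    import hashlib

    def naive_repo_checksum(array_of_relative_file_paths):
        # Feed the concatenation of all files into one hash
        # object in a streaming fashion, so blob_1 never has
        # to be materialized in memory.
        h = hashlib.sha256()
        for s_fp in array_of_relative_file_paths:
            with open(s_fp, "rb") as f:
                for chunk in iter(lambda: f.read(65536), b""):
                    h.update(chunk)
        return h.hexdigest()

    # Still O(total bytes) on every call: adding file_<N+1>
    # forces re-reading and re-hashing file_1 .. file_N.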


The gist of the proposed idea is to place the
hashes into a tree and then hash both the files
and the hashes of the files (essentially a Merkle
tree), so that after the collection of files has
changed, the hash function only has to be re-run
on the changed files and on the nodes along their
paths to the root:

    ob_AVL_tree=AVL_tree_class.new

    function_instance_1=(f_arg_ob_node_reference,f_arg_x_file_hash){
        x_hash_left=hash(null)
        x_hash_right=hash(null)

        ob_child_left=f_arg_ob_node_reference.get_child_left()
        ob_child_right=f_arg_ob_node_reference.get_child_right()
        if ob_child_left != null {
            x_hash_left=ob_child_left.record.x_node_hash
        } // if
        if ob_child_right != null {
            x_hash_right=ob_child_right.record.x_node_hash
        } // if

        x_bytestream=concat(x_hash_left,
                            "_separator_",
                            x_hash_right,
                            "_separator_",
                            f_arg_x_file_hash)
        return hash(x_bytestream)
        }

    array_of_relative_file_paths=[ file_1, file_2, ..., file_N ]
    i_len=array_of_relative_file_paths.length
    if 0<i_len{
        for 0 .. (i_len-1) |ix|{
            s_fp=array_of_relative_file_paths[ix]
            x_file_bytestream=read_from_disk(s_fp)

            x_file_hash=hash(x_file_bytestream)
            ob_AVL_tree.create_or_overwrite_node(s_fp)
            ob_node_reference=ob_AVL_tree.get_node_reference(s_fp)
            ob_node_reference.record.x_file_hash=x_file_hash

            // Set this node's hash with the same rule that
            // the loop below uses, so that leaves and inner
            // nodes hash uniformly: in function_instance_1 a
            // missing child contributes hash(null).
            ob_node_reference.record.x_node_hash=
                function_instance_1.call(ob_node_reference,x_file_hash)

            array_of_nodes_with_wrong_x_node_hash=unique_by_node_ID(
              clone(
              ob_AVL_tree.array_of_nodes_that_had_changes_on_path_2_root
              ).concat(clone(
              ob_AVL_tree.array_of_nodes_that_have_changed_children
              // A change within the children means that any
              // of the children changed during the insertion
              // of nodes into, or removal of nodes from, the
              // ob_AVL_tree, including changes caused by the
              // (automatic) rebalancing of the AVL tree.
              // A change between null and an existing child is
              // also considered a change.
              ))
              ).sort_by_path_length_from_node_to_root_node

            loop_over_array_of_nodes_with_wrong_x_node_hash{ |ob_node|
                // starting with the elements that have the
                // longest path to the root node

                x_1=ob_node.record.x_file_hash
                x_2=function_instance_1.call(ob_node,x_1)
                ob_node.record.x_node_hash=x_2
                } // loop

        } // for
    } // if
    checksum_of_all_files=hash(null)
    if 0 < ob_AVL_tree.number_of_nodes {
        ob_node=ob_AVL_tree.get_root_node()
        checksum_of_all_files=ob_node.record.x_node_hash
    } //if
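
I haven't implemented the AVL bookkeeping, but the hashing
rule itself can be shown as a small runnable Python sketch.
Two simplifications relative to the pseudocode above, clearly
my own: the set of file paths is fixed up front, so a plain
array-backed complete binary tree replaces the self-rebalancing
AVL tree, and hashlib.sha256 again stands in for the real hash.
Each node still carries one file hash, and a node's hash is
hash(left + separator + right + separator + file_hash), exactly
as in function_instance_1:

    import hashlib

    SEP = b"_separator_"

    def h(data=b""):
        # h() with no argument plays the role of hash(null).
        return hashlib.sha256(data).digest()

    class IncrementalRepoChecksum:
        # One node per file in a 1-indexed heap array:
        # node ix has its children at 2*ix and 2*ix+1.
        def __init__(self, file_hashes):
            self.n = len(file_hashes)
            self.x_file_hash = [None] + list(file_hashes)
            self.x_node_hash = [None] * (self.n + 1)
            for ix in range(self.n, 0, -1):
                self._rehash(ix)

        def _rehash(self, ix):
            # The same combining rule as function_instance_1:
            # a missing child contributes hash(null).
            left = self.x_node_hash[2*ix] if 2*ix <= self.n else h()
            right = self.x_node_hash[2*ix+1] if 2*ix+1 <= self.n else h()
            self.x_node_hash[ix] = h(left + SEP + right + SEP +
                                     self.x_file_hash[ix])

        def update(self, ix, new_file_hash):
            # Re-hash only the path from node ix up to the
            # root: O(log N) hash calls, not O(N).
            self.x_file_hash[ix] = new_file_hash
            while ix >= 1:
                self._rehash(ix)
                ix //= 2

        def root(self):
            # The root hash is the repository-wide checksum.
            return self.x_node_hash[1] if self.n else h()

    # Example with in-memory contents (node ix <-> file_ix):
    repo = IncrementalRepoChecksum(
        [h(b"contents of file_%d" % ix) for ix in range(1, 8)])
    before = repo.root()
    repo.update(3, h(b"new contents of file_3"))  # re-hashes nodes 3, 1
    assert repo.root() != before

The part this sketch leaves out is exactly what the AVL tree
buys: inserting or deleting a path here would mean rebuilding
the whole array, whereas the rebalancing tree keeps insertions
and removals at O(log N) as well.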


I haven't tested it, but if it works, then
it should allow repository checksums to be
calculated, hopefully, much more efficiently,
with almost no compromise on the quality
of the checksum.

Of course, it's easier for me to talk/write/fantasize
here than for anyone to actually implement it, but the
example does give a reason to suspect that repository-wide
checksums can be calculated efficiently in at least some
cases.

That would be my feature request. :-)

Thank You.