> From: Joerg Sonnenberger <jo...@bec.de>
> To: fossil-users@lists.fossil-scm.org
> Sent: Sunday, December 4, 2016, 8:55 PM
> Subject: Re: [fossil-users] Bug report: Terrible Performance, when Checking in LLVM Source
> ...
> No. What repo checksum does is compute a separate checksum over the
> concatenation of all files.
>
> Joerg
> ...
Thank you all for the answers to my previous bug report letters. I have not yet made up my mind about how to proceed with the large-repository case, but 2 things I know for certain:

certain_thing_1) Hash algorithms will evolve, and whatever hash Fossil uses for the checksum in 2016_12 will eventually become obsolete.

certain_thing_2) Regardless of what the hash algorithm is, there exists at least one scheme that allows a checksum of a concatenation of a large set of files to be calculated without re-calculating the hashes of the files that did not change.

The naive and slow version:

    array_of_relative_file_paths=[ file_1, file_2, ..., file_N ]
    blob_1=null
    i_len=array_of_relative_file_paths.length
    if 0<i_len {
        for 0 .. (i_len-1) |ix| {
            s_fp=array_of_relative_file_paths[ix]
            x_file_bytestream=read_from_disk(s_fp)
            blob_1=concat(blob_1, x_file_bytestream)
        } // for
    } // if
    checksum_of_all_files=hash(blob_1)

    // The moment there is a new file, file_<N+1>,
    // blob_1 needs to be, if not allocated like
    // in the above pseudocode, then at least fed into hash(...)
    // in some streaming fashion, and hash(...) has to
    // re-process file_1, file_2, ...
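For concreteness, the naive version above might look like the following Python sketch. The function name and the use of a dict of in-memory contents are my own illustration, not Fossil's actual implementation; the point is only that every call re-hashes every file, changed or not.

```python
import hashlib

def naive_repo_checksum(files):
    """files: dict mapping relative path -> file content (bytes).

    Hashes the concatenation of all contents in sorted path order,
    so the result is deterministic. Every call re-processes every
    file, even the ones that did not change -- the slow part.
    """
    h = hashlib.sha256()
    for path in sorted(files):
        h.update(files[path])
    return h.hexdigest()
```

Because the hash runs over the full concatenation, adding or changing even one file forces all N files through the hash function again.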
The gist of the proposed idea is to place the various hashes in a tree and then hash both the files and the hashes of the files, giving a probabilistic opportunity to avoid running the hash function on all of the files after the collection of files has changed:

    ob_AVL_tree=AVL_tree_class.new

    function_instance_1=(f_arg_ob_node_reference, f_arg_x_file_hash){
        x_hash_left=hash(null)
        x_hash_right=hash(null)
        ob_child_left=f_arg_ob_node_reference.get_child_left()
        ob_child_right=f_arg_ob_node_reference.get_child_right()
        if ob_child_left != null {
            x_hash_left=ob_child_left.record.x_node_hash
        } // if
        if ob_child_right != null {
            x_hash_right=ob_child_right.record.x_node_hash
        } // if
        x_bytestream=concat(x_hash_left, "_separator_", x_hash_right,
            "_separator_", f_arg_x_file_hash)
        return hash(x_bytestream)
    }

    array_of_relative_file_paths=[ file_1, file_2, ..., file_N ]
    i_len=array_of_relative_file_paths.length
    if 0<i_len {
        for 0 .. (i_len-1) |ix| {
            s_fp=array_of_relative_file_paths[ix]
            x_file_bytestream=read_from_disk(s_fp)
            x_file_hash=hash(x_file_bytestream)
            ob_AVL_tree.create_or_overwrite_node(s_fp)
            ob_node_reference=ob_AVL_tree.get_node_reference(s_fp)
            ob_node_reference.record.x_file_hash=x_file_hash
            if ob_node_reference.b_is_leaf {
                ob_node_reference.record.x_node_hash=x_file_hash
            } else {
                // The loop below will handle the case.
            } // else
            array_of_nodes_with_wrong_x_node_hash=unique_by_node_ID(
                clone(
                    ob_AVL_tree.array_of_nodes_that_had_changes_on_path_2_root
                ).concat(clone(
                    ob_AVL_tree.array_of_nodes_that_have_changed_children
                    // A change within the children means that any
                    // of the children changed during the insertion
                    // or removal of nodes to/from the ob_AVL_tree
                    // after the AVL tree got (automatically) rebalanced.
                    // A change between null and an existing child is
                    // also considered a change.
                ))
            ).sort_by_path_length_from_node_to_root_node
            loop_over_array_of_nodes_with_wrong_x_node_hash{ |ob_node|
                // starting with the elements that have the
                // longest path to the root node
                x_1=ob_node.record.x_file_hash
                x_2=function_instance_1.call(ob_node, x_1)
                ob_node.record.x_node_hash=x_2
            } // loop
        } // for
    } // if

    checksum_of_all_files=hash(null)
    if 0 < ob_AVL_tree.number_of_nodes {
        ob_node=ob_AVL_tree.get_root_node()
        checksum_of_all_files=ob_node.record.x_node_hash
    } // if

I haven't tested it, but if it works, it should allow repository checksums to be calculated much more efficiently, with almost no compromise on the quality of the checksum. Of course, it is easier for me to talk/write/fantasize here than for anyone to actually implement it, but the example does give reason to suspect that repository-wide checksums can be calculated efficiently, at least in some cases.

That would be my feature request. :-)

Thank You.

_______________________________________________
fossil-users mailing list
fossil-users@lists.fossil-scm.org
http://lists.fossil-scm.org:8080/cgi-bin/mailman/listinfo/fossil-users
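P.S. The tree-of-hashes idea above can be sketched in a few lines of Python. To keep it short, this sketch uses a plain binary Merkle layout over a fixed list of file contents instead of an AVL tree keyed by file path; the `MerkleChecksum` class and its method names are my own illustration, not Fossil code. The essential property is the same: after one file changes, only the hashes on its path to the root are recomputed.

```python
import hashlib

def _h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

class MerkleChecksum:
    """Repository-wide checksum as the root of a tree of hashes.

    levels[0] holds the per-file (leaf) hashes; each higher level
    hashes adjacent pairs from the level below; levels[-1][0] is
    the root, used as the repository checksum.
    """

    def __init__(self, file_contents):
        self.levels = [[_h(c) for c in file_contents]]
        while len(self.levels[-1]) > 1:
            prev = self.levels[-1]
            self.levels.append([
                _h(prev[i] + (prev[i + 1] if i + 1 < len(prev) else b""))
                for i in range(0, len(prev), 2)
            ])

    def root(self) -> str:
        return self.levels[-1][0].hex()

    def update(self, index: int, new_content: bytes) -> None:
        # Re-hash the one changed file, then only the O(log N)
        # internal nodes on its path to the root; every other
        # stored hash is reused unchanged.
        self.levels[0][index] = _h(new_content)
        for depth in range(1, len(self.levels)):
            index //= 2
            prev = self.levels[depth - 1]
            i = 2 * index
            pair = prev[i] + (prev[i + 1] if i + 1 < len(prev) else b"")
            self.levels[depth][index] = _h(pair)
```

An incremental `update` of one file yields the same root as rebuilding the whole tree from scratch, which is what makes the scheme usable as a repository checksum.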