Re: [fossil-users] Repo Checksum Speedup Idea: flaw in my comment
On Mon, 5 Dec 2016 20:23:50 +0200
Martin Vahi wrote:

> As it turns out, I already made a mistake
> at the tree based algorithm.
>
> The old, proposed, flawed version:
>
> > ...
> > array_of_nodes_with_wrong_x_node_hash=unique_by_node_ID(
> >     clone(
> >         ob_AVL_tree.array_of_nodes_that_had_changes_on_path_2_root
> >     ).concat(clone(
> >         ob_AVL_tree.array_of_nodes_that_have_changed_children
> >         // A change within children means that any of the
> >         // children changed during the insertion or removal of
> >         // nodes to/from the ob_AVL_tree after the AVL-tree got
> >         // (automatically) rebalanced. A change between null and
> >         // an existing child is also considered a change.
> >     ))
> > ).sort_by_path_length_from_node_to_root_node
> > ...
>
> A newer version looks only at the changes within children:
>
>     array_of_nodes_with_wrong_x_node_hash=unique_by_node_ID(
>         clone(
>             ob_AVL_tree.array_of_nodes_that_have_changed_children
>         )).sort_by_path_length_from_node_to_root_node

Do you mean a Tree Hash like in DC/ADC, BitTorrent, eMule/eDonkey?
They use TTH (Tiger Tree Hash), computing hashes of chunks/file parts
and from those calculating hashes of groups of hashes. This way they
know in log N steps which chunk is corrupted and download it again.

If yes, I don't think it could replace the current checksum; it
solves a different problem. If no, I'll reread your mails.

> Thank You.

---   ---
Eduardo Morras

___
fossil-users mailing list
fossil-users@lists.fossil-scm.org
http://lists.fossil-scm.org:8080/cgi-bin/mailman/listinfo/fossil-users
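As an illustration of the chunk-level tree hash described above, here
is a minimal Python sketch. It is only a sketch of the idea: SHA-256
stands in for the Tiger hash, real TTH's distinct leaf/node prefixes
are omitted, and the chunk data is made up. Chunk hashes form the
leaves, neighbouring hashes are hashed pairwise up to a single root,
and one corrupted chunk can be located by descending from the root in
about log2(N) comparisons:

    import hashlib

    def _h(data: bytes) -> bytes:
        # Stand-in hash; real TTH uses the Tiger hash and distinct
        # prefixes for leaf and internal nodes.
        return hashlib.sha256(data).digest()

    def build_levels(chunks):
        """Hash tree as a list of levels: levels[0] holds the leaf
        (chunk) hashes, the last level holds only the root hash."""
        level = [_h(c) for c in chunks]
        levels = [level]
        while len(level) > 1:
            level = [_h(b"".join(level[i:i + 2]))
                     for i in range(0, len(level), 2)]
            levels.append(level)
        return levels

    def first_bad_chunk(trusted_levels, chunks):
        """Descend from the root towards the leaves, following the
        child whose hash disagrees with the trusted tree, so one
        corrupted chunk is found in about log2(N) comparisons."""
        local_levels = build_levels(chunks)
        if trusted_levels[-1] == local_levels[-1]:
            return None          # root hashes agree: nothing corrupted
        i = 0                    # node index within the current level
        for depth in range(len(trusted_levels) - 2, -1, -1):
            left = 2 * i         # children of node i sit at 2i, 2i+1
            if trusted_levels[depth][left] != local_levels[depth][left]:
                i = left
            else:
                i = left + 1
        return i                 # index of the corrupted chunk

    chunks = [b"chunk-%d" % n for n in range(8)]
    trusted = build_levels(chunks)
    chunks[5] = b"corrupted"
    print(first_bad_chunk(trusted, chunks))   # prints 5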
[fossil-users] Repo Checksum Speedup Idea: flaw in my comment
As it turns out, I already made a mistake at the tree based
algorithm.

The old, proposed, flawed version:

> ...
> array_of_nodes_with_wrong_x_node_hash=unique_by_node_ID(
>     clone(
>         ob_AVL_tree.array_of_nodes_that_had_changes_on_path_2_root
>     ).concat(clone(
>         ob_AVL_tree.array_of_nodes_that_have_changed_children
>         // A change within children means that any of the
>         // children changed during the insertion or removal of
>         // nodes to/from the ob_AVL_tree after the AVL-tree got
>         // (automatically) rebalanced. A change between null and
>         // an existing child is also considered a change.
>     ))
> ).sort_by_path_length_from_node_to_root_node
> ...

A newer version looks only at the changes within children:

    array_of_nodes_with_wrong_x_node_hash=unique_by_node_ID(
        clone(
            ob_AVL_tree.array_of_nodes_that_have_changed_children
        )).sort_by_path_length_from_node_to_root_node

Thank You.
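One possible shape for that corrected repair pass, as a Python sketch
rather than anything from the Fossil code: the parent references, the
per-node depth field and the upward dirty-marking are my assumptions,
since the snippet above only shows how the initial set of nodes is
collected. Recomputing the deepest nodes first and marking each
recomputed node's parent dirty propagates every change up to the root
without touching clean subtrees:

    import hashlib

    def _h(data: bytes) -> bytes:
        return hashlib.sha256(data).digest()

    EMPTY = _h(b"")                  # stands in for hash(null)

    class Node:
        def __init__(self, file_hash: bytes):
            self.file_hash = file_hash
            self.left = None         # child references, relinked by the
            self.right = None        # AVL insert/remove and rebalancing
            self.parent = None
            self.depth = 0           # path length from node to root
            self.node_hash = b""

    def recompute_node_hash(node):
        # Mirrors function_instance_1: combine the two child node
        # hashes and the node's own file hash into one node hash.
        left = node.left.node_hash if node.left else EMPTY
        right = node.right.node_hash if node.right else EMPTY
        node.node_hash = _h(left + b"_separator_" +
                            right + b"_separator_" + node.file_hash)

    def repair_hashes(nodes_with_changed_children):
        """The argument is the deduplicated set of nodes whose child
        references changed during the last insert/remove plus
        rebalancing (the result of unique_by_node_ID above)."""
        dirty = set(nodes_with_changed_children)
        while dirty:
            node = max(dirty, key=lambda n: n.depth)   # deepest first
            dirty.discard(node)
            recompute_node_hash(node)
            if node.parent is not None:
                dirty.add(node.parent)   # its input changed as well

    # Tiny demo: a root whose left child's file hash just changed.
    root, child = Node(_h(b"root file")), Node(_h(b"child file"))
    root.left, child.parent, child.depth = child, root, 1
    repair_hashes({child})
    print(root.node_hash.hex()[:16])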
[fossil-users] Repo Checksum Speedup Idea
> From: Joerg Sonnenberger
> To: fossil-users@lists.fossil-scm.org
> Sent: Sunday, 4 December 2016, 20:55
> Subject: Re: [fossil-users] Bug report: Terrible Performance, when Checking in LLVM Source
> ...
> No. What repo checksum does is compute a separate checksum over the
> concatenation of all files.
>
> Joerg
> ...

Thank You all for the answers to my previous bug report letters. I
have not made up my mind how to proceed with the large repo case, but
two things I know for certain:

certain_thing_1) Hash algorithms will evolve and whatever Fossil uses
in 2016_12 for the checksum will become obsolete.

certain_thing_2) Regardless of what the hash algorithm is, there
exists at least one solution that allows calculating the checksum of
a concatenation of a large set of files without re-calculating the
hashes of the files that did not change.

The naive and slow version:

    array_of_relative_file_paths=[file_1, file_2, ..., file_N]
    blob_1=null
    i_len=array_of_relative_file_paths.length
    if 0 < i_len {
        blob_1=allocate_empty_blob()
        for i_ in 0..(i_len - 1) {
            blob_1.append(read_file(array_of_relative_file_paths[i_]))
        } // for
        x_checksum=hash(blob_1)
    } // if
    // The blob_1 needs to be, if not allocated like in the above
    // pseudocode, then at least fed to the hash(...) in some stream
    // fashion, and the hash(...) has to re-process the file_1,
    // file_2, ... even when only one of the files changed.

The gist of the proposed idea is to place the various hashes into a
tree and then hash both the files and the hashes of the files, giving
a probabilistic opportunity to avoid running the hash function on all
of the files after the collection of files has changed:

    ob_AVL_tree=AVL_tree_class.new
    function_instance_1=(f_arg_ob_node_reference, f_arg_x_file_hash){
        x_hash_left=hash(null)
        x_hash_right=hash(null)
        ob_child_left=f_arg_ob_node_reference.get_child_left()
        ob_child_right=f_arg_ob_node_reference.get_child_right()
        if ob_child_left != null {
            x_hash_left=ob_child_left.record.x_node_hash
        } // if
        if ob_child_right != null {
            x_hash_right=ob_child_right.record.x_node_hash
        } // if
        x_bytestream=concat(x_hash_left, "_separator_",
                            x_hash_right, "_separator_",
                            f_arg_x_file_hash)
        return hash(x_bytestream)
    }

    array_of_relative_file_paths=[file_1, file_2, ..., file_N]
    i_len=array_of_relative_file_paths.length
    if 0 < i_len {
        for i_ in 0..(i_len - 1) {
            x_relative_file_path=array_of_relative_file_paths[i_]
            x_file_hash=hash(read_file(x_relative_file_path))
            ob_AVL_tree.insert(x_relative_file_path, x_file_hash)
        } // for
        // The x_node_hash of every node is calculated with the
        // function_instance_1, and the checksum of the whole set of
        // files is the x_node_hash of the root node of the
        // ob_AVL_tree.
    } // if
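To make the intended end result concrete, here is a minimal runnable
Python sketch. It is not from the mail above and simplifies the
proposal in two ways: the tree shape is a fixed balanced binary
search tree over the sorted file paths, so the AVL insertion, removal
and rebalancing are left out, and SHA-256 stands in for whatever hash
function Fossil would use. Each node hashes its two child hashes plus
one file's hash, like function_instance_1 above, and the root hash
plays the role of the repo checksum; updating one file re-hashes only
the nodes on its root-to-leaf path:

    import hashlib

    def _h(data: bytes) -> bytes:
        return hashlib.sha256(data).digest()

    EMPTY = _h(b"")                      # stands in for hash(null)
    SEP = b"_separator_"

    class HashTree:
        """Balanced BST over a fixed, sorted list of file paths.
        cache[(lo, hi)] holds the node hash of the subtree covering
        paths[lo:hi]; the node's own file is paths[(lo + hi) // 2]."""

        def __init__(self, path_to_file_hash):
            self.paths = sorted(path_to_file_hash)
            self.file_hash = dict(path_to_file_hash)
            self.cache = {}
            self.root_hash = self._node(0, len(self.paths))

        def _node(self, lo, hi):
            if lo >= hi:
                return EMPTY
            if (lo, hi) not in self.cache:
                mid = (lo + hi) // 2
                self.cache[(lo, hi)] = _h(
                    self._node(lo, mid) + SEP +
                    self._node(mid + 1, hi) + SEP +
                    self.file_hash[self.paths[mid]])
            return self.cache[(lo, hi)]

        def update(self, path, new_file_hash):
            # Drop the cached hashes of exactly the ranges on the
            # root-to-leaf search path of `path`; everything else
            # stays cached, so recomputation costs O(log N) hashes.
            self.file_hash[path] = new_file_hash
            lo, hi = 0, len(self.paths)
            while lo < hi:
                self.cache.pop((lo, hi), None)
                mid = (lo + hi) // 2
                if path == self.paths[mid]:
                    break
                lo, hi = (lo, mid) if path < self.paths[mid] \
                    else (mid + 1, hi)
            self.root_hash = self._node(0, len(self.paths))

    tree = HashTree({"README": _h(b"docs"),
                     "src/main.c": _h(b"old body"),
                     "src/util.c": _h(b"helpers")})
    x_checksum_before = tree.root_hash
    tree.update("src/main.c", _h(b"new body"))
    print(tree.root_hash != x_checksum_before)   # prints True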