Re: [fossil-users] Repo Checksum Speedup Idea: flaw in my comment

2016-12-06 Thread Eduardo Morras
On Mon, 5 Dec 2016 20:23:50 +0200
Martin Vahi  wrote:

> As it turns out, I have already made a mistake
> in the tree-based algorithm.
> 
> The old, proposed, flawed version:
> > ...
> > array_of_nodes_with_wrong_x_node_hash=unique_by_node_ID(
> >   clone(
> >   ob_AVL_tree.array_of_nodes_that_had_changes_on_path_2_root
> >   ).concat(clone(
> >   ob_AVL_tree.array_of_nodes_that_have_changed_children
> >   // change within children means that any
> >   // of the children changed during the insertion
> >   // or removal of nodes to/from the ob_AVL_tree
> >   // after the AVL-tree got (automatically) rebalanced.
> >   // A change between null and an existing child is
> >   // also considered a change.
> >   ))
> >   ).sort_by_path_length_from_node_to_root_node
> > 
> > ...
> 
> A newer version looks only at the changes within
> the children:
> 
> array_of_nodes_with_wrong_x_node_hash=unique_by_node_ID(
>   clone(
>   ob_AVL_tree.array_of_nodes_that_have_changed_children
>   )).sort_by_path_length_from_node_to_root_node

Do you mean a Tree Hash like in dc/adc, bittorrent, emule/edonkey? They use 
TTH (Tiger Tree Hash), computing hashes of chunks/file parts, and from those 
calculate hashes of groups of hashes. This way they know in logN steps which 
chunk is corrupted, and download it again. 
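
For reference, a rough Python sketch of that kind of hash tree (the general
idea only, not TTH's or Fossil's exact construction; the chunk size and SHA-1
are arbitrary choices of mine): chunk hashes form the leaves, each parent
hashes the concatenation of its children's hashes, and comparing a good tree
against a bad one descends only the mismatching branch, so the corrupted chunk
is found in about log2(N) comparisons.

import hashlib

def _h(data):
    # Any cryptographic hash works here; SHA-1 is only an example.
    return hashlib.sha1(data).digest()

def chunk_hashes(data, chunk_size=4):
    # Leaf hashes: one hash per fixed-size chunk of the data.
    return [_h(data[i:i + chunk_size]) for i in range(0, len(data), chunk_size)]

def build_tree(leaves):
    # Build bottom-up: each parent hashes the concatenation of its children.
    levels = [leaves]
    while len(levels[-1]) > 1:
        below = levels[-1]
        levels.append([_h(b"".join(below[i:i + 2]))
                       for i in range(0, len(below), 2)])
    return levels                     # levels[-1][0] is the root hash

def find_corrupted_chunk(good, bad):
    # Descend from the root, always following the child whose hash differs.
    # Assumes both trees have the same shape and exactly one chunk differs,
    # so this takes about log2(number of chunks) comparisons.
    if good[-1][0] == bad[-1][0]:
        return None                   # root hashes match: nothing is corrupted
    index = 0
    for level in range(len(good) - 2, -1, -1):
        left = 2 * index
        index = left if good[level][left] != bad[level][left] else left + 1
    return index                      # leaf index == corrupted chunk index

original = b"The quick brown fox jumps over the lazy dog."
damaged = bytearray(original)
damaged[18] ^= 0xFF                   # corrupt one byte inside chunk 4
good_tree = build_tree(chunk_hashes(original))
bad_tree = build_tree(chunk_hashes(bytes(damaged)))
print(find_corrupted_chunk(good_tree, bad_tree))   # prints 4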

If yes, I don't think it could replace the current checksum. It solves a 
different problem.

If no, I'll reread your mails.

> Thank You.


---   ---
Eduardo Morras 


[fossil-users] Repo Checksum Speedup Idea: flaw in my comment

2016-12-05 Thread Martin Vahi
As it turns out, I have already made a mistake
in the tree-based algorithm.

The old, proposed, flawed version:
> ...
> array_of_nodes_with_wrong_x_node_hash=unique_by_node_ID(
>   clone(
>   ob_AVL_tree.array_of_nodes_that_had_changes_on_path_2_root
>   ).concat(clone(
>   ob_AVL_tree.array_of_nodes_that_have_changed_children
>   // change within children means that any
>   // of the children changed during the insertion
>   // or removal of nodes to/from the ob_AVL_tree
>   // after the AVL-tree got (automatically) rebalanced.
>   // A change between null and an existing child is
>   // also considered a change.
>   ))
>   ).sort_by_path_length_from_node_to_root_node
> 
> ...

A newer version looks only at the changes within
the children:

array_of_nodes_with_wrong_x_node_hash=unique_by_node_ID(
  clone(
  ob_AVL_tree.array_of_nodes_that_have_changed_children
  )).sort_by_path_length_from_node_to_root_node
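
A rough Python rendering of that recipe, under my own names (this is not
Fossil code, and the AVL insert/rebalance logic that flags the nodes is left
out): the caller passes in every node whose children changed, the list is
deduplicated by node identity, sorted so that the nodes farthest from the root
come first, and each node's combined hash is then refreshed from its
children's already-refreshed hashes.

import hashlib

NULL_HASH = hashlib.sha1(b"").digest()    # stands in for hash(null)

class Node:
    # Toy tree node: x_node_hash combines the node's own file hash with the
    # current x_node_hash of its (up to two) children.
    def __init__(self, file_hash, parent=None):
        self.file_hash = file_hash
        self.left = None
        self.right = None
        self.parent = parent
        self.x_node_hash = None

def path_length_to_root(node):
    depth = 0
    while node.parent is not None:
        node = node.parent
        depth += 1
    return depth

def recompute_x_node_hash(node):
    # Mirrors the function_instance_1 pseudocode from the earlier proposal:
    # a missing child contributes hash(null).
    left = node.left.x_node_hash if node.left else NULL_HASH
    right = node.right.x_node_hash if node.right else NULL_HASH
    node.x_node_hash = hashlib.sha1(
        left + b"_separator_" + right + b"_separator_" + node.file_hash
    ).digest()

def refresh(array_of_nodes_that_have_changed_children):
    # unique_by_node_ID(...)
    unique = {id(n): n for n in array_of_nodes_that_have_changed_children}.values()
    # sort_by_path_length_from_node_to_root_node, deepest nodes first, so
    # every child is refreshed before any of its ancestors reads its hash.
    # Assumed here: the list also contains every ancestor of a changed node,
    # since refreshing a child changes the parent's input.
    for node in sorted(unique, key=path_length_to_root, reverse=True):
        recompute_x_node_hash(node)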

Thank You.



[fossil-users] Repo Checksum Speedup Idea

2016-12-05 Thread Martin Vahi
>   From: Joerg Sonnenberger 
>  To: fossil-users@lists.fossil-scm.org
>  Sent: Sunday, 4 December 2016, 20:55
>  Subject: Re: [fossil-users] Bug report: Terrible Performance, when
>  Checking in LLVM Source
> ...
> No. What repo checksum does is compute a separate checksum over the
> concatenation of all files.
>
> Joerg
> ...

Thank You all for the answers to my previous
bug report letters. I have not made up my mind
how to proceed with the large repo case, but
two things I know for certain:

certain_thing_1)
Hash algorithms will evolve, and whatever
Fossil uses in 2016_12 for the checksum
will become obsolete.

certain_thing_2)
Regardless of what the hash algorithm is,
there exists at least one solution that
allows calculating a checksum of a
concatenation of a large set of files
without re-calculating the hashes of
the files that did not change.


The naive and slow version:

array_of_relative_file_paths=[ file_1, file_2, ..., file_N ]
blob_1=null
i_len=array_of_relative_file_paths.length
if 0 < i_len {
    for i_k in [0 .. (i_len-1)] {
        blob_1=concat(blob_1,
            read_file(array_of_relative_file_paths[i_k]))
    } // for
} // if
x_checksum=hash(blob_1)
// the blob_1 needs to be, if not allocated like
// in the above pseudocode, then at least fed in to the hash(...)
// in some stream fashion and the hash(...) has to
// re-process the file_1, file_2, ...
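
Translated into Python (a sketch, not Fossil's actual implementation; the
64 KiB block size and SHA-1 are arbitrary choices), the naive version can
stream every file into one hash object instead of building blob_1 in memory,
but it still re-reads and re-hashes every file on every run:

import hashlib

def naive_repo_checksum(array_of_relative_file_paths):
    # Every file is re-read and re-hashed on every call, regardless of
    # whether it changed since the previous checksum run.
    hasher = hashlib.sha1()
    for path in array_of_relative_file_paths:
        with open(path, "rb") as f:
            # Feed the file to the hash in a streaming fashion instead of
            # materialising the whole concatenation as one blob_1 in memory.
            for block in iter(lambda: f.read(65536), b""):
                hasher.update(block)
    return hasher.hexdigest()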


The gist of the proposed idea is to place
various hashes into a tree and then hash both the files
and the hashes of the files, giving a probabilistic opportunity
to avoid running the hash function on all of the files
after the collection of files has changed:

ob_AVL_tree=AVL_tree_class.new

function_instance_1=(f_arg_ob_node_reference,f_arg_x_file_hash){
    x_hash_left=hash(null)
    x_hash_right=hash(null)

    ob_child_left=f_arg_ob_node_reference.get_child_left()
    ob_child_right=f_arg_ob_node_reference.get_child_right()
    if ob_child_left != null {
        x_hash_left=ob_child_left.record.x_node_hash
    } // if
    if ob_child_right != null {
        x_hash_right=ob_child_right.record.x_node_hash
    } // if

    x_bytestream=concat(x_hash_left,
                        "_separator_",
                        x_hash_right,
                        "_separator_",
                        f_arg_x_file_hash)
    return hash(x_bytestream)
}

array_of_relative_file_paths=[ file_1, file_2, ..., file_N ]
i_len=array_of_relative_file_paths.length
if 0 < i_len {
...
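
The archived message is cut off here. To illustrate where the idea leads,
here is a rough, self-contained Python sketch (my own simplification, not the
proposed AVL-based pseudocode: per-file hashes sit only at the leaves of an
array-backed complete binary tree, and all names are hypothetical): internal
nodes hash the concatenation of their children's hashes, the root hash serves
as the repo checksum, and changing one file re-hashes only the roughly
log2(N) nodes on that leaf's path to the root.

import hashlib

def _h(data):
    # Any hash function works; SHA-1 is only an example.
    return hashlib.sha1(data).digest()

def build_node_hashes(file_hashes):
    # Array-backed complete binary tree: node i has children 2*i+1 and 2*i+2.
    # Per-file hashes sit at the leaves; unused leaves keep hash(null).
    leaf_count = 1
    while leaf_count < len(file_hashes):
        leaf_count *= 2
    tree = [_h(b"")] * (2 * leaf_count - 1)
    leaf0 = leaf_count - 1                    # index of the first leaf
    for i, fh in enumerate(file_hashes):
        tree[leaf0 + i] = fh
    for i in range(leaf0 - 1, -1, -1):
        tree[i] = _h(tree[2 * i + 1] + b"_separator_" + tree[2 * i + 2])
    return tree, leaf0

def update_one_file(tree, leaf0, file_index, new_file_hash):
    # Re-hash only the nodes on the changed leaf's path to the root:
    # roughly log2(N) hash calls instead of N.
    i = leaf0 + file_index
    tree[i] = new_file_hash
    while i > 0:
        i = (i - 1) // 2
        tree[i] = _h(tree[2 * i + 1] + b"_separator_" + tree[2 * i + 2])
    return tree[0]                            # root hash plays the checksum role

# Usage: 8 pretend per-file hashes, then one file changes.
file_hashes = [_h(b"content of file_%d" % i) for i in range(8)]
tree, leaf0 = build_node_hashes(file_hashes)
checksum_before = tree[0]
checksum_after = update_one_file(tree, leaf0, 3, _h(b"new content of file_3"))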