Paul J Stevens wrote:
Matija Grabnar wrote:
I reiterate: regardless of which digest algorithm is chosen, the code
MUST be able to detect and correctly handle collisions. Collisions WILL
occur, regardless of the algorithm chosen; that is a mathematically
provable fact.
For those of you who have been following this discussion: I've now done
exactly that.
- We now use the cryptographic hash only to quickly locate possibly
duplicate MIME parts. If the hash doesn't occur yet, a new MIME part is
stored under that hash, with an auto-increment bigint generated as its
primary key. If the hash does occur, the insertion code compares the
blobs to make sure two different blobs with colliding hashes are never
treated as the same part.
- I've added support for a whole load of hashes: we now support MD5,
SHA-1, SHA-256, SHA-512, Tiger and Whirlpool. Since I'm relying on mhash
for this, it would be trivial to add other hashes such as GOST, but I'm
currently restricting things to the ones documented on the NESSIE (EU)
pages. Looking back, adding all of these was probably not strictly
necessary for single-instance storage, but libmhash is rock-solid and
widely available, and I have a hunch the extra algorithms might come in
handy down the road.
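The insert-or-compare step described above can be sketched roughly as
follows. This is an in-memory illustration, not DBmail's actual code:
the `store` dict and `store_mimepart` function are hypothetical
stand-ins for the mimeparts table and the insertion path, and SHA-256
stands in for whichever configured hash is in use.

```python
import hashlib

# Hypothetical stand-in for the mimeparts table: digest -> list of
# (id, blob) rows sharing that digest.
store = {}
next_id = 1  # stands in for the auto-increment bigint primary key


def store_mimepart(blob: bytes) -> int:
    """Return the id of an existing identical blob, or insert a new row."""
    global next_id
    digest = hashlib.sha256(blob).hexdigest()
    # Byte-compare every candidate so a hash collision between two
    # different blobs is never mistaken for a duplicate.
    for row_id, existing in store.get(digest, []):
        if existing == blob:
            return row_id  # true duplicate: reuse the stored part
    row_id = next_id
    next_id += 1
    store.setdefault(digest, []).append((row_id, blob))
    return row_id


a = store_mimepart(b"hello")
b = store_mimepart(b"hello")  # identical blob: same id comes back
c = store_mimepart(b"world")  # new blob: fresh id
```

The digest only narrows the search to candidate rows; the byte-for-byte
comparison is what actually decides whether two parts are identical.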
I think if you build a composite index on hash and file size, it could
go a long way toward speeding things up at minimal cost. You only need
to worry about a collision if both the file/chunk size and the hash are
the same.
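A minimal sketch of that suggestion, using SQLite purely for
illustration (the table name, columns, and `store_mimepart` helper are
invented here, not DBmail's schema): the composite index lets the
lookup rule out any row whose size differs without ever reading the
blob.

```python
import hashlib
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE mimeparts (
    id   INTEGER PRIMARY KEY AUTOINCREMENT,
    hash TEXT    NOT NULL,
    size INTEGER NOT NULL,
    data BLOB    NOT NULL)""")
# Composite index: candidate rows must match on digest AND length.
conn.execute("CREATE INDEX idx_hash_size ON mimeparts (hash, size)")


def store_mimepart(blob: bytes) -> int:
    digest = hashlib.sha256(blob).hexdigest()
    # Only rows with the same digest and the same size are candidates;
    # the final byte-compare still guards against true collisions.
    for row_id, data in conn.execute(
            "SELECT id, data FROM mimeparts WHERE hash = ? AND size = ?",
            (digest, len(blob))):
        if data == blob:
            return row_id
    cur = conn.execute(
        "INSERT INTO mimeparts (hash, size, data) VALUES (?, ?, ?)",
        (digest, len(blob), blob))
    return cur.lastrowid
```

Since two blobs of different lengths can never be equal, the size
column is a free extra filter: it costs a few bytes per row in the
index and avoids blob comparisons entirely for same-digest,
different-size rows.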
_______________________________________________
DBmail mailing list
[email protected]
https://mailman.fastxs.nl/mailman/listinfo/dbmail