Paul J Stevens wrote:
I'm using sha1, not sha256.

And I'm hashing MIME-payload and MIME-headers, not full mime-parts. That means
that even if the same file is sent using different names (myfile.pdf and
yourfile.pdf) that file is stored only once since the actual payload will render
the same hash.

That is going to lead to trouble. Some years ago I had occasion to calculate checksums
of a very large number of files (looking to remove duplicates).
I discovered, to my dismay, that
a) I was getting collisions (same checksum) on files which were obviously different (because
they were different sizes).
b) I was getting collisions on files which were not so obviously different - same checksum,
same size, but different contents.

If you examine the mathematical theory, no matter how good the checksum algorithm is, if the checksum is shorter than the files you are calculating it over (and it usually is), then by the pigeonhole principle many different objects must share each checksum value: on average about 2^(n-k) of the possible n-bit objects for every k-bit checksum.
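
To make the counting argument concrete, here is a minimal sketch in Python. It assumes a deliberately weakened checksum (SHA-1 cut down to 16 bits) so that a collision turns up quickly; the names truncated_sha1 and find_collision are made up for this illustration and have nothing to do with the DBmail code. With only 65536 possible checksum values, any 65537 distinct inputs must contain at least one colliding pair.

```python
import hashlib


def truncated_sha1(data: bytes, nbytes: int = 2) -> bytes:
    """Return only the first nbytes of the SHA-1 digest (a deliberately weak checksum)."""
    return hashlib.sha1(data).digest()[:nbytes]


def find_collision(nbytes: int = 2):
    """Find two different inputs that share the same truncated checksum.

    There are only 2**(8 * nbytes) possible checksum values, so trying one
    more distinct input than that is guaranteed to produce a repeat.
    """
    seen = {}  # truncated digest -> first input that produced it
    limit = 2 ** (8 * nbytes) + 1
    for i in range(limit):
        data = str(i).encode("ascii")
        digest = truncated_sha1(data, nbytes)
        if digest in seen:
            return seen[digest], data, digest
        seen[digest] = data
    raise AssertionError("unreachable: the pigeonhole principle guarantees a collision")


first, second, digest = find_collision()
print(first, second, digest.hex())
```

A full 160-bit SHA-1 makes accidental collisions far less likely than this toy version does, but the same counting argument still applies to it.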

Please, please, if you discover that two mime parts have the same checksum and the same size, check that they are really equal (fetch the saved one from storage and compare the two byte for byte). The comparison only runs when a checksum and size already match, and genuine collisions are designed to be rare, so it should not cost a lot of computing time. But if you don't do it, sooner or later you will lose a mime part that was truly different from what was already stored, and your users
(and my users!) will be furious.
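
The check I am asking for could look roughly like the sketch below. It is only an illustration under assumptions of my own: the class VerifiedPartStore, its in-memory dict and the (digest, index) key are all hypothetical and are not how DBmail stores parts. The point is only the order of operations: look up by checksum, compare size and bytes, and store a separate copy when the bytes differ.

```python
import hashlib


class VerifiedPartStore:
    """Hypothetical in-memory store that deduplicates payloads by checksum,
    but confirms a match byte for byte before reusing a stored copy."""

    def __init__(self, hash_func=None):
        # Default to SHA-1 as in the quoted message; a custom hash_func can
        # be passed in to simulate collisions when testing.
        self._hash = hash_func or (lambda b: hashlib.sha1(b).hexdigest())
        self._buckets = {}  # digest -> list of distinct payloads with that digest

    def put(self, payload: bytes):
        """Store payload and return a (digest, index) key identifying it."""
        digest = self._hash(payload)
        bucket = self._buckets.setdefault(digest, [])
        for index, stored in enumerate(bucket):
            # Cheap size check first, then the full comparison.
            if len(stored) == len(payload) and stored == payload:
                return digest, index  # true duplicate: reuse the stored copy
        # Same checksum but different contents (or a new checksum): keep both.
        bucket.append(payload)
        return digest, len(bucket) - 1

    def get(self, key):
        """Return the payload stored under a key returned by put()."""
        digest, index = key
        return self._buckets[digest][index]
```

Passing a constant function as hash_func forces every payload into one bucket, which is a simple way to confirm that two different parts of the same size are both kept.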


_______________________________________________
DBmail mailing list
[email protected]
https://mailman.fastxs.nl/mailman/listinfo/dbmail
