Re: Why MD5 is used for hashes, also about non-deterministic IDs.

Randall Leeds Mon, 14 Nov 2011 14:48:41 -0800

On Mon, Nov 14, 2011 at 10:52, Alex Besogonov <[email protected]> wrote:
> I'm looking at CouchDB source code and I have several questions:
>
> 1) Why MD5 is used instead of more secure hashes. It's very real to
> imagine a situation where a malicious user can cause hash collision
> and cause problems in replication.


Can you explain a little bit more where you see this interacting with
replication?

>
> 2) ID is not completely deterministic - it depends on
> compression_level and compressible_types settings for attachments.
> Would it make sense to use MD5 of the original uncompressed document?
> And while you're at it, it probably makes sense to include file size
> in Atts2 tuple.
>

Nothing in my mind requires that IDs be deterministic. It's useful for
reducing conflicts when identical changes are replayed on different
replicating couches, but it's not strictly required.

With respect to uncompressed file size, sometimes that information is
not available for attachments since they may have been sent over the
wire in compressed form. We had this conversation a few times when
adding compression features and it was decided that uncompressing
on the fly, server-side, just to get the uncompressed file size and
hash was not worth it.
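
To make the trade-off concrete, here is a minimal Python sketch. It is
not CouchDB code: zlib stands in for whatever encoding the server
applies, and all names are made up for illustration. It shows that a
digest over the stored (compressed) bytes changes with the compression
level, while a digest over the uncompressed bytes does not, at the
cost of a decompression pass.

```python
import hashlib
import zlib


def md5_hex(data: bytes) -> str:
    """Hex MD5 of a byte string."""
    return hashlib.md5(data).hexdigest()


# Hypothetical attachment body; repetitive so it compresses well.
original = b"the same attachment body, repeated many times. " * 200

# The same content stored under two different compression settings.
fast = zlib.compress(original, 1)
best = zlib.compress(original, 9)

# Digest of the bytes as stored: depends on the compression level.
digest_fast = md5_hex(fast)
digest_best = md5_hex(best)

# Digest of the uncompressed content: needs a decompression pass,
# but is independent of how the bytes were stored.
digest_plain_fast = md5_hex(zlib.decompress(fast))
digest_plain_best = md5_hex(zlib.decompress(best))

print("stored digests equal:      ", digest_fast == digest_best)
print("uncompressed digests equal:", digest_plain_fast == digest_plain_best)
```

The second comparison is the deterministic variant the question asks
for; the decompression it needs is the server-side work we decided
against.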

Attachment records do have att_len and disk_len properties (sometimes
the same, depending on the encoding/compression during upload) and I
believe these are exposed in the _attachments metadata on document
requests. I don't know exactly what has changed in which release, so
they may not be visible in a released version of CouchDB. Looking at the
code in master right now, I see "length", "encoded_length", and
"digest" included in the attachment metadata.

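As a rough illustration of how a client might use those three fields,
here is a Python sketch. The field names come from the code in master
mentioned above; everything else is an assumption made for the
example: the helper names, the sample values, the "md5-" plus base64
digest format, and the idea that the digest covers the bytes as
stored.

```python
import base64
import hashlib
import zlib


def digest_of(data: bytes) -> str:
    """Assumed digest format: 'md5-' followed by the base64 MD5."""
    raw = hashlib.md5(data).digest()
    return "md5-" + base64.b64encode(raw).decode("ascii")


def stored_bytes_match(stored: bytes, meta: dict) -> bool:
    """Check stored (possibly encoded) bytes against the metadata.

    Uses "encoded_length" when present and falls back to "length".
    """
    expected_len = meta.get("encoded_length", meta["length"])
    return len(stored) == expected_len and digest_of(stored) == meta["digest"]


# Hypothetical attachment: plain content plus its stored form.
plain = b"hello attachment " * 50
stored = zlib.compress(plain, 8)

# Hypothetical metadata entry, shaped after the fields named above.
meta = {
    "length": len(plain),
    "encoded_length": len(stored),
    "digest": digest_of(stored),
}

print(stored_bytes_match(stored, meta))
print(stored_bytes_match(plain, meta))
```

With "length" available alongside "encoded_length", a client can learn
the uncompressed size without the server having to decompress anything
on request.
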
Thanks!
-Randall
