On Mon, Nov 14, 2011 at 10:52, Alex Besogonov <[email protected]> wrote:
> I'm looking at CouchDB source code and I have several questions:
>
> 1) Why MD5 is used instead of more secure hashes. It's very real to
> imagine a situation where a malicious user can cause hash collision
> and cause problems in replication.

Can you explain a little bit more where you see this interacting with
replication?

> 2) ID is not completely deterministic - it depends on
> compression_level and compressible_types settings for attachments.
> Would it make sense to use MD5 of the original uncompressed document?
> And while you're at it, it probably makes sense to include file size
> in Atts2 tuple.

Nothing in my mind requires that IDs be deterministic. It's useful for
reducing conflicts when identical changes are replayed on different
replicating couches, but it's not strictly required.

With respect to uncompressed file size, sometimes that information is
not available for attachments, since they may have been sent over the
wire in compressed form. We went over this a few times when adding the
compression features, and it was decided that uncompressing on the fly,
server-side, just to get the uncompressed file size and hash was not
worth it.

Attachment records do have att_len and disk_len properties (sometimes
the same, depending on the encoding/compression during upload), and I
believe these are exposed in the _attachments metadata on document
requests. I don't know exactly what has changed in which release, so
they may not be visible in a released version of CouchDB. Looking at the
code in master right now, I see "length", "encoded_length", and "digest"
included in the attachment metadata.

Thanks!

-Randall
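
To illustrate the point about compression settings, here is a minimal Python sketch (not CouchDB code). It assumes the digest is an MD5 over the bytes as stored, and it uses zlib as a stand-in for whatever encoding the server applies. The helper name attachment_info, the "md5-" plus base64 digest layout, and the extra "uncompressed_digest" field are assumptions made for the example; only "length", "encoded_length", and "digest" are names taken from the discussion above.

```python
import base64
import hashlib
import zlib


def md5_label(data: bytes) -> str:
    """Return an MD5 digest as 'md5-<base64>' (layout assumed for illustration)."""
    return "md5-" + base64.b64encode(hashlib.md5(data).digest()).decode("ascii")


def attachment_info(data: bytes, level: int) -> dict:
    """Hypothetical helper: compress `data` and describe the stored bytes."""
    encoded = zlib.compress(data, level)
    return {
        "length": len(data),             # size of the original body
        "encoded_length": len(encoded),  # size of the bytes as stored
        "digest": md5_label(encoded),    # hash of the compressed bytes
        "uncompressed_digest": md5_label(data),  # hash of the original body
    }


body = b"hello couchdb " * 1000
fast = attachment_info(body, 1)  # lowest compression level
best = attachment_info(body, 9)  # highest compression level

# The same body yields different digests when hashed after compression,
# but an identical digest when hashed before compression.
print(fast["digest"] == best["digest"])
print(fast["uncompressed_digest"] == best["uncompressed_digest"])
```

Hashing the original body would make the result independent of the compression level, but as noted above, the server would then have to decompress any attachment that arrives already compressed.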
