On 4/25/11 2:03 PM, Paul J Stevens wrote:
> On 04/25/2011 08:49 PM, tabris wrote:
>> Could you not make it 2-stage, and then configurable as to whether to
>> do split query or not?
>>
>> Basically have both queries. Only if the first succeeds do you check
>> the second. Kinda like this:
>>
>> if("SELECT 1 FROM dbmail_mimeparts WHERE hash=? AND size=? LIMIT 1")
>> { if("SELECT id FROM dbmail_mimeparts WHERE hash=? AND size=? AND
>> blob=?") { } }
>>
>> It would admittedly make the case where there is a legitimate
>> collision (legitimate as in, this _really_ is the same blob) slower.
>> I don't know if this is a particular problem or not. The case of the
>> second query failing (but the first succeeded) should be a degenerate
>> case.
> I think it would be a problem. And the case is far from degenerate in
> certain valid usage scenarios.

The case of the first and second queries both succeeding certainly is
not degenerate, especially for mailing list recipients. The case of
hash and size matching but the blob not matching should be degenerate
(otherwise not using the blob at all would never have been valid in
the first place).

And I'm not saying it should be on by default. But it seems to be the
best compromise between the original poster's concerns and maintaining
integrity.

> You would be slowing down the main use case for single-instance storage.
> Basically it would increase the amount of network IO even further for
> the case where a given blob is already in the database. Your approach
> will only speed up case where a given blob hasn't been seen before.

True, and well understood when I made the suggestion, but I don't think
the increase in bandwidth will be too high, and on average it will
decrease the bandwidth across the wire (in the most common case, you
won't be sending the blob across the wire twice).

Current flow: 99+% of the time, we push the blob across the wire twice.

    if(unlikely(SQL)) {
        # pushes blob across wire, but not for storage;
        # most of the time the test fails
        # we already have the blob in db, don't have to push again.
    } else {
        # most common case.
        pushBlob2DB
    }

Suggested [optional] flow: 99+% of the time, we push the blob across
the wire once.

    if(unlikely(SQL1)) {
        # common test, no blob, test most often fails
        if(likely(SQL2)) {
            # involves pushing blob in query, but not for storage
        } else {
            # pushes blob to storage, so we did push it across wire twice
            pushBlob2DB
        }
    } else {
        # we pushed it across wire only once
        pushBlob2DB
    }

This should optimize the common case, at the cost of making the less
common case more intensive. Admittedly, the 99% figure assumes that the
most common email case is blobs NOT matching (by hash). Maybe I'm
making an incorrect assumption.

> The thing in favor of your idea is that the most common occurrence of
> legitimate collision (blob_exists==true) may tend to happen mostly for
> small blobs like chunks of whitespace.
> Interesting...
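
For what it's worth, here is a minimal sketch of the suggested two-stage
flow, written in Python against an in-memory SQLite database so it can
be run standalone. It is not dbmail code. The column name `data`, the
helpers `make_db` and `store_part`, the `hash_fn` parameter, and the
`blob_sends` counter are all assumptions made up for illustration; only
the table name and the shape of the two queries come from this thread.

```python
import hashlib
import sqlite3


def make_db():
    """Create an in-memory stand-in for dbmail_mimeparts (schema is an assumption)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE dbmail_mimeparts ("
        "id INTEGER PRIMARY KEY, hash TEXT NOT NULL, "
        "size INTEGER NOT NULL, data BLOB NOT NULL)"
    )
    conn.execute(
        "CREATE INDEX mimeparts_hash_size ON dbmail_mimeparts (hash, size)"
    )
    return conn


def sha1_hex(blob):
    return hashlib.sha1(blob).hexdigest()


def store_part(conn, blob, stats, hash_fn=sha1_hex):
    """Return the id of the row holding `blob`, inserting it if needed.

    stats["blob_sends"] counts how often the blob itself is passed to
    the database, as a rough proxy for traffic across the wire.
    """
    digest = hash_fn(blob)
    size = len(blob)

    # Stage 1 (SQL1): cheap test on hash and size only, no blob sent.
    candidate = conn.execute(
        "SELECT 1 FROM dbmail_mimeparts WHERE hash=? AND size=? LIMIT 1",
        (digest, size),
    ).fetchone()

    if candidate is not None:
        # Stage 2 (SQL2): sends the blob for comparison, not for storage.
        stats["blob_sends"] += 1
        row = conn.execute(
            "SELECT id FROM dbmail_mimeparts "
            "WHERE hash=? AND size=? AND data=?",
            (digest, size, blob),
        ).fetchone()
        if row is not None:
            return row[0]  # same blob already stored, nothing to insert

    # Either no candidate at all, or hash and size matched but the blob
    # differed (the degenerate case): store the blob.
    stats["blob_sends"] += 1
    cur = conn.execute(
        "INSERT INTO dbmail_mimeparts (hash, size, data) VALUES (?, ?, ?)",
        (digest, size, blob),
    )
    return cur.lastrowid
```

In this sketch a blob that has not been seen before is sent once, a
blob that is already stored is sent once (for the comparison only), and
only the hash-and-size match with a differing blob costs two sends.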
_______________________________________________ Dbmail-dev mailing list Dbmail-dev@dbmail.org http://mailman.fastxs.nl/cgi-bin/mailman/listinfo/dbmail-dev