Re: [Dbmail-dev] blob_exists - selects based on blob as well as hash?

tabris Mon, 25 Apr 2011 16:44:41 -0700

On 4/25/11 2:03 PM, Paul J Stevens wrote:
> On 04/25/2011 08:49 PM, tabris wrote:
>> Could you not make it 2-stage, and then configurable as to whether to
>> do split query or not?
>>
>> Basically have both queries. Only if the first succeeds do you check
>> the second. Kinda like this:
>>
>> if("SELECT 1 FROM dbmail_mimeparts WHERE hash=? AND size=? LIMIT 1")
>> { if("SELECT id FROM dbmail_mimeparts WHERE hash=? AND size=? AND
>> blob=?") { } }
>>
>> It would admittedly make the case where there is a legitimate
>> collision (legitimate as in, this _really_ is the same blob) slower.
>> I don't know if this is a particular problem or not. The case of the
>> second query failing (but the first succeeded) should be a degenerate
>> case.
> I think it would be a problem. And the case is far from degenerate in
> certain valid usage scenarios.


    The case where the first and second queries both succeed certainly is
not degenerate, especially for mailing list recipients. The case where hash
and size match but the blob does not should be degenerate (otherwise
skipping the blob comparison altogether would never have been valid in the
first place).

    And I'm not saying it should be on by default. But it seems to be
the best compromise between the original poster's concerns and maintaining
integrity.
> You would be slowing down the main use case for single-instance storage.
> Basically it would increase the amount of network IO even further for
> the case where a given blob is already in the database. Your approach
> will only speed up case where a given blob hasn't been seen before.

    True, and well understood when I made the suggestion, but I don't
think the increase in bandwidth will be too high, and on average it should
decrease the bandwidth across the wire (in the most common case, you won't
be sending the blob across the wire twice).


Current flow: 99+% of the time, we push the blob across the wire twice

if(unlikely(SQL)) { # pushes blob across wire, but not for storage;
                    # most of the time the test fails
    # we already have the blob in db, don't have to push again.
} else {
    # most common case.
    pushBlob2DB
}

Suggested [optional] flow: 99+% of the time, we push the blob across the wire once

if(unlikely(SQL1)) { # common test, no blob, test most often fails
    if(likely(SQL2)) { # involves pushing blob in query, but not for storage
    } else { # pushes blob to storage, so we did push it across wire twice
        pushBlob2DB
    }
} else { # we pushed it across wire only once
    pushBlob2DB
}

This should optimize the common case, at the cost of making the less
common case more expensive...
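
To make the suggested flow concrete, here is a minimal sketch in Python
against an in-memory SQLite database. It is not dbmail code: the table
layout, the use of SHA-1, and the names make_db and store_part are
assumptions for illustration only. The second return value counts how many
times the blob was bound to a statement, i.e. pushed across the wire.

```python
import hashlib
import sqlite3


def make_db():
    """Create an in-memory stand-in for dbmail_mimeparts (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        'CREATE TABLE dbmail_mimeparts ('
        'id INTEGER PRIMARY KEY, hash TEXT, size INTEGER, "blob" BLOB)'
    )
    return conn


def store_part(conn, blob, two_stage=True):
    """Return (row id, number of times the blob was sent to the database)."""
    digest = hashlib.sha1(blob).hexdigest()  # hash choice is an assumption
    size = len(blob)
    sends = 0

    if two_stage:
        # SQL1: cheap test on hash and size only, no blob in the query.
        candidate = conn.execute(
            "SELECT 1 FROM dbmail_mimeparts WHERE hash=? AND size=? LIMIT 1",
            (digest, size),
        ).fetchone()
    else:
        # Single-stage flow: always run the full comparison.
        candidate = True

    if candidate:
        # SQL2: full comparison, blob goes across the wire but is not stored.
        sends += 1
        row = conn.execute(
            'SELECT id FROM dbmail_mimeparts WHERE hash=? AND size=? AND "blob"=?',
            (digest, size, blob),
        ).fetchone()
        if row:
            return row[0], sends

    # Not found (or hash/size matched but content differed): store the blob.
    sends += 1
    cur = conn.execute(
        'INSERT INTO dbmail_mimeparts (hash, size, "blob") VALUES (?, ?, ?)',
        (digest, size, blob),
    )
    return cur.lastrowid, sends
```

With two_stage=True a previously unseen blob is sent once; with
two_stage=False the same blob is sent twice (once for the test, once for
storage).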


That said, the 99% figure assumes that the most common case for email is a
blob that does NOT match (by hash) anything already stored. Maybe I'm making
an incorrect assumption.
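
How much that assumption matters can be checked with some back-of-the-
envelope arithmetic. The sketch below is only a rough model: p_hit (the
fraction of incoming parts whose hash and size already exist) and p_false
(the fraction of those hits where the content nevertheless differs) are
made-up parameters, and expected_costs is a hypothetical helper name.

```python
def expected_costs(p_hit, p_false=0.0):
    """Expected per-part cost of the current and the suggested flow.

    Returns two dicts with the expected number of times the blob crosses
    the wire ("blob_sends") and the expected number of statements sent
    ("queries").
    """
    p_miss = 1.0 - p_hit
    # Probability that the blob ends up being inserted.
    p_insert = p_miss + p_hit * p_false

    current = {
        # The blob is always sent in the SELECT, and again on insert.
        "blob_sends": 1.0 + p_insert,
        "queries": 1.0 + p_insert,
    }
    suggested = {
        # The blob is sent in SQL2 only on a hash/size hit, and on insert.
        "blob_sends": p_hit + p_insert,
        # SQL1 always runs; SQL2 runs only on a hit; then maybe an insert.
        "queries": 1.0 + p_hit + p_insert,
    }
    return current, suggested
```

Under this model, with p_false at zero, the suggested flow sends the blob
once per part regardless of p_hit, while the current flow sends it
2 - p_hit times. The price is p_hit extra statements per part, which is
the slowdown for already-stored blobs raised in the quoted reply.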




> The thing in favor of your idea is that the most common occurrence of
> legitimate collision (blob_exists==true) may tend to happen mostly for
> small blobs like chunks of whitespace.
>
Interesting...


_______________________________________________
Dbmail-dev mailing list
Dbmail-dev@dbmail.org
http://mailman.fastxs.nl/cgi-bin/mailman/listinfo/dbmail-dev
