[ http://jira.jboss.com/jira/browse/JBMAIL-36?page=comments#action_12316219 ]

Andrew Oliver commented on JBMAIL-36:
-------------------------------------
Yeah.

> Support (optional) CRC or long hashkey generation for bodies
> ------------------------------------------------------------
>
>          Key: JBMAIL-36
>          URL: http://jira.jboss.com/jira/browse/JBMAIL-36
>      Project: JBoss Mail
>         Type: Sub-task
>     Versions: 1.0-M4
>     Reporter: Andrew Oliver
>     Assignee: Andrew Oliver
>     Priority: Critical
>      Fix For: 1.0-M4
>
>    Original Estimate: 1 week
>            Remaining: 1 week
>
> The M3 Message Store prevents bodies from being stored multiple times and
> allows messages to stream directly to the DB. For large messages a line by
> line hash should be calculable and if it matches an existing message (this
> optimizes for disk size but costs performance) then the Mailbox entry is
> reassigned to the existing mailstore and then the new body is deleted.
> Example.
> 1. Assume that the following is a 64 MB stream that comes in (minus headers)
> in duplicate for both mails (meaning we're sending the same file):
> body line                           CRC/checksum/whatever
> XXXXXXXXX...XXXXXXXXXXXXXXXXXXX     123456
> YYYYYYYYY...YYYYYYYYYYYYYYYYYYY     654321
> ZZZZZZZZZ...ZZZZZZZZZZZZZZZZZZZ     321654
> ...............................     ......
> XXXXXXXXX...XXXXXXXXXXXXXXXXXXX     123456
> YYYYYYYYY...YYYYYYYYYYYYYYYYYYY     654321
> ZZZZZZZZZ...ZZZZZZZZZZZZZZZZZZZ     321654
> cumulative checksum accurate to at least 1/50000000
> 12341235125132412512
> if a "select body_id from bodies where checksum='12341235125132412512'"
> returns more than 1 result then the new body is deleted and the mailbox is
> assigned to the older of the two.
> So the idea above is important; algorithmic and method suggestions are not (I
> don't know my posterior from my elbow when it comes to efficient binary
> similarity detection -- I'm just pretty sure that's not to be done by direct
> matching on content!).
> It is important that minor revisions not cause collisions.
> So the 1/50000000 target for minimum collision should not be taken to mean
> that if you send me a doc, and I edit it and send it back, it is okay for my
> edits to be dropped. It means that, for this to be a viable algorithm, if I
> upload the text of a speech and you upload a completely different speech and
> somehow the checksum comes out just right, we could have that 1/50,000,000
> chance of two very different documents getting the same checksum; a minor
> revision to either should fix it.
> It is also important that proper boundaries be created (no chance that one
> time we include fuzz surrounding the body and another time we don't).

--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://jira.jboss.com/jira/secure/Administrators.jspa
-
If you want more information on JIRA, or have a bug to report see:
   http://www.atlassian.com/software/jira

_______________________________________________
JBoss-Development mailing list
https://lists.sourceforge.net/lists/listinfo/jboss-development
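For illustration only, the deduplication flow in the quoted issue description could be sketched roughly as below. This is not JBoss Mail code: the class and method names (`BodyDedupSketch`, `checksum`, `Store`) are hypothetical, SHA-256 is an assumed stand-in for the "CRC/checksum/whatever" the issue leaves open, and an in-memory map stands in for the `bodies` table. Each line is fed to the digest with a fixed terminator, which is one possible way to keep the body boundaries consistent between runs.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch; none of these names come from JBoss Mail. */
public class BodyDedupSketch {

    /** Streams the body line by line into one cumulative digest, returned as hex. */
    public static String checksum(Reader body) {
        try {
            // Assumption: SHA-256 stands in for the unspecified checksum algorithm.
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            BufferedReader lines = new BufferedReader(body);
            String line;
            while ((line = lines.readLine()) != null) {
                digest.update(line.getBytes(StandardCharsets.UTF_8));
                // Always feed the same terminator so CRLF and LF input hash identically.
                digest.update((byte) '\n');
            }
            StringBuilder hex = new StringBuilder();
            for (byte b : digest.digest()) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is a required algorithm on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    /** In-memory stand-in for the bodies table: checksum -> id of the oldest body. */
    public static class Store {
        private final Map<String, Long> bodyIdByChecksum = new HashMap<>();
        private long nextBodyId = 1;

        /** Returns the body id that the mailbox entry should point at. */
        public long store(String body) {
            String sum = checksum(new StringReader(body));
            Long existing = bodyIdByChecksum.get(sum);
            if (existing != null) {
                return existing; // duplicate: reuse the older body, keep no second copy
            }
            long id = nextBodyId++;
            bodyIdByChecksum.put(sum, id);
            return id;
        }
    }
}
```

Storing the same body twice through `Store.store` returns the same id, while a one-character revision yields a new id. Note that the issue describes streaming the new body to the DB first and deleting it on a match; the sketch simplifies this by checking before it stores anything.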
