[ http://jira.jboss.com/jira/browse/JBMAIL-36?page=comments#action_12316195 ]
     
Ricardo Arguello commented on JBMAIL-36:
----------------------------------------

Wouldn't a hash function (e.g. SHA-1) be more appropriate for this task?
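
A minimal sketch of what the hash-function suggestion could look like in Java, assuming the body is available as an InputStream. The names BodyDigest and sha1Hex are hypothetical and not part of JBoss Mail; only java.security.MessageDigest is a real API here. The digest is updated chunk by chunk, so a large body never has to be held in memory at once:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper for illustration; not an existing JBoss Mail class.
final class BodyDigest {
    private BodyDigest() {}

    /** Streams the body through SHA-1 and returns the digest as lowercase hex. */
    static String sha1Hex(InputStream body) throws IOException {
        MessageDigest md;
        try {
            md = MessageDigest.getInstance("SHA-1");
        } catch (NoSuchAlgorithmException e) {
            // SHA-1 is a required algorithm on standard Java platforms.
            throw new IllegalStateException("SHA-1 not available", e);
        }
        byte[] buf = new byte[8192];
        int n;
        // Feed the digest incrementally; only one buffer is in memory at a time.
        while ((n = body.read(buf)) != -1) {
            md.update(buf, 0, n);
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : md.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    public static void main(String[] args) throws IOException {
        byte[] body = "hello\r\n".getBytes(StandardCharsets.US_ASCII);
        System.out.println(sha1Hex(new ByteArrayInputStream(body)));
    }
}
```

A 160-bit digest makes accidental collisions between different bodies far less likely than the 1/50000000 target in the issue, at the price of more CPU per byte than a CRC.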


> Support (optional) CRC or long hashkey generation for bodies
> ------------------------------------------------------------
>
>          Key: JBMAIL-36
>          URL: http://jira.jboss.com/jira/browse/JBMAIL-36
>      Project: JBoss Mail
>         Type: Sub-task
>     Versions: 1.0-M4
>     Reporter: Andrew Oliver
>     Assignee: Andrew Oliver
>     Priority: Critical
>      Fix For: 1.0-M4
>
> Original Estimate: 1 week
>         Remaining: 1 week
>
> The M3 Message Store prevents bodies from being stored multiple times and 
> allows messages to stream directly to the DB.  For large messages, a hash 
> should be computable line by line as the body streams in.  If it matches an 
> existing message, the Mailbox entry is reassigned to the existing mailstore 
> and the new body is deleted.  This optimizes for disk size but costs 
> performance.
> Example:
> 1. Assume that the following is a 64 MB stream (minus headers) that comes in 
> twice, once for each of two mails (meaning we're sending the same file):
> body line                          CRC/checksum/whatever
> XXXXXXXXX...XXXXXXXXXXXXXXXXXXX    123456
> YYYYYYYYY...YYYYYYYYYYYYYYYYYYY    654321
> ZZZZZZZZZ...ZZZZZZZZZZZZZZZZZZZ    321654
> ...............................    ......
> XXXXXXXXX...XXXXXXXXXXXXXXXXXXX    123456
> YYYYYYYYY...YYYYYYYYYYYYYYYYYYY    654321
> ZZZZZZZZZ...ZZZZZZZZZZZZZZZZZZZ    321654
> cumulative checksum accurate to at least 1/50000000
> 12341235125132412512  
> if a "select body_id from bodies where checksum='12341235125132412512'" 
> returns more than 1 result then the new body is deleted and the mailbox is 
> assigned to the older of the two.
> So the idea above is important; the algorithm and method suggestions are not (I 
> don't know my posterior from my elbow when it comes to efficient binary 
> similarity detection -- I'm just pretty sure that's not to be done by direct 
> matching on content!).  
> It is important that minor revisions not cause collisions.  So the 1/50000000 
> target for minimum collision should not be taken to mean that if you send me 
> a doc, I edit it and send it back, it is okay to drop my edits.  It means 
> that, for this to be a viable algorithm, if I upload the text of a speech and 
> you upload a completely different speech and somehow the checksum comes out 
> just right, we could have that 1/50,000,000 chance of two very different 
> documents getting the same checksum; a minor revision to either should fix 
> it.
> It is also important that proper boundaries be created (no chance that one 
> time we include fuzz surrounding the body and another time we don't).
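
The deduplication step described in the issue can be sketched as follows. This is an illustration under stated assumptions, not the M3 Message Store implementation: an in-memory map stands in for the `bodies` table and its `checksum` column, SHA-1 stands in for the unspecified checksum, and DedupBodyStore, store and storedBodies are hypothetical names. The sketch also looks the checksum up before saving, whereas the issue describes streaming the new body to the DB first and deleting it afterwards if a match is found.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.Map;

// Hypothetical in-memory stand-in for the "bodies" table; illustration only.
final class DedupBodyStore {
    // checksum -> body_id, playing the role of the checksum column lookup.
    private final Map<String, Long> bodyIdByChecksum = new HashMap<>();
    private long nextBodyId = 1;

    /** Returns the id of the body that the mailbox entry should reference. */
    long store(byte[] body) {
        String checksum = sha1Hex(body);
        Long existing = bodyIdByChecksum.get(checksum);
        if (existing != null) {
            // Duplicate: reuse the older body and discard the new copy.
            return existing;
        }
        long id = nextBodyId++;
        bodyIdByChecksum.put(checksum, id);
        return id;
    }

    /** Number of distinct bodies actually kept. */
    int storedBodies() {
        return bodyIdByChecksum.size();
    }

    static String sha1Hex(byte[] data) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-1");
            StringBuilder hex = new StringBuilder();
            for (byte b : md.digest(data)) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-1 not available", e);
        }
    }

    public static void main(String[] args) {
        DedupBodyStore store = new DedupBodyStore();
        byte[] original = "same attachment\r\n".getBytes(StandardCharsets.US_ASCII);
        byte[] revised = "same attachment, edited\r\n".getBytes(StandardCharsets.US_ASCII);
        System.out.println(store.store(original));          // first copy gets a new id
        System.out.println(store.store(original.clone()));  // duplicate reuses that id
        System.out.println(store.store(revised));           // minor revision gets its own id
    }
}
```

Because the digest covers exactly the bytes passed to store, the caller has to strip headers and surrounding delimiters the same way every time, which is the boundary requirement stated above.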

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://jira.jboss.com/jira/secure/Administrators.jspa
-
If you want more information on JIRA, or have a bug to report see:
   http://www.atlassian.com/software/jira


