Thanks for responding Jonathan. I will look into k-grams approach. 

The objects could differ by small local changes. To provide some business 
context, the application requires indexing email messages and attachments. If 
the attachments differ by some threshold (users making edits/reviews), the 
attachment needs to be flagged and major/minor versioned.

Thanks
AS


----- Original Message ----
From: Jonathan Young <[EMAIL PROTECTED]>
To: Aaron Schon <[EMAIL PROTECTED]>
Sent: Friday, November 14, 2008 5:52:17 PM
Subject: RE: Extremely Large Strings Comparison (slightly off-topic)

Aaron - Although a naïve implementation of a Levenshtein distance metric takes 
O(n*m) time, if you are willing to bound the maximum distance by k << n,m (e.g. 
if you aren't interested in distances greater than k) then the distance 
calculation can take O(k*min(n,m)).

Another approach would be to use shingles or k-grams from the original 
document.  I believe Lucene has some support for that.

It depends a lot on the ways in which you expect the two strings to differ.  
The fact that it is Base64-encoded MIME content doesn't matter - are they 
actually objects which are going to have small local changes, or are there 
likely to be changes which cascade (e.g. if the data is compressed).

I hope this helps,

--- Jonathan

-----Original Message-----
From: Aaron Schon [mailto:[EMAIL PROTECTED]
Sent: Friday, November 14, 2008 5:00 PM
To: java-user@lucene.apache.org
Subject: Extremely Large Strings Comparison (slightly off-topic)

hi I need to compare two Base64 representation strings of some MIME content 
that I am storing within a Lucene index. I need to efficiently compare them to 
find the closest match to a query Base64 string , post Lucene query.

I am not sure of the best way to approach this, could I compare the hashes and 
compute their similarity? Levenshtein distance seems hard because of the size 
of ths strings and seems inefficient? Is there any other method you could 
suggest?

n.b: The idea is to not to determine exact match or not, it is to compute a 
similarity metric. for example

John & Johnson (closer)
vs,
John & Jimmy (farther)

tia,
Aaron




---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]




---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

Reply via email to