[ 
https://issues.apache.org/jira/browse/ASTERIXDB-1778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15848036#comment-15848036
 ] 

Taewoo Kim edited comment on ASTERIXDB-1778 at 2/1/17 6:09 AM:
---------------------------------------------------------------

Got a hint from 
http://stackoverflow.com/questions/9453731/how-to-calculate-distance-similarity-measure-of-given-2-strings/9454016#9454016:

Here are the revised method:

Basic Algorithm:
For two strings X and Y, 

Init:
Construct D(len(X)+1=M, len(Y)+1=N).
D(i,0) = i
D(j,0) = j
T = Given Threshold

Iteration:
{code}
For each i = 1 to M
    MinDistance = INT.MAX_VALUE
    For each j = 1 to N:
        D(i,j) = min( D(i-1,j) + 1 // deletion case
                      D(i,j-1) + 1 // insertion case
                      D(i-1,j-1) + cost (cost is 1 if X( i ) != Y(j), 
                                                 cost is 0 if X( i ) = Y(j))
                    )
         If (D(i,j) < MinDistance) 
               MinDistance = D(i,j)
     // After processing each row,
     If (MinDistance > T)
         return -1
{code}
     
Result:                    
D(M,N) is the edit distance value. -1 is returned in case when it becomes 
obvious that the final edit-distance will be greater than T.

Early Termination condition:
So, for the given threshold T, we can stop the computation early if the min 
value among all possible values from each row is greater than the values of 
three cases are greater than T.



was (Author: wangsaeu):
Got a hint from 
http://stackoverflow.com/questions/9453731/how-to-calculate-distance-similarity-measure-of-given-2-strings/9454016#9454016:

Here are the revised method:

Basic Algorithm:
For two strings X and Y, 

Init:
Construct D(len(X)+1=M, len(Y)+1=N).
D(i,0) = i
D(j,0) = j
T = Given Threshold

Iteration:
For each i = 1 to M
    MinDistance = INT.MAX_VALUE
    For each j = 1 to N:
        D(i,j) = min( D(i-1,j) + 1 // deletion case
                      D(i,j-1) + 1 // insertion case
                      D(i-1,j-1) + cost (cost is 1 if X(i) != Y(j), 
                                                 cost is 0 if X(i) = Y(j))
                    )
         If (D(i,j) < MinDistance) 
               MinDistance = D(i,j)
     // After processing each row,
     If (MinDistance > T)
         return -1
     
Result:                    
D(M,N) is the edit distance value. -1 is returned in case when it becomes 
obvious that the final edit-distance will be greater than T.

Early Termination condition:
So, for the given threshold T, we can stop the computation early if the min 
value among all possible values from each row is greater than the values of 
three cases are greater than T.


> Calculating the edit-distance in SimilarityMetricEditDistance class can be 
> improved.
> ------------------------------------------------------------------------------------
>
>                 Key: ASTERIXDB-1778
>                 URL: https://issues.apache.org/jira/browse/ASTERIXDB-1778
>             Project: Apache AsterixDB
>          Issue Type: Improvement
>            Reporter: Taewoo Kim
>            Priority: Minor
>
> Local optimization of the current edit-distance handling code in AsterixDB: 
> Can we terminate the edit distance calculation based on a given edit-distance 
> threshold T? I think the answer is yes.
> Basic Algorithm:
> For two strings X and Y, 
> Init:
> Construct D(len(X)+1=M, len(Y)+1=N).
> D(i,0) = i
> D(j,0) = j
> Iteration:
> For each i = 1 to M
>     For each j = 1 to N
>         D(i,j) = min( D(i-1,j) + 1 // deletion case
>                       D(i,j-1) + 1 // insertion case
>                       D(i-1,j-1) + cost (cost is 2 if X(i) != Y(j), 
>                                                  cost is 0 if X(i) = Y(j))
>                     )
> Result:                    
> D(M,N) is the edit distance value.
> Example:
> Early Termination condition:
> So, for the given threshold T, we can stop the computation early if the 
> values of three cases are greater than T. That is,
> min (D(i-1,j), D(i,j-1), D(i-1,j-1)) > T
> This holds since if the cost of all possible cases (insertion, deletion, and 
> substitution) is greater than T, all future operations will be greater than T 
> in any cases.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

Reply via email to