[
https://issues.apache.org/jira/browse/ASTERIXDB-1778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15849528#comment-15849528
]
Taewoo Kim commented on ASTERIXDB-1778:
---------------------------------------
Actually, I was wrong about the output.
Currently, if the first element is false, then the second element is not an
actual edit-distance, it's Integer.MAX_VALUE. So, I think we are OK. Applying
the optimized version will not change the contract. To summarize,
The output of edit_distance_check function - a list where the first element is
true/false of the evaluation and the second element is the edit distance for
the given expressions (if the first element is false, then the value of the
second element is Integer.MAX_VALUE.)
> Calculating the edit-distance in SimilarityMetricEditDistance class can be
> improved.
> ------------------------------------------------------------------------------------
>
> Key: ASTERIXDB-1778
> URL: https://issues.apache.org/jira/browse/ASTERIXDB-1778
> Project: Apache AsterixDB
> Issue Type: Improvement
> Reporter: Taewoo Kim
> Assignee: Taewoo Kim
> Priority: Minor
>
> Local optimization of the current edit-distance handling code in AsterixDB:
> Can we terminate the edit distance calculation based on a given edit-distance
> threshold T? I think the answer is yes.
> Basic Algorithm:
> For two strings X and Y,
> Init:
> Construct D(len(X)+1=M, len(Y)+1=N).
> D(i,0) = i
> D(j,0) = j
> Iteration:
> {code}
> For each i = 1 to M
> For each j = 1 to N
> D(i,j) = min( D(i-1,j) + 1 // deletion case
> D(i,j-1) + 1 // insertion case
> D(i-1,j-1) + cost (cost is 2 if X(i) != Y(j),
> cost is 0 if X(i) = Y(j))
> )
> {code}
> Result:
> D(M,N) is the edit distance value.
> Example:
> Early Termination condition:
> So, for the given threshold T, we can stop the computation early if the
> values of three cases are greater than T. That is,
> min (D(i-1,j), D(i,j-1), D(i-1,j-1)) > T
> This holds since if the cost of all possible cases (insertion, deletion, and
> substitution) is greater than T, all future operations will be greater than T
> in any cases.
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)