Taewoo Kim created ASTERIXDB-1778:
-------------------------------------
Summary: Calculating the edit-distance in
SimilarityMetricEditDistance class can be improved.
Key: ASTERIXDB-1778
URL: https://issues.apache.org/jira/browse/ASTERIXDB-1778
Project: Apache AsterixDB
Issue Type: Improvement
Reporter: Taewoo Kim
Priority: Minor
Local optimization of the current edit-distance handling code in AsterixDB: Can
we terminate the edit distance calculation based on a given edit-distance
threshold T? I think the answer is yes.
Basic Algorithm:
For two strings X and Y,
Init:
Construct D(len(X)+1=M, len(Y)+1=N).
D(i,0) = i
D(j,0) = j
Iteration:
For each i = 1 to M
For each j = 1 to N
D(i,j) = min( D(i-1,j) + 1 // deletion case
D(i,j-1) + 1 // insertion case
D(i-1,j-1) + cost (cost is 2 if X(i) != Y(j),
cost is 0 if X(i) = Y(j))
)
Result:
D(M,N) is the edit distance value.
Example:
Early Termination condition:
So, for the given threshold T, we can stop the computation early if the values
of three cases are greater than T. That is,
min (D(i-1,j), D(i,j-1), D(i-1,j-1)) > T
This holds since if the cost of all possible cases (insertion, deletion, and
substitution) is greater than T, all future operations will be greater than T
in any cases.
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)