[
https://issues.apache.org/jira/browse/ASTERIXDB-1778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15853163#comment-15853163
]
ASF subversion and git services commented on ASTERIXDB-1778:
------------------------------------------------------------
Commit 56189fb1116a25aca29b3989814ceea00d470e39 in asterixdb's branch
refs/heads/master from [~wangsaeu]
[ https://git-wip-us.apache.org/repos/asf?p=asterixdb.git;h=56189fb ]
ASTERIXDB-1778: Optimize the edit-distance-check function
- Only calculate 2 * (threshold + 1) cells, rather than all cells per row.
- Terminate the calculation steps early when it become obvious that
the possible edit-distance value is greater than the given threshold.
There is no reason to compute all cells in the 2 dimensional array.
- Move the location of IListIterator to Hyracks since we now have
a CharacterIterator in a String. Change the name to ISequenceIterator.
- Add the section for the function in the manual.
- Remove letter counting filtering method since it is only applicable for
the string in ASCII range (0 ~ 127).
Change-Id: Ibc8729c4514bb87c347dd7d50358fd897b769977
Reviewed-on: https://asterix-gerrit.ics.uci.edu/1481
Sonar-Qube: Jenkins <[email protected]>
Tested-by: Jenkins <[email protected]>
BAD: Jenkins <[email protected]>
Integration-Tests: Jenkins <[email protected]>
Reviewed-by: Jianfeng Jia <[email protected]>
> Calculating the edit-distance in SimilarityMetricEditDistance class can be
> improved.
> ------------------------------------------------------------------------------------
>
> Key: ASTERIXDB-1778
> URL: https://issues.apache.org/jira/browse/ASTERIXDB-1778
> Project: Apache AsterixDB
> Issue Type: Improvement
> Reporter: Taewoo Kim
> Assignee: Taewoo Kim
> Priority: Minor
>
> Local optimization of the current edit-distance handling code in AsterixDB:
> Can we terminate the edit distance calculation based on a given edit-distance
> threshold T? I think the answer is yes.
> Basic Algorithm:
> For two strings X and Y,
> Init:
> Construct D(len(X)+1=M, len(Y)+1=N).
> D(i,0) = i
> D(j,0) = j
> Iteration:
> {code}
> For each i = 1 to M
> For each j = 1 to N
> D(i,j) = min( D(i-1,j) + 1 // deletion case
> D(i,j-1) + 1 // insertion case
> D(i-1,j-1) + cost (cost is 2 if X(i) != Y(j),
> cost is 0 if X(i) = Y(j))
> )
> {code}
> Result:
> D(M,N) is the edit distance value.
> Example:
> Early Termination condition:
> So, for the given threshold T, we can stop the computation early if the
> values of three cases are greater than T. That is,
> min (D(i-1,j), D(i,j-1), D(i-1,j-1)) > T
> This holds since if the cost of all possible cases (insertion, deletion, and
> substitution) is greater than T, all future operations will be greater than T
> in any cases.
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)