[ https://issues.apache.org/jira/browse/ASTERIXDB-1778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15853163#comment-15853163 ]
ASF subversion and git services commented on ASTERIXDB-1778: ------------------------------------------------------------ Commit 56189fb1116a25aca29b3989814ceea00d470e39 in asterixdb's branch refs/heads/master from [~wangsaeu] [ https://git-wip-us.apache.org/repos/asf?p=asterixdb.git;h=56189fb ] ASTERIXDB-1778: Optimize the edit-distance-check function - Only calculate 2 * (threshold + 1) cells, rather than all cells per row. - Terminate the calculation steps early when it become obvious that the possible edit-distance value is greater than the given threshold. There is no reason to compute all cells in the 2 dimensional array. - Move the location of IListIterator to Hyracks since we now have a CharacterIterator in a String. Change the name to ISequenceIterator. - Add the section for the function in the manual. - Remove letter counting filtering method since it is only applicable for the string in ASCII range (0 ~ 127). Change-Id: Ibc8729c4514bb87c347dd7d50358fd897b769977 Reviewed-on: https://asterix-gerrit.ics.uci.edu/1481 Sonar-Qube: Jenkins <jenk...@fulliautomatix.ics.uci.edu> Tested-by: Jenkins <jenk...@fulliautomatix.ics.uci.edu> BAD: Jenkins <jenk...@fulliautomatix.ics.uci.edu> Integration-Tests: Jenkins <jenk...@fulliautomatix.ics.uci.edu> Reviewed-by: Jianfeng Jia <jianfeng....@gmail.com> > Calculating the edit-distance in SimilarityMetricEditDistance class can be > improved. > ------------------------------------------------------------------------------------ > > Key: ASTERIXDB-1778 > URL: https://issues.apache.org/jira/browse/ASTERIXDB-1778 > Project: Apache AsterixDB > Issue Type: Improvement > Reporter: Taewoo Kim > Assignee: Taewoo Kim > Priority: Minor > > Local optimization of the current edit-distance handling code in AsterixDB: > Can we terminate the edit distance calculation based on a given edit-distance > threshold T? I think the answer is yes. > Basic Algorithm: > For two strings X and Y, > Init: > Construct D(len(X)+1=M, len(Y)+1=N). > D(i,0) = i > D(j,0) = j > Iteration: > {code} > For each i = 1 to M > For each j = 1 to N > D(i,j) = min( D(i-1,j) + 1 // deletion case > D(i,j-1) + 1 // insertion case > D(i-1,j-1) + cost (cost is 2 if X(i) != Y(j), > cost is 0 if X(i) = Y(j)) > ) > {code} > Result: > D(M,N) is the edit distance value. > Example: > Early Termination condition: > So, for the given threshold T, we can stop the computation early if the > values of three cases are greater than T. That is, > min (D(i-1,j), D(i,j-1), D(i-1,j-1)) > T > This holds since if the cost of all possible cases (insertion, deletion, and > substitution) is greater than T, all future operations will be greater than T > in any cases. -- This message was sent by Atlassian JIRA (v6.3.15#6346)