This is an automated email from the ASF dual-hosted git repository. kinow pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/commons-text.git
commit b499f8586c0da4b690afdb1fda39f701e6c2babd
Author: ali-ghanbari <[email protected]>
AuthorDate: Sat May 22 17:01:51 2021 -0500

    [TEXT-212] A more efficient implementation for calculating LCS

    1. Additions:
      (a) An implementation of Hirschberg's Longest Commons Substring (LCS) algorithm. This implementation is more space efficient than the previous version (linear vs. quadratic), but the time complexity stays the same.
      (b) Javadoc and comments are updated accordingly and a couple of typos are fixed.
      (c) JMH performance analysis for LongestCommonSubsequence
      (d) A CSV file containing test inputs for JMH analysis is added to resources. Please note that jaring this text file (which contains only English characters) should make it quite small. So, we should not worry about the JAR size.
      (e) Modified POM file: makred Ali Ghanbari as a contributor and made several modifications to avoid conflict.

    2. Modifications:
      (a) The method longestCommonSubstringLengthArray is marked as deprecated.
---
 pom.xml                                            |   5 +
 .../text/similarity/LongestCommonSubsequence.java  | 265 ++++++++++++++++-----
 .../jmh/LongestCommonSubsequencePerformance.java   | 165 +++++++++++++
 src/test/resources/lcs-perf-analysis-inputs.csv    |   5 +
 4 files changed, 375 insertions(+), 65 deletions(-)

diff --git a/pom.xml b/pom.xml
index 47ca065..d898fa9 100644
--- a/pom.xml
+++ b/pom.xml
@@ -153,6 +153,7 @@
             <exclude>src/site/resources/download_lang.cgi</exclude>
             <exclude>src/test/resources/stringEscapeUtilsTestData.txt</exclude>
             <exclude>src/site/resources/release-notes/RELEASE-NOTES-*.txt</exclude>
+            <exclude>src/test/resources/lcs-perf-analysis-inputs.csv</exclude>
           </excludes>
         </configuration>
       </plugin><!-- override skip property of parent pom -->
@@ -431,6 +432,10 @@
     <contributor>
       <name>Nick Wong</name>
     </contributor>
+    <contributor>
+      <name>Ali Ghanbari</name>
+      <url>https://ali-ghanbari.github.io/</url>
+    </contributor>
   </contributors>

   <scm>
diff --git a/src/main/java/org/apache/commons/text/similarity/LongestCommonSubsequence.java b/src/main/java/org/apache/commons/text/similarity/LongestCommonSubsequence.java
index 0f51906..4d35b19 100644
--- a/src/main/java/org/apache/commons/text/similarity/LongestCommonSubsequence.java
+++ b/src/main/java/org/apache/commons/text/similarity/LongestCommonSubsequence.java
@@ -29,28 +29,45 @@ package org.apache.commons.text.similarity;
  * </p>
  *
  * <p>
- * This implementation is based on the Longest Commons Substring algorithm
- * from <a href="https://en.wikipedia.org/wiki/Longest_common_subsequence_problem">
- * https://en.wikipedia.org/wiki/Longest_common_subsequence_problem</a>.
+ * As of version 1.10, a more space-efficient of the algorithm is implemented. The new algorithm has linear space
+ * complexity instead of quadratic. However, time complexity is still quadratic in the size of input strings.
  * </p>
  *
- * <p>For further reading see:</p>
+ * <p>
+ * The implementation is based on Hirschberg's Longest Commons Substring algorithm (cited below).
+ * </p>
  *
- * <p>Lothaire, M. <i>Applied combinatorics on words</i>. New York: Cambridge U Press, 2005. <b>12-13</b></p>
+ * <p>For further reading see:</p>
+ * <ul>
+ * <li>
+ * Lothaire, M. <i>Applied combinatorics on words</i>. New York: Cambridge U Press, 2005. <b>12-13</b>
+ * </li>
+ * <li>
+ * D. S. Hirschberg, "A linear space algorithm for computing maximal common subsequences," CACM, 1975, pp. 341--343.
+ * </li>
+ * </ul>
  *
  * @since 1.0
  */
 public class LongestCommonSubsequence implements SimilarityScore<Integer> {
-
     /**
-     * Calculates longest common subsequence similarity score of two {@code CharSequence}'s passed as
+     * Calculates the longest common subsequence similarity score of two {@code CharSequence}'s passed as
      * input.
      *
-     * @param left first character sequence
-     * @param right second character sequence
-     * @return longestCommonSubsequenceLength
-     * @throws IllegalArgumentException
-     * if either String input {@code null}
+     * <p>
+     * This method implements a more efficient version of LCS algorithm which has quadratic time and
+     * linear space complexity.
+     * </p>
+     *
+     * <p>
+     * This method is based on newly implemented {@link #algorithmB(CharSequence, CharSequence)}.
+     * An evaluation using JMH revealed that this method is almost two times faster than its previous version.
+     * </p>
+     *
+     * @param left First character sequence
+     * @param right Second character sequence
+     * @return Length of the longest common subsequence of <code>left</code> and <code>right</code>
+     * @throws IllegalArgumentException if either String input {@code null}
      */
     @Override
     public Integer apply(final CharSequence left, final CharSequence right) {
@@ -58,7 +75,63 @@ public class LongestCommonSubsequence implements SimilarityScore<Integer> {
         if (left == null || right == null) {
             throw new IllegalArgumentException("Inputs must not be null");
         }
-        return longestCommonSubsequence(left, right).length();
+        // Find lengths of two strings
+        final int leftSz = left.length();
+        final int rightSz = right.length();
+
+        // Check if we can avoid calling algorithmB which involves heap space allocation
+        if (leftSz == 0 || rightSz == 0) {
+            return 0;
+        }
+
+        // Check if we can save even more space
+        if (leftSz < rightSz) {
+            return algorithmB(right, left)[leftSz];
+        }
+        return algorithmB(left, right)[rightSz];
+    }
+
+    /**
+     * An implementation of "ALG B" from Hirschberg's CACM '71 paper.
+     * Assuming the first input sequence is of size <code>m</code> and the second input sequence is of size
+     * <code>n</code>, this method returns the last row of the dynamic programming (DP) table when calculating
+     * the LCS of the two sequences in <i>O(m*n)</i> time and <i>O(n)</i> space.
+     * The last element of the returned array, is the size of the LCS of the two input sequences.
+     *
+     * @param left First input sequence.
+     * @param right Second input sequence.
+     * @return Last row of DP table for calculating the LCS of <code>left</code> and <code>right</code>
+     * @since 1.10
+     */
+    static int[] algorithmB(final CharSequence left, final CharSequence right) {
+        final int m = left.length();
+        final int n = right.length();
+
+        // Creating an array for storing two rows of DP table
+        final int[][] dpRows = new int[2][1 + n];
+
+        for (int i = 1; i <= m; i++) {
+            // K(0, j) <- K(1, j) [j = 0...n], as per the paper:
+            // Since we have references in Java, we can swap references instead of literal copying.
+            // We could also use a "binary index" using modulus operator, but directly swapping the
+            // two rows helps readability and keeps the code consistent with the algorithm description
+            // in the paper.
+            final int[] temp = dpRows[0];
+            dpRows[0] = dpRows[1];
+            dpRows[1] = temp;
+
+            for (int j = 1; j <= n; j++) {
+                if (left.charAt(i - 1) == right.charAt(j - 1)) {
+                    dpRows[1][j] = dpRows[0][j - 1] + 1;
+                } else {
+                    dpRows[1][j] = Math.max(dpRows[1][j - 1], dpRows[0][j]);
+                }
+            }
+        }
+
+        // LL(j) <- K(1, j) [j=0...n], as per the paper:
+        // We don't need literal copying of the array, we can just return the reference
+        return dpRows[1];
     }

     /**
@@ -80,68 +153,126 @@ public class LongestCommonSubsequence implements SimilarityScore<Integer> {
      * @param left first character sequence
      * @param right second character sequence
      * @return The longest common subsequence found
-     * @throws IllegalArgumentException
-     * if either String input {@code null}
+     * @throws IllegalArgumentException if either String input {@code null}
      * @deprecated Deprecated as of 1.2 due to a typo in the method name.
-     * Use {@link #longestCommonSubsequence(CharSequence, CharSequence)} instead.
-     * This method will be removed in 2.0.
+     * Use {@link #longestCommonSubsequence(CharSequence, CharSequence)} instead.
+     * This method will be removed in 2.0.
      */
     @Deprecated
     public CharSequence logestCommonSubsequence(final CharSequence left, final CharSequence right) {
         return longestCommonSubsequence(left, right);
     }

-   /**
-    * Computes the longest common subsequence between the two {@code CharSequence}'s passed as
-    * input.
-    *
-    * <p>
-    * Note, a substring and subsequence are not necessarily the same thing. Indeed, {@code abcxyzqrs} and
-    * {@code xyzghfm} have both the same common substring and subsequence, namely {@code xyz}. However,
-    * {@code axbyczqrs} and {@code abcxyzqtv} have the longest common subsequence {@code xyzq} because a
-    * subsequence need not have adjacent characters.
-    * </p>
-    *
-    * <p>
-    * For reference, we give the definition of a subsequence for the reader: a <i>subsequence</i> is a sequence that
-    * can be derived from another sequence by deleting some elements without changing the order of the remaining
-    * elements.
-    * </p>
-    *
-    * @param left first character sequence
-    * @param right second character sequence
-    * @return The longest common subsequence found
-    * @throws IllegalArgumentException
-    * if either String input {@code null}
-    * @since 1.2
-    */
-   public CharSequence longestCommonSubsequence(final CharSequence left, final CharSequence right) {
-       // Quick return
-       if (left == null || right == null) {
-           throw new IllegalArgumentException("Inputs must not be null");
-       }
-       final StringBuilder longestCommonSubstringArray = new StringBuilder(Math.max(left.length(), right.length()));
-       final int[][] lcsLengthArray = longestCommonSubstringLengthArray(left, right);
-       int i = left.length() - 1;
-       int j = right.length() - 1;
-       int k = lcsLengthArray[left.length()][right.length()] - 1;
-       while (k >= 0) {
-           if (left.charAt(i) == right.charAt(j)) {
-               longestCommonSubstringArray.append(left.charAt(i));
-               i = i - 1;
-               j = j - 1;
-               k = k - 1;
-           } else if (lcsLengthArray[i + 1][j] < lcsLengthArray[i][j + 1]) {
-               i = i - 1;
-           } else {
-               j = j - 1;
-           }
-       }
-       return longestCommonSubstringArray.reverse().toString();
-   }
+    /**
+     * Computes the longest common subsequence between the two {@code CharSequence}'s passed as
+     * input.
+     *
+     * <p>
+     * This method implements a more efficient version of LCS algorithm which although has quadratic time, it
+     * has linear space complexity.
+     * </p>
+     *
+     *
+     * <p>
+     * Note, a substring and subsequence are not necessarily the same thing. Indeed, {@code abcxyzqrs} and
+     * {@code xyzghfm} have both the same common substring and subsequence, namely {@code xyz}. However,
+     * {@code axbyczqrs} and {@code abcxyzqtv} have the longest common subsequence {@code xyzq} because a
+     * subsequence need not have adjacent characters.
+     * </p>
+     *
+     * <p>
+     * For reference, we give the definition of a subsequence for the reader: a <i>subsequence</i> is a sequence that
+     * can be derived from another sequence by deleting some elements without changing the order of the remaining
+     * elements.
+     * </p>
+     *
+     * @param left First character sequence
+     * @param right Second character sequence
+     * @return The longest common subsequence found
+     * @throws IllegalArgumentException if either String input {@code null}
+     * @since 1.2
+     */
+    public CharSequence longestCommonSubsequence(final CharSequence left, final CharSequence right) {
+        // Quick return
+        if (left == null || right == null) {
+            throw new IllegalArgumentException("Inputs must not be null");
+        }
+        // Find lengths of two strings
+        final int leftSz = left.length();
+        final int rightSz = right.length();
+
+        // Check if we can avoid calling algorithmC which involves heap space allocation
+        if (leftSz == 0 || rightSz == 0) {
+            return "";
+        }
+
+        // Check if we can save even more space
+        if (leftSz < rightSz) {
+            return algorithmC(right, left);
+        }
+        return algorithmC(left, right);
+    }

     /**
+     * An implementation of "ALG C" from Hirschberg's CACM '71 paper.
+     * Assuming the first input sequence is of size <code>m</code> and the second input sequence is of size
+     * <code>n</code>, this method return the Longest Common Subsequence (LCS) the two sequences in
+     * <i>O(m*n)</i> time and <i>O(m+n)</i> space.
      *
+     * @param left First input sequence.
+     * @param right Second input sequence.
+     * @return The LCS of <code>left</code> and <code>right</code>
+     * @since 1.10
+     */
+    static String algorithmC(final CharSequence left, final CharSequence right) {
+        final int m = left.length();
+        final int n = right.length();
+
+        String out = "";
+
+        if (m == 1) { // Handle trivial cases, as per the paper
+            final char leftCh = left.charAt(0);
+            for (int j = 0; j < n; j++) {
+                if (leftCh == right.charAt(j)) {
+                    out += leftCh;
+                    break;
+                }
+            }
+        } else if (n > 0 && m > 1) {
+            final int mid = m / 2; // Find the middle point
+
+            final CharSequence leftFirstPart = left.subSequence(0, mid);
+            final CharSequence leftSecondPart = left.subSequence(mid, m);
+
+            // Step 3 of the algorithm: two calls to Algorithm B
+            final int[] l1 = algorithmB(leftFirstPart, right);
+            final int[] l2 = algorithmB(reverse(leftSecondPart), reverse(right));
+
+            // Find k, as per the Step 4 of the algorithm
+            int k = 0;
+            int t = 0;
+            for (int j = 0; j <= n; j++) {
+                final int s = l1[j] + l2[n - j];
+                if (t < s) {
+                    t = s;
+                    k = j;
+                }
+            }
+
+            // Step 5: solve simpler problems, recursively
+            out = out.concat(algorithmC(leftFirstPart, right.subSequence(0, k)));
+            out = out.concat(algorithmC(leftSecondPart, right.subSequence(k, n)));
+        }
+
+        return out;
+    }
+
+    // An auxiliary method for CharSequence reversal
+    private static String reverse(final CharSequence s) {
+        return (new StringBuilder(s)).reverse().toString();
+    }
+
+    /**
      * Computes the lcsLengthArray for the sake of doing the actual lcs calculation. This is the
      * dynamic programming portion of the algorithm, and is the reason for the runtime complexity being
      * O(m*n), where m=left.length() and n=right.length().
@@ -149,7 +280,11 @@ public class LongestCommonSubsequence implements SimilarityScore<Integer> {
      * @param left first character sequence
      * @param right second character sequence
      * @return lcsLengthArray
+     * @deprecated Deprecated as of 1.10. A more efficient implementation for calculating LCS is now available.
+     * Use {@link #longestCommonSubsequence(CharSequence, CharSequence)} instead to directly calculate the LCS.
+     * This method will be removed in 2.0.
      */
+    @Deprecated
     public int[][] longestCommonSubstringLengthArray(final CharSequence left, final CharSequence right) {
         final int[][] lcsLengthArray = new int[left.length() + 1][right.length() + 1];
         for (int i = 0; i < left.length(); i++) {
diff --git a/src/test/java/org/apache/commons/text/jmh/LongestCommonSubsequencePerformance.java b/src/test/java/org/apache/commons/text/jmh/LongestCommonSubsequencePerformance.java
new file mode 100644
index 0000000..666758b
--- /dev/null
+++ b/src/test/java/org/apache/commons/text/jmh/LongestCommonSubsequencePerformance.java
@@ -0,0 +1,165 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.commons.text.jmh;
+
+import org.apache.commons.lang3.tuple.ImmutablePair;
+import org.apache.commons.lang3.tuple.Pair;
+import org.apache.commons.text.similarity.LongestCommonSubsequence;
+import org.apache.commons.text.similarity.SimilarityScore;
+import org.openjdk.jmh.annotations.Benchmark;
+import org.openjdk.jmh.annotations.BenchmarkMode;
+import org.openjdk.jmh.annotations.Fork;
+import org.openjdk.jmh.annotations.Level;
+import org.openjdk.jmh.annotations.Measurement;
+import org.openjdk.jmh.annotations.Mode;
+import org.openjdk.jmh.annotations.OutputTimeUnit;
+import org.openjdk.jmh.annotations.Scope;
+import org.openjdk.jmh.annotations.Setup;
+import org.openjdk.jmh.annotations.State;
+import org.openjdk.jmh.annotations.Warmup;
+
+import java.io.BufferedReader;
+import java.io.IOException;
+import java.io.InputStream;
+import java.io.InputStreamReader;
+import java.util.ArrayList;
+import java.util.List;
+import java.util.Objects;
+import java.util.concurrent.TimeUnit;
+
+/**
+ * Performance analysis for LongestCommonSubsequence
+ */
+@BenchmarkMode(Mode.AverageTime)
+@OutputTimeUnit(TimeUnit.MILLISECONDS)
+@Warmup(iterations = 5, time = 1)
+@Measurement(iterations = 5, time = 1)
+@Fork(value = 1, jvmArgs = {"-server", "-Xms512M", "-Xmx512M"})
+public class LongestCommonSubsequencePerformance {
+    @State(Scope.Benchmark)
+    public static class InputData {
+        final List<Pair<CharSequence, CharSequence>> inputs = new ArrayList<>();
+
+        @Setup(Level.Trial)
+        public void setup() {
+            final ClassLoader classloader = Thread.currentThread().getContextClassLoader();
+            try (InputStream is = classloader.getResourceAsStream("lcs-perf-analysis-inputs.csv");
+                 InputStreamReader isr = new InputStreamReader(Objects.requireNonNull(is));
+                 BufferedReader br = new BufferedReader(isr)) {
+                String line;
+                while ((line = br.readLine()) != null && !(line = line.trim()).isEmpty()) {
+                    final int indexOfComma = line.indexOf(',');
+                    final String inputA = line.substring(0, indexOfComma);
+                    final String inputB = line.substring(1 + indexOfComma);
+                    this.inputs.add(ImmutablePair.of(inputA, inputB));
+                }
+            } catch (final IOException exception) {
+                throw new RuntimeException(exception.getMessage(), exception.getCause());
+            }
+        }
+    }
+
+    @Benchmark
+    public void testLCSLenBaseline(final InputData data) {
+        final BaselineLongestCommonSubsequence lcs = new BaselineLongestCommonSubsequence();
+        for (final Pair<CharSequence, CharSequence> input : data.inputs) {
+            lcs.apply(input.getLeft(), input.getRight());
+        }
+    }
+
+    @Benchmark
+    public void testLCSBaseline(final InputData data) {
+        final BaselineLongestCommonSubsequence lcs = new BaselineLongestCommonSubsequence();
+        for (final Pair<CharSequence, CharSequence> input : data.inputs) {
+            lcs.longestCommonSubsequence(input.getLeft(), input.getRight());
+        }
+    }
+
+    @Benchmark
+    public void testLCSLen(final InputData data) {
+        final LongestCommonSubsequence lcs = new LongestCommonSubsequence();
+        for (final Pair<CharSequence, CharSequence> input : data.inputs) {
+            lcs.apply(input.getLeft(), input.getRight());
+        }
+    }
+
+    @Benchmark
+    public void testLCS(final InputData data) {
+        final LongestCommonSubsequence lcs = new LongestCommonSubsequence();
+        for (final Pair<CharSequence, CharSequence> input : data.inputs) {
+            lcs.longestCommonSubsequence(input.getLeft(), input.getRight());
+        }
+    }
+
+    /**
+     * Older implementation of LongestCommonSubsequence.
+     * Code is copied from Apache Commons Text version 1.10.0-SNAPSHOT
+     */
+    private static class BaselineLongestCommonSubsequence implements SimilarityScore<Integer> {
+        @Override
+        public Integer apply(final CharSequence left, final CharSequence right) {
+            if (left == null || right == null) {
+                throw new IllegalArgumentException("Inputs must not be null");
+            }
+            return longestCommonSubsequence(left, right).length();
+        }
+
+        public CharSequence longestCommonSubsequence(final CharSequence left, final CharSequence right) {
+            if (left == null || right == null) {
+                throw new IllegalArgumentException("Inputs must not be null");
+            }
+            final StringBuilder longestCommonSubstringArray = new StringBuilder(Math.max(left.length(), right.length()));
+            final int[][] lcsLengthArray = longestCommonSubstringLengthArray(left, right);
+            int i = left.length() - 1;
+            int j = right.length() - 1;
+            int k = lcsLengthArray[left.length()][right.length()] - 1;
+            while (k >= 0) {
+                if (left.charAt(i) == right.charAt(j)) {
+                    longestCommonSubstringArray.append(left.charAt(i));
+                    i = i - 1;
+                    j = j - 1;
+                    k = k - 1;
+                } else if (lcsLengthArray[i + 1][j] < lcsLengthArray[i][j + 1]) {
+                    i = i - 1;
+                } else {
+                    j = j - 1;
+                }
+            }
+            return longestCommonSubstringArray.reverse().toString();
+        }
+
+        public int[][] longestCommonSubstringLengthArray(final CharSequence left, final CharSequence right) {
+            final int[][] lcsLengthArray = new int[left.length() + 1][right.length() + 1];
+            for (int i = 0; i < left.length(); i++) {
+                for (int j = 0; j < right.length(); j++) {
+                    if (i == 0) {
+                        lcsLengthArray[i][j] = 0;
+                    }
+                    if (j == 0) {
+                        lcsLengthArray[i][j] = 0;
+                    }
+                    if (left.charAt(i) == right.charAt(j)) {
+                        lcsLengthArray[i + 1][j + 1] = lcsLengthArray[i][j] + 1;
+                    } else {
+                        lcsLengthArray[i + 1][j + 1] = Math.max(lcsLengthArray[i + 1][j], lcsLengthArray[i][j + 1]);
+                    }
+                }
+            }
+            return lcsLengthArray;
+        }
+    }
+}
diff --git a/src/test/resources/lcs-perf-analysis-inputs.csv b/src/test/resources/lcs-perf-analysis-inputs.csv
new file mode 100644
index 0000000..d000a26
--- /dev/null
+++ b/src/test/resources/lcs-perf-analysis-inputs.csv
@@ -0,0 +1,5 @@
+"This code is free software; you can redistribute it and/or modify it","under the terms of the GNU General Public License version 2 only, as"
+"You should have received a copy of the GNU General Public License version","2 along with this work; if not, write to the Free Software Foundation,"
+"Here, the field iterations will be populated with appropriate values from the @Param annotation by the JMH when it is passed to the benchmark method. The @Setup annotated method is invoked before each invocation of the benchmark and creates a new Hasher ensuring isolation. When the execution is finished, we'll get a result similar to the one below: When running microbenchmarks, it's very important to be aware of optimizations. Otherwise, they may affect the","benchmark results in a very [...]
+"Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum. This code is free software; you can redistrib [...]
+"But I must explain to you how all this mistaken idea of denouncing pleasure and praising pain was born and I will give you a complete account of the system, and expound the actual teachings of the great explorer of the truth, the master-builder of human happiness. No one rejects, dislikes, or avoids pleasure itself, because it is pleasure, but because those who do not know how to pursue pleasure rationally encounter consequences that are extremely painful. Nor again is there anyone who [...]
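
For readers who want to try the idea outside the patch, here is a minimal standalone sketch of the two techniques the commit describes: a two-row dynamic programming pass that keeps only O(n) integers, and a divide-and-conquer step that uses two such passes to pick a split point before recursing. The class name HirschbergSketch and the method names lastRow and lcs are hypothetical and are not part of Commons Text. The sketch is written from the commit's description and is not a copy of the committed code; in particular it omits the null checks and the argument swapping done by the public methods.

```java
public class HirschbergSketch {

    // Returns the last row of the LCS length table for a and b.
    // Only two rows are kept alive, so extra space is O(b.length()).
    static int[] lastRow(final CharSequence a, final CharSequence b) {
        final int n = b.length();
        int[] prev = new int[n + 1];
        int[] curr = new int[n + 1];
        for (int i = 1; i <= a.length(); i++) {
            // Swap references so the row just computed becomes the previous row.
            final int[] tmp = prev;
            prev = curr;
            curr = tmp;
            for (int j = 1; j <= n; j++) {
                if (a.charAt(i - 1) == b.charAt(j - 1)) {
                    curr[j] = prev[j - 1] + 1;
                } else {
                    curr[j] = Math.max(curr[j - 1], prev[j]);
                }
            }
        }
        return curr;
    }

    // Reverses a character sequence.
    static String reverse(final CharSequence s) {
        return new StringBuilder(s).reverse().toString();
    }

    // Returns one longest common subsequence of a and b.
    static String lcs(final CharSequence a, final CharSequence b) {
        final int m = a.length();
        final int n = b.length();
        if (m == 0 || n == 0) {
            return "";
        }
        if (m == 1) {
            // Base case: a single character either occurs in b or it does not.
            return b.toString().indexOf(a.charAt(0)) >= 0 ? a.toString() : "";
        }
        final int mid = m / 2;
        final CharSequence aHead = a.subSequence(0, mid);
        final CharSequence aTail = a.subSequence(mid, m);
        // Forward pass over the first half, backward pass over the second half.
        final int[] l1 = lastRow(aHead, b);
        final int[] l2 = lastRow(reverse(aTail), reverse(b));
        // Choose the split of b that maximizes the combined LCS length.
        int k = 0;
        int best = -1;
        for (int j = 0; j <= n; j++) {
            final int s = l1[j] + l2[n - j];
            if (s > best) {
                best = s;
                k = j;
            }
        }
        return lcs(aHead, b.subSequence(0, k)) + lcs(aTail, b.subSequence(k, n));
    }

    public static void main(final String[] args) {
        System.out.println(lastRow("ABCBDAB", "BDCABA")[6]);
        System.out.println(lcs("abcxyzqrs", "xyzghfm"));
    }
}
```

When several subsequences share the maximal length, which one is returned depends on how ties are broken while choosing the split point, so callers should rely on the length rather than on a particular string.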
