ali-ghanbari commented on a change in pull request #233:
URL: https://github.com/apache/commons-text/pull/233#discussion_r674469665
##########
File path:
src/main/java/org/apache/commons/text/similarity/LongestCommonSubsequence.java
##########
@@ -120,24 +170,72 @@ public CharSequence longestCommonSubsequence(final
CharSequence left, final Char
if (left == null || right == null) {
throw new IllegalArgumentException("Inputs must not be null");
}
- final StringBuilder longestCommonSubstringArray = new
StringBuilder(Math.max(left.length(), right.length()));
- final int[][] lcsLengthArray = longestCommonSubstringLengthArray(left,
right);
- int i = left.length() - 1;
- int j = right.length() - 1;
- int k = lcsLengthArray[left.length()][right.length()] - 1;
- while (k >= 0) {
- if (left.charAt(i) == right.charAt(j)) {
- longestCommonSubstringArray.append(left.charAt(i));
- i = i - 1;
- j = j - 1;
- k = k - 1;
- } else if (lcsLengthArray[i + 1][j] < lcsLengthArray[i][j + 1]) {
- i = i - 1;
- } else {
- j = j - 1;
- }
+ // Find lengths of two strings
+ final int leftSz = left.length();
+ final int rightSz = right.length();
+
+ // Check if we can save even more space
+ if (leftSz < rightSz) {
+ return algorithmC(right, rightSz, left, leftSz);
}
- return longestCommonSubstringArray.reverse().toString();
+ return algorithmC(left, leftSz, right, rightSz);
+ }
+
+ /**
+ * An implementation of "ALG C" from Hirschberg's paper <a
href="https://dl.acm.org/doi/10.1145/360825.360861">A linear space algorithm
for computing maximal common subsequences</a>.
+ * Assuming the sequence <code>left</code> is of size <code>m</code> and
the sequence <code>right</code> is of size <code>n</code>,
+ * this method returns Longest Common Subsequence (LCS) the two sequences.
+ * As per the paper, this method runs in O(m * n) time and O(m + n) space.
+ *
+ * @param left Left sequence
+ * @param m Length of left sequence
+ * @param right Right sequence
+ * @param n Length of right sequence
+ * @return LCS of <code>left</code> and <code>right</code>
+ */
+ static CharSequence algorithmC(final CharSequence left, final int m,
Review comment:
> ... Algorithm A and Algorithm B are very similar, but I think A
requires O(mn) time and O(mn) space. While Algorithm B requires O(mn) time and
O(m+n) space.
Right, in terms of time complexity, yes, they all are quadratic (O(mn), to
be precise) in the size of input strings.
Algorithm B and C have O(m+n) space complexity, while Algorithm A has O(mn)
space complexity (it returns an entire DP array; just like the method that we
marked as deprecated).
> And, Algorithm C is the third one in that paper, with O(mn) time...
Right.
> Past midnight here, long time with ...
Forgot this comment!
> The LongestCommonSubsequence is a function supposed to be called as a
BiFunction, with apply() being the main method, and entry point to the
algorithm.
> With the changes here, I think now apply uses Algorithm B. While this
method with Algorithm C is called only from the public longestCommonSubsequence.
> Is there a reason for having both implementations here, instead of just
Algorithm C and calling it from the apply method?
Algorithm C calculates the LCS itself (not just its length). If the class
`LongestCommonSubsequence` is not supposed to calculate the LCS of two strings,
then there is no reason to keep Algorithm C.
To implement `apply` method, we could use Algorithm C, as well. We could
call `algorithmC` and take the length of the returned CharSequence and return
it, but that would involve extra computations; Algorithm B can be directly used.
Thanks @kinow your constructive comments and your hard work! 👍 🥇
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]