Thanks for the test data. I installed a Gnulib patch, described here:
http://lists.gnu.org/archive/html/bug-gnulib/2016-10/msg00157.html
along with the two diffutils patches attached below. This fixed the
problem for me, so I am closing the bug report.
From 571f01c069dfc7f860e5500a5d08ebfdaf9c3068 Mon Sep 17 00:00:00 2001
From: Paul Eggert <[email protected]>
Date: Tue, 25 Oct 2016 21:52:31 -0700
Subject: [PATCH 1/2] build: update gnulib submodule to latest
---
gnulib | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/gnulib b/gnulib
index e63f5eb..d55ed06 160000
--- a/gnulib
+++ b/gnulib
@@ -1 +1 @@
-Subproject commit e63f5eb55570a1ea3e51ce47df33689785e085c1
+Subproject commit d55ed0636eefeb43ad1aa8123416c33839652646
--
2.7.4
From 68b82f6f8419a815cfcf962b3061352d414dc606 Mon Sep 17 00:00:00 2001
From: Paul Eggert <[email protected]>
Date: Tue, 25 Oct 2016 21:57:56 -0700
Subject: [PATCH 2/2] diff: fix big performance degradation in 3.4
* NEWS, doc/diffutils.texi (Overview): Document this.
* src/analyze.c (diff_2_files): Restore too_expensive heuristic,
but this time with a floor that is 16 times the old floor. This
should fix Bug#16848, by generating good-quality output for its
test case, while not introducing Bug#24715, by running nearly as
fast as diff-3.3 for that test case.
---
NEWS | 5 +++++
doc/diffutils.texi | 5 ++++-
src/analyze.c | 11 ++++++++++-
3 files changed, 19 insertions(+), 2 deletions(-)
diff --git a/NEWS b/NEWS
index 473332e..88f5d2b 100644
--- a/NEWS
+++ b/NEWS
@@ -6,6 +6,11 @@ GNU diffutils NEWS -*- outline -*-
diff no longer mishandles line numbers exceeding 2**31 on Mingw-w64.
+** Performance changes
+
+ diff's default algorithm has been tweaked to deal better with larger
+ files, reversing some of the changes made in diffutils-3.4.
+
* Noteworthy changes in release 3.5 (2016-08-20) [stable]
diff --git a/doc/diffutils.texi b/doc/diffutils.texi
index b478380..ceaf557 100644
--- a/doc/diffutils.texi
+++ b/doc/diffutils.texi
@@ -161,7 +161,10 @@ The algorithm was independently discovered as described by Esko Ukkonen in
@c Date: Wed, 29 Sep 1993 08:27:55 MST
@c Ukkonen should be given credit for also discovering the algorithm used
@c in GNU diff.
-Related algorithms are surveyed by Alfred V. Aho in
+Unless the @option{--minimal} option is used, @command{diff} uses a
+heuristic by Paul Eggert that limits the cost to @math{O(N^{1.5} \log N)}
+at the price of producing suboptimal output for large inputs with many
+differences. Related algorithms are surveyed by Alfred V. Aho in
section 6.3 of ``Algorithms for Finding Patterns in Strings'',
@cite{Handbook of Theoretical Computer Science} (Jan Van Leeuwen,
ed.), Vol.@: A, @cite{Algorithms and Complexity}, Elsevier/MIT Press,
diff --git a/src/analyze.c b/src/analyze.c
index 3981872..847b8b0 100644
--- a/src/analyze.c
+++ b/src/analyze.c
@@ -534,6 +534,7 @@ diff_2_files (struct comparison *cmp)
{
struct context ctxt;
lin diags;
+ lin too_expensive;
/* Allocate vectors for the results of comparison:
a flag for each line of each file, saying whether that line
@@ -565,11 +566,19 @@ diff_2_files (struct comparison *cmp)
ctxt.heuristic = speed_large_files;
+ /* Set TOO_EXPENSIVE to be the approximate square root of the
+ input size, bounded below by 4096. 4096 seems to be good for
+ circa-2016 CPUs; see Bug#16848 and Bug#24715. */
+ too_expensive = 1;
+ for (; diags != 0; diags >>= 2)
+ too_expensive <<= 1;
+ ctxt.too_expensive = MAX (4096, too_expensive);
+
files[0] = cmp->file[0];
files[1] = cmp->file[1];
compareseq (0, cmp->file[0].nondiscarded_lines,
- 0, cmp->file[1].nondiscarded_lines, &ctxt);
+ 0, cmp->file[1].nondiscarded_lines, minimal, &ctxt);
free (ctxt.fdiag - (cmp->file[1].nondiscarded_lines + 1));
--
2.7.4
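
For anyone curious how the restored bound behaves, here is a minimal
standalone sketch of the computation in the analyze.c hunk above.
Only the shift loop, the 4096 floor, and the "approximate square
root" idea come from the patch; the lin typedef, the MAX macro, and
the sample inputs are illustrative stand-ins, not diffutils code.

  #include <stdio.h>

  #define MAX(a, b) ((a) > (b) ? (a) : (b))
  typedef long lin;

  /* Double the bound for each factor-of-4 reduction of DIAGS,
     yielding roughly sqrt (DIAGS), then apply the 4096 floor.  */
  static lin
  too_expensive_bound (lin diags)
  {
    lin too_expensive = 1;
    for (; diags != 0; diags >>= 2)
      too_expensive <<= 1;
    return MAX (4096, too_expensive);
  }

  int
  main (void)
  {
    printf ("%ld\n", too_expensive_bound (2000000));   /* 4096; the floor wins */
    printf ("%ld\n", too_expensive_bound (100000000)); /* 16384, roughly sqrt */
    return 0;
  }

Since the loop halves the bit count of DIAGS, the result is
2^ceil(bits/2), so the 4096 floor dominates for inputs below about
2^24 diagonals; with the old floor of 256 (4096 is 16 times that, per
the commit message) the heuristic kicked in much sooner, which is
what degraded output quality in Bug#16848.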