Thanks for the test data. I installed a Gnulib patch, described here:
http://lists.gnu.org/archive/html/bug-gnulib/2016-10/msg00157.html
along with the two diffutils patches attached below. This fixed the
problem for me, so I am closing the bug report.
From 571f01c069dfc7f860e5500a5d08ebfdaf9c3068 Mon Sep 17 00:00:00 2001
From: Paul Eggert <[email protected]>
Date: Tue, 25 Oct 2016 21:52:31 -0700
Subject: [PATCH 1/2] build: update gnulib submodule to latest
---
gnulib | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/gnulib b/gnulib
index e63f5eb..d55ed06 160000
--- a/gnulib
+++ b/gnulib
@@ -1 +1 @@
-Subproject commit e63f5eb55570a1ea3e51ce47df33689785e085c1
+Subproject commit d55ed0636eefeb43ad1aa8123416c33839652646
--
2.7.4
From 68b82f6f8419a815cfcf962b3061352d414dc606 Mon Sep 17 00:00:00 2001
From: Paul Eggert <[email protected]>
Date: Tue, 25 Oct 2016 21:57:56 -0700
Subject: [PATCH 2/2] diff: fix big performance degradation in 3.4
* NEWS, doc/diffutils.texi (Overview): Document this.
* src/analyze.c (diff_2_files): Restore too_expensive heuristic,
but this time with a floor that is 16 times the old floor. This
should fix Bug#16848, by generating good-quality output for its
test case, while not introducing Bug#24715, by running nearly as
fast as diff-3.3 for that test case.
---
NEWS | 5 +++++
doc/diffutils.texi | 5 ++++-
src/analyze.c | 11 ++++++++++-
3 files changed, 19 insertions(+), 2 deletions(-)
diff --git a/NEWS b/NEWS
index 473332e..88f5d2b 100644
--- a/NEWS
+++ b/NEWS
@@ -6,6 +6,11 @@ GNU diffutils NEWS -*- outline -*-
diff no longer mishandles line numbers exceeding 2**31 on Mingw-w64.
+** Performance changes
+
+ diff's default algorithm has been tweaked to deal better with larger
+ files, reversing some of the changes made in diffutils-3.4.
+
* Noteworthy changes in release 3.5 (2016-08-20) [stable]
diff --git a/doc/diffutils.texi b/doc/diffutils.texi
index b478380..ceaf557 100644
--- a/doc/diffutils.texi
+++ b/doc/diffutils.texi
@@ -161,7 +161,10 @@ The algorithm was independently discovered as described by Esko Ukkonen in
@c Date: Wed, 29 Sep 1993 08:27:55 MST
@c Ukkonen should be given credit for also discovering the algorithm used
@c in GNU diff.
-Related algorithms are surveyed by Alfred V. Aho in
+Unless the @option{--minimal} option is used, @command{diff} uses a
+heuristic by Paul Eggert that limits the cost to @math{O(N^{1.5} \log N)}
+at the price of producing suboptimal output for large inputs with many
+differences. Related algorithms are surveyed by Alfred V. Aho in
section 6.3 of ``Algorithms for Finding Patterns in Strings'',
@cite{Handbook of Theoretical Computer Science} (Jan Van Leeuwen,
ed.), Vol.@: A, @cite{Algorithms and Complexity}, Elsevier/MIT Press,
diff --git a/src/analyze.c b/src/analyze.c
index 3981872..847b8b0 100644
--- a/src/analyze.c
+++ b/src/analyze.c
@@ -534,6 +534,7 @@ diff_2_files (struct comparison *cmp)
{
struct context ctxt;
lin diags;
+ lin too_expensive;
/* Allocate vectors for the results of comparison:
a flag for each line of each file, saying whether that line
@@ -565,11 +566,19 @@ diff_2_files (struct comparison *cmp)
ctxt.heuristic = speed_large_files;
+ /* Set TOO_EXPENSIVE to be the approximate square root of the
+ input size, bounded below by 4096. 4096 seems to be good for
+ circa-2016 CPUs; see Bug#16848 and Bug#24715. */
+ too_expensive = 1;
+ for (; diags != 0; diags >>= 2)
+ too_expensive <<= 1;
+ ctxt.too_expensive = MAX (4096, too_expensive);
+
files[0] = cmp->file[0];
files[1] = cmp->file[1];
compareseq (0, cmp->file[0].nondiscarded_lines,
- 0, cmp->file[1].nondiscarded_lines, &ctxt);
+ 0, cmp->file[1].nondiscarded_lines, minimal, &ctxt);
free (ctxt.fdiag - (cmp->file[1].nondiscarded_lines + 1));
--
2.7.4
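
For anyone curious how the restored bound behaves, here is a minimal
standalone sketch of the computation in the analyze.c hunk above.
Only the shift loop, the 4096 floor, and the "approximate square
root" idea come from the patch; the lin typedef, the MAX macro, and
the sample inputs are illustrative stand-ins, not diffutils code.

  #include <stdio.h>

  #define MAX(a, b) ((a) > (b) ? (a) : (b))
  typedef long lin;

  /* Double the bound for each factor-of-4 reduction of DIAGS,
     yielding roughly sqrt (DIAGS), then apply the 4096 floor.  */
  static lin
  too_expensive_bound (lin diags)
  {
    lin too_expensive = 1;
    for (; diags != 0; diags >>= 2)
      too_expensive <<= 1;
    return MAX (4096, too_expensive);
  }

  int
  main (void)
  {
    printf ("%ld\n", too_expensive_bound (2000000));   /* 4096; the floor wins */
    printf ("%ld\n", too_expensive_bound (100000000)); /* 16384, roughly sqrt */
    return 0;
  }

Since the loop halves the bit count of DIAGS, the result is
2^ceil(bits/2), so the 4096 floor dominates for inputs below about
2^24 diagonals; with the old floor of 256 (4096 is 16 times that, per
the commit message) the heuristic kicked in much sooner, which is
what degraded output quality in Bug#16848.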