jenkins-bot has submitted this change and it was merged.

Change subject: Rethink diff limits
......................................................................


Rethink diff limits

Now, instead of "if your changed paragraphs are larger than 10Kb, you're
screwed":
* Instead of relying on overall length, estimate complexity after splitting to 
words
  and taking any equal head and tail out of equation.
* Estimate based on words changed, which better reflects the actual complexity
  of generating a diff.
* New limit is determined scientifically, i.e. "above that number XDebug starts
  complaining about recursion limits reached in Vagrant".

Caveat: if new limits are hit, the consequences are more widespread as all 
adjacent
changed paragraphs are displayed without word level diffs, as opposed to only 
the
paragraph that's too long being affected. However, the new limit is much higher 
and
in wikitext you're supposed to put empty lines between paragraphs anyway, 
negating
this problem.

Bug: T128697
Change-Id: I4e91c7c40f5afdd116b847a859b8517522302489
---
M RELEASE-NOTES-1.28
M autoload.php
A includes/diff/ComplexityException.php
M includes/diff/DairikiDiff.php
M includes/diff/DiffEngine.php
M includes/diff/WordLevelDiff.php
6 files changed, 76 insertions(+), 19 deletions(-)

Approvals:
  Aaron Schulz: Looks good to me, approved
  jenkins-bot: Verified



diff --git a/RELEASE-NOTES-1.28 b/RELEASE-NOTES-1.28
index 6348ac3..a650a50 100644
--- a/RELEASE-NOTES-1.28
+++ b/RELEASE-NOTES-1.28
@@ -43,7 +43,7 @@
 changes to languages because of Phabricator reports.
 
 === Other changes in 1.28 ===
-
+* (T128697) Improved handling of large diffs.
 
 == Compatibility ==
 
diff --git a/autoload.php b/autoload.php
index 8ff952c..27da2ca 100644
--- a/autoload.php
+++ b/autoload.php
@@ -830,6 +830,7 @@
        'MediaWiki\\Auth\\Throttler' => __DIR__ . 
'/includes/auth/Throttler.php',
        'MediaWiki\\Auth\\UserDataAuthenticationRequest' => __DIR__ . 
'/includes/auth/UserDataAuthenticationRequest.php',
        'MediaWiki\\Auth\\UsernameAuthenticationRequest' => __DIR__ . 
'/includes/auth/UsernameAuthenticationRequest.php',
+       'MediaWiki\\Diff\\ComplexityException' => __DIR__ . 
'/includes/diff/ComplexityException.php',
        'MediaWiki\\Diff\\WordAccumulator' => __DIR__ . 
'/includes/diff/WordAccumulator.php',
        'MediaWiki\\Interwiki\\ClassicInterwikiLookup' => __DIR__ . 
'/includes/interwiki/ClassicInterwikiLookup.php',
        'MediaWiki\\Interwiki\\InterwikiLookup' => __DIR__ . 
'/includes/interwiki/InterwikiLookup.php',
diff --git a/includes/diff/ComplexityException.php 
b/includes/diff/ComplexityException.php
new file mode 100644
index 0000000..10ca964
--- /dev/null
+++ b/includes/diff/ComplexityException.php
@@ -0,0 +1,30 @@
+<?php
+/**
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301, 
USA.
+ * http://www.gnu.org/copyleft/gpl.html
+ *
+ * @file
+ * @ingroup DifferenceEngine
+ */
+
+namespace MediaWiki\Diff;
+
+use Exception;
+
+class ComplexityException extends Exception {
+       public function __construct() {
+               parent::__construct( 'Diff is too complex to generate' );
+       }
+}
diff --git a/includes/diff/DairikiDiff.php b/includes/diff/DairikiDiff.php
index dbb32e6..d38319e 100644
--- a/includes/diff/DairikiDiff.php
+++ b/includes/diff/DairikiDiff.php
@@ -205,15 +205,23 @@
        public $edits;
 
        /**
+        * @var int If this diff complexity is exceeded, a ComplexityException 
is thrown
+        *          0 means no limit.
+        */
+       protected $bailoutComplexity = 0;
+
+       /**
         * Constructor.
         * Computes diff between sequences of strings.
         *
         * @param string[] $from_lines An array of strings.
         *   Typically these are lines from a file.
         * @param string[] $to_lines An array of strings.
+        * @throws \MediaWiki\Diff\ComplexityException
         */
        public function __construct( $from_lines, $to_lines ) {
                $eng = new DiffEngine;
+               $eng->setBailoutComplexity( $this->bailoutComplexity );
                $this->edits = $eng->diff( $from_lines, $to_lines );
        }
 
diff --git a/includes/diff/DiffEngine.php b/includes/diff/DiffEngine.php
index 1853b86..babd00b 100644
--- a/includes/diff/DiffEngine.php
+++ b/includes/diff/DiffEngine.php
@@ -22,6 +22,7 @@
  * @file
  * @ingroup DifferenceEngine
  */
+use MediaWiki\Diff\ComplexityException;
 
 /**
  * This diff implementation is mainly lifted from the LCS algorithm of the 
Eclipse project which
@@ -51,6 +52,8 @@
        private $tooLong;
        private $powLimit;
 
+       protected $bailoutComplexity = 0;
+
        // State variables
        private $maxDifferences;
        private $lcsLengthCorrectedForHeuristic = false;
@@ -71,6 +74,7 @@
         *
         * @param string[] $from_lines
         * @param string[] $to_lines
+        * @throws ComplexityException
         *
         * @return DiffOp[]
         */
@@ -126,6 +130,14 @@
                }
 
                return $edits;
+       }
+
+       /**
+        * Sets the complexity (in comparison operations) that can't be exceeded
+        * @param int $value
+        */
+       public function setBailoutComplexity( $value ) {
+               $this->bailoutComplexity = $value;
        }
 
        /**
@@ -265,6 +277,7 @@
        /**
         * @param string[] $from
         * @param string[] $to
+        * @throws ComplexityException
         */
        protected function diffInternal( array $from, array $to ) {
                // remember initial lengths
@@ -323,6 +336,10 @@
                $this->m = count( $this->from );
                $this->n = count( $this->to );
 
+               if ( $this->bailoutComplexity > 0 && $this->m * $this->n > 
$this->bailoutComplexity ) {
+                       throw new ComplexityException();
+               }
+
                $this->removed = $this->m > 0 ? array_fill( 0, $this->m, true ) 
: [];
                $this->added = $this->n > 0 ? array_fill( 0, $this->n, true ) : 
[];
 
diff --git a/includes/diff/WordLevelDiff.php b/includes/diff/WordLevelDiff.php
index 12cf376..296e3b7 100644
--- a/includes/diff/WordLevelDiff.php
+++ b/includes/diff/WordLevelDiff.php
@@ -23,6 +23,7 @@
  * @defgroup DifferenceEngine DifferenceEngine
  */
 
+use MediaWiki\Diff\ComplexityException;
 use MediaWiki\Diff\WordAccumulator;
 
 /**
@@ -31,7 +32,10 @@
  * @ingroup DifferenceEngine
  */
 class WordLevelDiff extends \Diff {
-       const MAX_LINE_LENGTH = 10000;
+       /**
+        * @inheritdoc
+        */
+       protected $bailoutComplexity = 40000000; // Roughly 6K x 6K words 
changed
 
        /**
         * @param string[] $linesBefore
@@ -42,7 +46,12 @@
                list( $wordsBefore, $wordsBeforeStripped ) = $this->split( 
$linesBefore );
                list( $wordsAfter, $wordsAfterStripped ) = $this->split( 
$linesAfter );
 
-               parent::__construct( $wordsBeforeStripped, $wordsAfterStripped 
);
+               try {
+                       parent::__construct( $wordsBeforeStripped, 
$wordsAfterStripped );
+               } catch ( ComplexityException $ex ) {
+                       // Too hard to diff, just show whole paragraph(s) as 
changed
+                       $this->edits = [ new DiffOpChange( $linesBefore, 
$linesAfter ) ];
+               }
 
                $xi = $yi = 0;
                $editCount = count( $this->edits );
@@ -73,28 +82,20 @@
                $stripped = [];
                $first = true;
                foreach ( $lines as $line ) {
-                       # If the line is too long, just pretend the entire line 
is one big word
-                       # This prevents resource exhaustion problems
                        if ( $first ) {
                                $first = false;
                        } else {
                                $words[] = "\n";
                                $stripped[] = "\n";
                        }
-                       if ( strlen( $line ) > self::MAX_LINE_LENGTH ) {
-                               $words[] = $line;
-                               $stripped[] = $line;
-                       } else {
-                               $m = [];
-                               if ( preg_match_all( '/ ( [^\S\n]+ | 
[0-9_A-Za-z\x80-\xff]+ | . ) (?: (?!< \n) [^\S\n])? /xs',
-                                       $line, $m )
-                               ) {
-                                       foreach ( $m[0] as $word ) {
-                                               $words[] = $word;
-                                       }
-                                       foreach ( $m[1] as $stripped_word ) {
-                                               $stripped[] = $stripped_word;
-                                       }
+                       $m = [];
+                       if ( preg_match_all( '/ ( [^\S\n]+ | 
[0-9_A-Za-z\x80-\xff]+ | . ) (?: (?!< \n) [^\S\n])? /xs',
+                               $line, $m ) ) {
+                               foreach ( $m[0] as $word ) {
+                                       $words[] = $word;
+                               }
+                               foreach ( $m[1] as $stripped_word ) {
+                                       $stripped[] = $stripped_word;
                                }
                        }
                }

-- 
To view, visit https://gerrit.wikimedia.org/r/285567
To unsubscribe, visit https://gerrit.wikimedia.org/r/settings

Gerrit-MessageType: merged
Gerrit-Change-Id: I4e91c7c40f5afdd116b847a859b8517522302489
Gerrit-PatchSet: 11
Gerrit-Project: mediawiki/core
Gerrit-Branch: master
Gerrit-Owner: MaxSem <[email protected]>
Gerrit-Reviewer: Aaron Schulz <[email protected]>
Gerrit-Reviewer: Addshore <[email protected]>
Gerrit-Reviewer: EBernhardson <[email protected]>
Gerrit-Reviewer: Kaldari <[email protected]>
Gerrit-Reviewer: Legoktm <[email protected]>
Gerrit-Reviewer: Yurik <[email protected]>
Gerrit-Reviewer: jenkins-bot <>

_______________________________________________
MediaWiki-commits mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/mediawiki-commits

Reply via email to