[MediaWiki-commits] [Gerrit] (bug 46867) trim bad utf-8 sequences before normalizing. - change (mediawiki...Wikibase)

Daniel Kinzler (Code Review) Mon, 24 Jun 2013 03:49:19 -0700

Daniel Kinzler has uploaded a new change for review.

  https://gerrit.wikimedia.org/r/70139



Change subject: (bug 46867) trim bad utf-8 sequences before normalizing.
......................................................................

(bug 46867) trim bad utf-8 sequences before normalizing.

This adds code to detect and remove incomplete UTF-8 sequences at the beginning
and end of strings when normalizing them for internal use, e.g. in search keys.

This is a) sensible by itself and b) needed to work around the fact that
preg_replace with the /u modifier fails for the whole input (it returns null,
which PHP treats as an empty string) when it encounters an invalid UTF-8
sequence anywhere in the input.
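
The failure mode and the workaround can be sketched as follows. This is only
an illustration under stated assumptions: trimTruncatedTail() is a hypothetical
helper written for this example and is not the removeBadCharLast() method added
by the change. preg_replace signals the failure with null, which becomes an
empty string as soon as the result is used as a string.

```php
<?php
// Illustrative sketch only; trimTruncatedTail() is a hypothetical helper,
// not the removeBadCharLast() method added by this change.

// 1) With the /u modifier, invalid UTF-8 anywhere in the subject makes
//    preg_replace fail for the whole string.
$broken = "\xd0\xb5\xd0"; // Cyrillic "е" followed by a dangling lead byte
$result = preg_replace( '/\s+/u', ' ', $broken );
var_dump( $result === null );                          // bool(true)
var_dump( preg_last_error() === PREG_BAD_UTF8_ERROR ); // bool(true)

// 2) Byte-wise trim (no /u modifier) of an incomplete sequence at the end:
//    a lead byte followed by fewer continuation bytes than it announces.
function trimTruncatedTail( $s ) {
	$trimmed = preg_replace(
		'/(?:[\xc0-\xdf]|[\xe0-\xef][\x80-\xbf]?|[\xf0-\xf7][\x80-\xbf]{0,2})$/D',
		'',
		$s
	);
	// Fall back to the input if the regex engine reports an error.
	return $trimmed === null ? $s : $trimmed;
}

$fixed = trimTruncatedTail( $broken );
var_dump( $fixed === "\xd0\xb5" );                                // bool(true)
var_dump( preg_replace( '/\s+/u', ' ', $fixed ) === "\xd0\xb5" ); // bool(true)
```

Trimming first means the later /u-mode replacements see valid input, so a
string that was merely cut off mid-character no longer normalizes to nothing.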

Change-Id: I702e01b3f021bb2e86fb309e0d51db2a10475ac2
---
M lib/includes/Term.php
M lib/includes/Utils.php
M lib/tests/phpunit/TermTest.php
3 files changed, 131 insertions(+), 10 deletions(-)


  git pull ssh://gerrit.wikimedia.org:29418/mediawiki/extensions/Wikibase refs/changes/39/70139/1

diff --git a/lib/includes/Term.php b/lib/includes/Term.php
index 6b25347..1167c23 100644
--- a/lib/includes/Term.php
+++ b/lib/includes/Term.php
@@ -229,19 +229,42 @@
         * @return string
         */
        public static function normalizeText( $text, $lang = 'en' ) {
-               // \p{Z} - whitespace
-               // \p{C} - control chars
-               $text = preg_replace( '/^[\p{Z}\p{C}]+|[\p{Z}\p{C}]+$/u', '', $text );
-               $text = preg_replace( '/[\p{C}]+/u', ' ', $text );
+               if ( $text === '' ) {
+                       return '';
+               }
 
                // composed normal form
-               $text = Utils::cleanupToNFC( $text );
+               $nfcText = Utils::cleanupToNFC( $text );
+
+               if ( !is_string( $nfcText ) || $nfcText === '' ) {
+                       wfWarn( "Unicode normalization failed for `$text`" );
+               }
+
+               // \p{Z} - whitespace
+               // \p{C} - control chars
+               // WARNING: *any* invalid UTF8 sequence causes preg_replace to return an empty string.
+               $strippedText = $nfcText;
+               $strippedText = preg_replace( '/[\p{Cc}\p{Cf}\p{Cn}\p{Cs}]+/u', ' ', $strippedText );
+               $strippedText = preg_replace( '/^[\p{Z}]+|[\p{Z}]+$/u', '', $strippedText );
+
+               if ( $strippedText === '' ) {
+                       // NOTE: This happens when there is only whitespace in the string.
+                       //       However, preg_replace will also return an empty string if it
+                       //       encounters any invalid utf-8 sequence.
+                       return '';
+               }
 
                //TODO: Use Language::lc to convert to lower case.
                //      But that requires us to load ALL the language objects,
                //      which loads ALL the messages, which makes us run out
                //      of RAM (see bug 41103).
-               return mb_strtolower( $text, 'UTF-8' );
+               $normalized = mb_strtolower( $strippedText, 'UTF-8' );
+
+               if ( !is_string( $normalized ) || $normalized === '' ) {
+                       wfWarn( "mb_strtolower normalization failed for `$strippedText`" );
+               }
+
+               return $normalized;
        }
 
        /**
diff --git a/lib/includes/Utils.php b/lib/includes/Utils.php
index c6ce523..dcc0a1e 100644
--- a/lib/includes/Utils.php
+++ b/lib/includes/Utils.php
@@ -189,6 +189,66 @@
        }
 
        /**
+        * Remove bytes that represent an incomplete Unicode character
+        * at the end of string (e.g. bytes of the char are missing)
+        *
+        * @todo: this was stolen from the Language class. Make that code reusable.
+        *
+        * @param $string String
+        * @return string
+        */
+       public static function removeBadCharLast( $string ) {
+               if ( $string != '' ) {
+                       $char = ord( $string[strlen( $string ) - 1] );
+                       $m = array();
+                       if ( $char >= 0xc0 ) {
+                               # We got the first byte only of a multibyte char; remove it.
+                               $string = substr( $string, 0, -1 );
+                       } elseif ( $char >= 0x80 &&
+                               preg_match( '/^(.*)(?:[\xe0-\xef][\x80-\xbf]|' .
+                                       '[\xf0-\xf7][\x80-\xbf]{1,2})$/', $string, $m )
+                       ) {
+                               # We chopped in the middle of a character; remove it
+                               $string = $m[1];
+                       }
+               }
+               return $string;
+       }
+
+       /**
+        * Remove bytes that represent an incomplete Unicode character
+        * at the start of string (e.g. bytes of the char are missing)
+        *
+        * @todo: this was stolen from the Language class. Make that code reusable.
+        *
+        * @param $string String
+        * @return string
+        */
+       public static function removeBadCharFirst( $string ) {
+               if ( $string != '' ) {
+                       $char = ord( $string[0] );
+                       if ( $char >= 0x80 && $char < 0xc0 ) {
+                               # We chopped in the middle of a character; remove the whole thing
+                               $string = preg_replace( '/^[\x80-\xbf]+/', '', $string );
+                       }
+               }
+               return $string;
+       }
+
+       /**
+        * Remove incomplete UTF-8 sequences from the beginning and end of the string.
+        *
+        * @param $string
+        *
+        * @return $string
+        */
+       public static function trimBadChars( $string ) {
+               $string = self::removeBadCharFirst( $string );
+               $string = self::removeBadCharLast( $string );
+               return $string;
+       }
+
+       /**
         * Trim initial and trailing whitespace and control chars, and optionally compress internal ones.
         *
         * @since 0.1
@@ -198,8 +258,11 @@
         * @return string where whitespace possibly are removed.
         */
        static public function trimWhitespace( $inputString ) {
+               $inputString = self::trimBadChars( $inputString );
+
                // \p{Z} - whitespace
                // \p{Cc} - control chars
+               // WARNING: *any* invalid UTF8 sequence causes preg_replace to return an empty string.
                $trimmed = preg_replace( '/^[\p{Z}\p{Cc}]+|[\p{Z}\p{Cc}]+$/u', '', $inputString );
                $trimmed = preg_replace( '/[\p{Cc}]+/u', ' ', $trimmed );
                return $trimmed;
@@ -212,10 +275,13 @@
         *
         * @param string $inputString The actual string to process.
         *
-        * @return string where whitespace possibly are removed.
+        * @return string NFC form of the input, with (some) invalid sequences removed.
         */
        static public function cleanupToNFC( $inputString ) {
-               return UtfNormal::cleanUp( $inputString );
+               $cleaned = $inputString;
+               $cleaned = self::trimBadChars( $cleaned );
+               $cleaned = UtfNormal::cleanUp( $cleaned );
+               return $cleaned;
        }
 
        /**
@@ -225,7 +291,7 @@
         *
         * @param string $inputString
         *
-        * @return string on NFC form
+        * @return string on NFC form and leading and trailing whitespace removed.
         */
        static public function trimToNFC( $inputString ) {
                return self::cleanupToNFC( self::trimWhitespace( $inputString ) );
diff --git a/lib/tests/phpunit/TermTest.php b/lib/tests/phpunit/TermTest.php
index 64331eb..7a9c3f2 100644
--- a/lib/tests/phpunit/TermTest.php
+++ b/lib/tests/phpunit/TermTest.php
@@ -110,6 +110,36 @@
                                "\xE2\x80\x8F\xE2\x80\x8Cfoo\xE2\x80\x8Cbar\xE2\x80\xA9", // raw
                                "foo bar", // normalized
                        ),
+
+                       array( // #7: Private Use Area: U+0F818
+                               "\xef\xa0\x98",
+                               "\xef\xa0\x98"
+                       ),
+
+                       array( // #8: Latin Extended-D: U+0A7AA
+                               "\xea\x9e\xaa",
+                               "\xea\x9e\xaa",
+                       ),
+
+                       array( // #9: badly truncated cyrillic:
+                               "\xd0\xb5\xd0",
+                               "\xd0\xb5",
+                       ),
+
+                       array( // #10: badly truncated katakana:
+                               "\xe3\x82\xa6\xe3\x83",
+                               "\xe3\x82\xa6"
+                       ),
+
+                       array( // #11: empty
+                               "",
+                               ""
+                       ),
+
+                       array( // #12: just blanks
+                               " \n ",
+                               ""
+                       ),
                );
        }
 
@@ -120,7 +150,9 @@
                $term = new Term( array() );
 
                $term->setText( $raw );
-               $this->assertEquals( $normalized, $term->getNormalizedText() );
+
+               $actual = $term->getNormalizedText();
+               $this->assertEquals( $normalized, $actual );
        }
 
        public function testClone() {

-- 
To view, visit https://gerrit.wikimedia.org/r/70139
To unsubscribe, visit https://gerrit.wikimedia.org/r/settings

Gerrit-MessageType: newchange
Gerrit-Change-Id: I702e01b3f021bb2e86fb309e0d51db2a10475ac2
Gerrit-PatchSet: 1
Gerrit-Project: mediawiki/extensions/Wikibase
Gerrit-Branch: master
Gerrit-Owner: Daniel Kinzler <[email protected]>

_______________________________________________
MediaWiki-commits mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/mediawiki-commits
