[MediaWiki-commits] [Gerrit] Remove first letters that have an overlapping prefix. - change (mediawiki/core)

jenkins-bot (Code Review) Mon, 08 Apr 2013 16:05:53 -0700

jenkins-bot has submitted this change and it was merged.

Change subject: Remove first letters that have an overlapping prefix.
......................................................................



Remove first letters that have an overlapping prefix.

First letters are supposed to be primary collation elements.
However, we do not want expansions to be considered
as firstletters (aka thorn "þ" -> "th" which isn't
the same as any other first letter (since "t" !== "th" )
however if þ was a first letter, the word "the" and
even worse the word "too" would be sorted under it, which
is wrong.

Looking for feedback if this all sounds sane. I have tested
it, it got rid of the contractions while at the same time
not removing any letter it wasn't supposed to.

Once this is merged, we could get rid of all the
-<langcode> entries. The other firstLetter array
entries for tailorings could be merged into
generateCollationData.php too, since incorrect
things would get pruned automatically, which
would probably make the logic in Collation.php
simpler.

Bug: 43740
Change-Id: I4bd3d39ec2938a53e2c6728adc48ee6cf9778d74
---
M includes/Collation.php
A tests/phpunit/includes/CollationTest.php
2 files changed, 175 insertions(+), 2 deletions(-)

Approvals:
  Tim Starling: Looks good to me, approved
  jenkins-bot: Verified



diff --git a/includes/Collation.php b/includes/Collation.php
index 107e8f9..21448f7 100644
--- a/includes/Collation.php
+++ b/includes/Collation.php
@@ -208,14 +208,12 @@
                'be-tarask' => array( "Ё" ),
                'en' => array(),
                'fi' => array( "Å", "Ä", "Ö" ),
-               '-fi' => array( "Ǥ", "Ŋ", "Ŧ", "Ʒ" ), // sorted like G, N, T, Z 
- bug 46330
                'hu' => array( "Cs", "Dz", "Dzs", "Gy", "Ly", "Ny", "Ö", "Sz", 
"Ty", "Ü", "Zs" ),
                'it' => array(),
                'pl' => array( "Ą", "Ć", "Ę", "Ł", "Ń", "Ó", "Ś", "Ź", "Ż" ),
                'pt' => array(),
                'ru' => array(),
                'sv' => array( "Å", "Ä", "Ö" ),
-               '-sv' => array( "Þ" ), // sorted as "th" in Swedish, causing 
unexpected output - bug 45446
                'uk' => array( "Ґ", "Ь" ),
                'vi' => array( "Ă", "Â", "Đ", "Ê", "Ô", "Ơ", "Ư" ),
                // Not verified, but likely correct
@@ -397,6 +395,72 @@
                        }
                }
                ksort( $letterMap, SORT_STRING );
+               // Remove duplicate prefixes. Basically if something has a 
sortkey
+               // which is a prefix of some other sortkey, then it is an
+               // expansion and probably should not be considered a section
+               // header.
+               //
+               // For example 'þ' is sometimes sorted as if it is the letters
+               // 'th'. Other times it is its own primary element. Another
+               // example is '₨'. Sometimes its a currency symbol. Sometimes it
+               // is an 'R' followed by an 's'.
+               //
+               // Additionally an expanded element should always sort directly
+               // after its first element due to they way sortkeys work.
+               //
+               // UCA sortkey elements are of variable length but no collation
+               // element should be a prefix of some other element, so I think
+               // this is safe. See:
+               // * 
https://ssl.icu-project.org/repos/icu/icuhtml/trunk/design/collation/ICU_collation_design.htm
+               // * 
http://site.icu-project.org/design/collation/uca-weight-allocation
+               //
+               // Additionally, there is something called primary compression 
to
+               // worry about. Basically, if you have two primary elements that
+               // are more than one byte and both start with the same byte then
+               // the first byte is dropped on the second primary. Additionally
+               // either \x03 or \xFF may be added to mean that the next 
primary
+               // does not start with the first byte of the first primary.
+               //
+               // This shouldn't matter much, as the first primary is not
+               // changed, and that is what we are comparing against.
+               //
+               // tl;dr: This makes some assumptions about how icu implements
+               // collations. It seems incredibly unlikely these assumptions
+               // will change, but nonetheless they are assumptions.
+
+               $prev = false;
+               $duplicatePrefixes = array();
+               foreach( $letterMap as $key => $value ) {
+                       // Remove terminator byte. Otherwise the prefix
+                       // comparison will get hung up on that.
+                       $trimmedKey = rtrim( $key, "\0" );
+                       if ( $prev === false || $prev === '' ) {
+                               $prev = $trimmedKey;
+                               // We don't yet have a collation element
+                               // to compare against, so continue.
+                               continue;
+                       }
+
+                       // Due to the fact the array is sorted, we only have
+                       // to compare with the element directly previous
+                       // to the current element (skipping expansions).
+                       // An element "X" will always sort directly
+                       // before "XZ" (Unless we have "XY", but we
+                       // do not update $prev in that case).
+                       if ( substr( $trimmedKey, 0, strlen( $prev ) ) === 
$prev ) {
+                               $duplicatePrefixes[] = $key;
+                               // If this is an expansion, we don't want to
+                               // compare the next element to this element,
+                               // but to what is currently $prev
+                               continue;
+                       }
+                       $prev = $trimmedKey;
+               }
+               foreach( $duplicatePrefixes as $badKey ) {
+                       wfDebug( "Removing '{$letterMap[$badKey]}' from first 
letters." );
+                       unset( $letterMap[$badKey] );
+                       // This code assumes that unsetting does not change 
sort order.
+               }
                $data = array(
                        'chars' => array_values( $letterMap ),
                        'keys' => array_keys( $letterMap ),
diff --git a/tests/phpunit/includes/CollationTest.php 
b/tests/phpunit/includes/CollationTest.php
new file mode 100644
index 0000000..c746208
--- /dev/null
+++ b/tests/phpunit/includes/CollationTest.php
@@ -0,0 +1,109 @@
+<?php
+class CollationTest extends MediaWikiLangTestCase {
+       protected function setUp() {
+               parent::setUp();
+               if ( !extension_loaded( 'intl' ) ) {
+                       $this->markTestSkipped( 'These tests require intl 
extension' );
+               }
+       }
+
+       /**
+        * Test to make sure, that if you
+        * have "X" and "XY", the binary
+        * sortkey also has "X" being a
+        * prefix of "XY". Our collation
+        * code makes this assumption.
+        *
+        * @param $lang String Language code for collator
+        * @param $base String Base string
+        * @param $extended String String containing base as a prefix.
+        *
+        * @dataProvider prefixDataProvider
+        */
+       function testIsPrefix( $lang, $base, $extended ) {
+               $cp = Collator::create( $lang );
+               $cp->setStrength( Collator::PRIMARY );
+               $baseBin = $cp->getSortKey( $base );
+               // Remove sortkey terminator
+               $baseBin = rtrim( $baseBin, "\0" );
+               $extendedBin = $cp->getSortKey( $extended );
+               $this->assertStringStartsWith( $baseBin, $extendedBin, "$base 
is not a prefix of $extended" );
+       }
+
+       function prefixDataProvider() {
+               return array(
+                       array( 'en', 'A', 'AA' ),
+                       array( 'en', 'A', 'AAA' ),
+                       array( 'en', 'Д', 'ДЂ' ),
+                       array( 'en', 'Д', 'ДA' ),
+                       // 'Ʒ' should expand to 'Z ' (note space).
+                       array( 'fi', 'Z', 'Ʒ' ),
+                       // 'Þ' should expand to 'th'
+                       array( 'sv', 't', 'Þ' ),
+                       // Javanese is a limited use alphabet, so should have 3 
bytes
+                       // per character, so do some tests with it.
+                       array( 'en', 'ꦲ', 'ꦲꦤ' ),
+                       array( 'en', 'ꦲ', 'ꦲД' ),
+                       array( 'en', 'A', 'Aꦲ' ),
+               );
+       }
+       /**
+        * Opposite of testIsPrefix
+        *
+        * @dataProvider notPrefixDataProvider
+        */
+       function testNotIsPrefix( $lang, $base, $extended ) {
+               $cp = Collator::create( $lang );
+               $cp->setStrength( Collator::PRIMARY );
+               $baseBin = $cp->getSortKey( $base );
+               // Remove sortkey terminator
+               $baseBin = rtrim( $baseBin, "\0" );
+               $extendedBin = $cp->getSortKey( $extended );
+               $this->assertStringStartsNotWith( $baseBin, $extendedBin, 
"$base is a prefix of $extended" );
+       }
+
+       function notPrefixDataProvider() {
+               return array(
+                       array( 'en', 'A', 'B' ),
+                       array( 'en', 'AC', 'ABC' ),
+                       array( 'en', 'Z', 'Ʒ' ),
+                       array( 'en', 'A', 'ꦲ' ),
+               );
+       }
+
+       /**
+        * Test correct first letter is fetched.
+        *
+        * @param $collation String Collation name (aka uca-en)
+        * @param $string String String to get first letter of
+        * @param $firstLetter String Expected first letter.
+        *
+        * @dataProvider firstLetterProvider
+        */
+       function testGetFirstLetter( $collation, $string, $firstLetter ) {
+               $col = Collation::factory( $collation );
+               $this->assertEquals( $firstLetter, $col->getFirstLetter( 
$string ) );
+       }
+       function firstLetterProvider() {
+               return array(
+                       array( 'uppercase', 'Abc', 'A' ),
+                       array( 'uppercase', 'abc', 'A' ),
+                       array( 'identity', 'abc', 'a' ),
+                       array( 'uca-en', 'abc', 'A' ),
+                       array( 'uca-en', ' ', ' ' ),
+                       array( 'uca-en', 'Êveryone', 'E' ),
+                       array( 'uca-vi', 'Êveryone', 'Ê' ),
+                       // Make sure thorn is not a first letter.
+                       array( 'uca-sv', 'The', 'T' ),
+                       array( 'uca-sv', 'Å', 'Å' ),
+                       array( 'uca-hu', 'dzsdo', 'Dzs' ),
+                       array( 'uca-hu', 'dzdso', 'Dz' ),
+                       array( 'uca-hu', 'CSD', 'Cs' ),
+                       array( 'uca-root', 'CSD', 'C' ),
+                       array( 'uca-fi', 'Ǥ', 'G' ),
+                       array( 'uca-fi', 'Ŧ', 'T' ),
+                       array( 'uca-fi', 'Ʒ', 'Z' ),
+                       array( 'uca-fi', 'Ŋ', 'N' ),
+               );
+       }
+}

-- 
To view, visit https://gerrit.wikimedia.org/r/55503
To unsubscribe, visit https://gerrit.wikimedia.org/r/settings

Gerrit-MessageType: merged
Gerrit-Change-Id: I4bd3d39ec2938a53e2c6728adc48ee6cf9778d74
Gerrit-PatchSet: 8
Gerrit-Project: mediawiki/core
Gerrit-Branch: master
Gerrit-Owner: Brian Wolff <[email protected]>
Gerrit-Reviewer: Brian Wolff <[email protected]>
Gerrit-Reviewer: MarkAHershberger <[email protected]>
Gerrit-Reviewer: Matmarex <[email protected]>
Gerrit-Reviewer: Tim Starling <[email protected]>
Gerrit-Reviewer: jenkins-bot

_______________________________________________
MediaWiki-commits mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/mediawiki-commits

[MediaWiki-commits] [Gerrit] Remove first letters that have an overlapping prefix. - change (mediawiki/core)

Reply via email to