Tjones has uploaded a new change for review. ( https://gerrit.wikimedia.org/r/327364 )

Change subject: Move ambiguity detection into main TextCat module.
......................................................................

Move ambiguity detection into main TextCat module.

Move ambiguity detection (using results ratio and max returned languages)
into the main TextCat module. These are still set with -u and -a in the
driver, catus.php.

Moved status messages from catus.php to string constants in TextCat.php. The
status of the most recent classify() call is stored in the resultStatus
variable. Added test cases for status messages.
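
For reviewers, the filtering and status logic described above can be
sketched as a standalone function. This is only an illustration under
stated assumptions: filterCandidates() and the STATUS_* constants are
hypothetical names, the scores are made up, and the real code lives in
TextCat::classify() and reports its status through getResultStatus().
Lower scores are better, as in the patch.

```php
<?php
// Illustrative sketch only. filterCandidates() and the STATUS_* constants
// are hypothetical; they mirror, but are not part of, TextCat.php.
const STATUS_NO_MATCH = 'No match found.';
const STATUS_AMBIGUOUS = 'Cannot determine language.';

/**
 * @param array $scores language code => score (lower is better)
 * @param float $resultsRatio keep candidates scoring <= best * ratio
 * @param int $maxReturnedLanguages more survivors than this is ambiguous
 * @return array two elements: surviving results, then a status string
 */
function filterCandidates( array $scores, $resultsRatio = 1.05, $maxReturnedLanguages = 10 ) {
	// Unlike the patch, check for an empty input up front.
	if ( count( $scores ) === 0 ) {
		return array( array(), STATUS_NO_MATCH );
	}

	// Sort ascending and keep keys, so the best candidate comes first.
	asort( $scores );
	$max = reset( $scores ) * $resultsRatio;

	// Drop every candidate that scores worse than best * resultsRatio.
	$kept = array_filter( $scores, function ( $score ) use ( $max ) {
		return $score <= $max;
	} );

	// Too many near-ties: report ambiguity instead of guessing.
	if ( count( $kept ) > $maxReturnedLanguages ) {
		return array( array(), STATUS_AMBIGUOUS );
	}

	return array( $kept, '' );
}

// Made-up scores for illustration.
$scores = array( 'es' => 1100, 'pt' => 1000, 'fr' => 1400 );
list( $langs, $status ) = filterCandidates( $scores, 1.2, 2 );
echo implode( ' OR ', array_keys( $langs ) ), "\n"; // prints "pt OR es"
```

With the same scores and a maximum of 1 returned language, the sketch
returns an empty result and the ambiguous status, which matches the
behaviour the new ambiguityData test cases check for.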

Changed -t in catus.php to -m. The Perl original has two separate flags,
-t and -m, to control model size; catus.php only has one; -m is more
mnemonic for "model size".

General syntax tidying, esp. in TextCatTest.php.

Update README file.

Bug: T153105
Change-Id: I8bc83ccd4bcf0f064f2de43ea0b6d732def9b53f
---
M README.md
M TextCat.php
M catus.php
M tests/TextCatTest.php
4 files changed, 185 insertions(+), 56 deletions(-)


  git pull ssh://gerrit.wikimedia.org:29418/wikimedia/textcat refs/changes/64/327364/1

diff --git a/README.md b/README.md
index a8d8be5..d7954a8 100644
--- a/README.md
+++ b/README.md
@@ -6,7 +6,9 @@
 
 This is a PHP port of the TextCat language guesser utility.
 
-Please see http://odur.let.rug.nl/~vannoord/TextCat/ for the original one.
+Please see http://odur.let.rug.nl/~vannoord/TextCat/ for the original Perl version.
+
+Please see https://github.com/Trey314159/TextCat for an updated Perl version.
 
 ## Contents
 
@@ -47,9 +49,11 @@
 
 Model names use [Wikipedia language codes](https://en.wikipedia.org/wiki/List_of_Wikipedias), which are often but not guaranteed to be the same as [ISO 639 language codes](https://en.wikipedia.org/wiki/ISO_639).
 
-When detecting languages, you will generally get better results when you can limit the number of language models in use. For example, if there is virtually no chance that your text could be in Irish Gaelic, including the Irish Gaelic language model (`ga`) only increases the likelihood of mis-identification. This is particularly true for closely related languages (e.g., the Romance languages, or English/`en` and Scots/`sco`).
+When detecting languages, you will generally get better results when you can limit the number of language models in use, especially with very short texts. For example, if there is virtually no chance that your text could be in Irish Gaelic, including the Irish Gaelic language model (`ga`) only increases the likelihood of mis-identification. This is particularly true for closely related languages (e.g., the Romance languages, or English/`en` and Scots/`sco`).
 
 Limiting the number of language models used also generally improves performance. You can copy your desired language models into a new directory (and use `-d` with `catus.php`) or specify your desired languages on the command line (use `-c` with `catus.php`).
+
+You can also combine models in multiple directories (e.g., to use the query-based models with a fallback to Wiki-Text-based models) with a comma-separated list of directories (use `-d` with `catus.php`). Directories are scanned in order, and only the first model found with a particular name will be used.
 
 ### Wiki-Text models
 
@@ -59,7 +63,7 @@
 
 These models have not been tested and are provided as-is. We may add new models or remove poorly-performing models in the future.
 
-These models have 4000 ngrams. The best number of ngrams to use for language identification is application-dependent. For larger texts (e.g., containing hundreds of words per sample), significantly smaller ngram sets may be best. You can set the number to be used by changing `$maxNgrams` in `TextCat.php` or in `felis.php`, or using `-t` with `catus.php`.
+These models have 4000 ngrams. The best number of ngrams to use for language identification is application-dependent. For larger texts (e.g., containing hundreds of words per sample), significantly smaller ngram sets may be best. You can set the number to be used by changing `$maxNgrams` in `TextCat.php` or in `felis.php`, or using `-m` with `catus.php`.
 
 ### Wiki Query Models.
 
@@ -69,7 +73,7 @@
 
 The final set of models provided is based in part on their performance on English Wikipedia queries (the first target for language ID using TextCat). For more details see our [initial report](https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/Language_Detection_with_TextCat) on TextCat. More languages will be added in the future based on additional performance evaluations.
 
-These models have 5000 ngrams. The best number of ngrams to use for language identification is application-dependent. For larger texts (e.g., containing hundreds of words per sample), significantly smaller ngram sets may be best. For short query seen on English Wikipedia strings, a model size of 3000 ngrams has worked best. You can set the number to be used by changing `$maxNgrams` in `TextCat.php` or in `felis.php`, or using `-t` with `catus.php`.
+These models have 5000 ngrams. The best number of ngrams to use for language identification is application-dependent. For larger texts (e.g., containing hundreds of words per sample), significantly smaller ngram sets may be best. For short query strings seen on English Wikipedia, a model size of 3000 ngrams has worked best. You can set the number to be used by changing `$maxNgrams` in `TextCat.php` or in `felis.php`, or using `-m` with `catus.php`.
 
 
 [![Build Status](https://travis-ci.org/smalyshev/textcat.svg?branch=master)](https://travis-ci.org/smalyshev/textcat)
diff --git a/TextCat.php b/TextCat.php
index 1ede010..1783fe6 100644
--- a/TextCat.php
+++ b/TextCat.php
@@ -6,6 +6,17 @@
  */
 class TextCat {
 
+       const statusTooShort = 'Input is too short.';
+       const statusNoMatch = 'No match found.';
+       const statusAmbiguous = 'Cannot determine language.';
+
+       /**
+        * Status of the most recent classify() call;
+        * empty string if results were returned.
+        * @var string
+        */
+       private $resultStatus = '';
+
        /**
         * Number of ngrams to be used.
         * @var int
@@ -33,11 +44,34 @@
        private $langFiles = array();
 
        /**
-        * Minimum Input Length to be considered for
+        * Minimum input length to be considered for
         * classification
         * @var int
         */
        private $minInputLength = 0;
+
+       /**
+        * Maximum ratio of the score between a given
+        * candidate and the best candidate for the
+        * given candidate to be considered an alternative.
+        * @var float
+        */
+       private $resultsRatio = 1.05;
+
+       /**
+        * Maximum number of languages to return, within
+        * the resultsRatio. If there are more, the result
+        * is too ambiguous.
+        * @var int
+        */
+       private $maxReturnedLanguages = 10;
+
+       /**
+        * @return string
+        */
+       public function getResultStatus() {
+               return $this->resultStatus;
+       }
 
        /**
         * @param int $maxNgrams
@@ -58,6 +92,20 @@
         */
        public function setMinInputLength( $minInputLength ) {
                $this->minInputLength = $minInputLength;
+       }
+
+       /**
+        * @param float $resultsRatio
+        */
+       public function setResultsRatio( $resultsRatio ) {
+               $this->resultsRatio = $resultsRatio;
+       }
+
+       /**
+        * @param int $maxReturnedLanguages
+        */
+       public function setMaxReturnedLanguages( $maxReturnedLanguages ) {
+               $this->maxReturnedLanguages = $maxReturnedLanguages;
        }
 
        /**
@@ -170,10 +218,12 @@
         */
        public function classify( $text, $candidates = null ) {
                $results = array();
+               $this->resultStatus = '';
 
                 // strip non-word characters before checking for min length, don't assess empty strings
                 $wordLength = mb_strlen( preg_replace( "/[{$this->wordSeparator}]+/", "", $text ) );
                 if ( $wordLength < $this->minInputLength || $wordLength == 0 ) {
+                       $this->resultStatus = constant( 'TextCat::statusTooShort' );
                        return $results;
                }
 
@@ -197,7 +247,25 @@
                        }
                        $results[$language] = $p;
                }
+
                 asort( $results );
+
+               // ignore any item that scores higher than best * resultsRatio
+               $max = reset( $results ) * $this->resultsRatio;
+               $results = array_filter( $results, function ( $res ) use ( $max ) { return $res <= $max;
+               } );
+
+               // if more than maxReturnedLanguages remain, the result is too ambiguous, so bail
+               if ( count( $results ) > $this->maxReturnedLanguages ) {
+                       $this->resultStatus = constant( 'TextCat::statusAmbiguous' );
+                       return array();
+               }
+
+               if ( count( $results ) == 0 ) {
+                       $this->resultStatus = constant( 'TextCat::statusNoMatch' );
+                       return $results;
+               }
+
                return $results;
        }
 }
diff --git a/catus.php b/catus.php
index 7be34ef..2a2cfa9 100644
--- a/catus.php
+++ b/catus.php
@@ -4,11 +4,11 @@
  */
 require_once __DIR__.'/TextCat.php';
 
-$options = getopt( 'a:c:d:f:j:l:t:u:h' );
+$options = getopt( 'a:c:d:f:j:l:m:u:h' );
 
 if ( isset( $options['h'] ) ) {
        $help = <<<HELP
-{$argv[0]} [-d Dir] [-c Lang] [-a Int] [-f Int] [-j Int] [-l Text] [-t Int] [-u Float]
+{$argv[0]} [-d Dir] [-c Lang] [-a Int] [-f Int] [-j Int] [-l Text] [-m Int] [-u Float]
 
     -a NUM  The program returns the best-scoring language together
             with all languages which are <N times worse (set by option -u).
@@ -32,7 +32,7 @@
     -l TEXT Indicates that input is given as an argument on the command line,
             e.g. {$argv[0]} -l "this is english text"
             If this option is not given, the input is stdin.
-    -t NUM  Indicates the topmost number of ngrams that should be used.
+    -m NUM  Indicates the topmost number of ngrams that should be used.
             Default: 3000
     -u NUM  Determines how much worse result must be in order not to be
             mentioned as an alternative. Typical value: 1.05 or 1.1.
@@ -51,8 +51,8 @@
 
 $cat = new TextCat( $dirs );
 
-if ( !empty( $options['t'] ) ) {
-       $cat->setMaxNgrams( intval( $options['t'] ) );
+if ( !empty( $options['m'] ) ) {
+       $cat->setMaxNgrams( intval( $options['m'] ) );
 }
 if ( !empty( $options['f'] ) ) {
        $cat->setMinFreq( intval( $options['f'] ) );
@@ -60,6 +60,13 @@
 if ( isset( $options['j'] ) ) {
        $cat->setMinInputLength( intval( $options['j'] ) );
 }
+if ( !empty( $options['u'] ) ) {
+       $cat->setResultsRatio( floatval( $options['u'] ) );
+}
+if ( isset( $options['a'] ) ) {
+       $cat->setMaxReturnedLanguages( intval( $options['a'] ) );
+}
+
 
 $input = isset( $options['l'] ) ? $options['l'] : file_get_contents( "php://stdin" );
 if ( !empty( $options['c'] ) ) {
@@ -69,28 +76,9 @@
 }
 
 if ( empty( $result ) ) {
-       echo "No match found.\n";
+       echo $cat->getResultStatus() . "\n";
        exit( 1 );
 }
 
-if ( !empty( $options['u'] ) ) {
-       $max = reset( $result ) * $options['u'];
-} else {
-       $max = reset( $result ) * 1.05;
-}
-
-if ( !empty( $options['a'] ) ) {
-       $top = $options['a'];
-} else {
-       $top = 10;
-}
-$result = array_filter( $result, function ( $res ) use( $max ) { return $res < $max;
-
-} );
-if ( $result && count( $result ) <= $top ) {
-       echo join( " OR ", array_keys( $result ) ) . "\n";
-       exit( 0 );
-} else {
-       echo "Cannot determine language.\n";
-       exit( 1 );
-}
+echo join( " OR ", array_keys( $result ) ) . "\n";
+exit( 0 );
diff --git a/tests/TextCatTest.php b/tests/TextCatTest.php
index c6ce0e7..24bd830 100644
--- a/tests/TextCatTest.php
+++ b/tests/TextCatTest.php
@@ -11,10 +11,20 @@
 
        public function setUp()
        {
-               // initialze testcat with a string, and multicats with arrays
+               // initialize testcat with a string
                 $this->testcat = new TextCat( __DIR__."/data/Models" );
-               $this->multicat1 = new TextCat( array(__DIR__."/../LM", __DIR__."/../LM-query" ) );
-               $this->multicat2 = new TextCat( array(__DIR__."/../LM-query", __DIR__."/../LM" ) );
+
+               // initialize multicats with multi-element arrays
+               $this->multicat1 = new TextCat( array( __DIR__."/../LM", __DIR__."/../LM-query" ) );
+               $this->multicat2 = new TextCat( array( __DIR__."/../LM-query", __DIR__."/../LM" ) );
+
+               // effectively disable RR-based filtering for these cats
+               $this->multicat1->setResultsRatio( 100 );
+               $this->multicat2->setResultsRatio( 100 );
+
+               // initialize ambiguouscat with a one-element array
+               $this->ambiguouscat = new TextCat( array( __DIR__."/../LM-query" ) );
+
        }
 
        public function testCreateLM()
@@ -57,6 +67,15 @@
                $this->assertEquals( $result, $lm );
        }
 
+       public function testNoMatchFound()
+       {
+               # no xxx.lm model exists
+        $this->assertEquals( array_keys( $this->testcat->classify( "some string", array( "xxx" ) ) ),
+                                                        array() );
+               $this->assertEquals( $this->testcat->getResultStatus(), constant( 'TextCat::statusNoMatch' ) );
+       }
+
+
        public function getTexts()
        {
                $indir = __DIR__."/data/ShortTexts";
@@ -66,7 +85,7 @@
                         if ( !$file->isFile() || $file->getExtension() != "txt" ) {
                                 continue;
                         }
-                       $data[] = array( $file->getPathname(), $outdir . "/" . $file->getBasename(".txt") . ".lm" );
+                       $data[] = array( $file->getPathname(), $outdir . "/" . $file->getBasename( ".txt" ) . ".lm" );
                }
                return $data;
        }
@@ -81,7 +100,7 @@
                include $lmFile;
                $this->assertEquals(
                                 $ngrams,
-                               $this->testcat->createLM( file_get_contents( $textFile ), 4000)
+                               $this->testcat->createLM( file_get_contents( $textFile ), 4000 )
                );
        }
 
@@ -106,21 +125,21 @@
     public function multiCatData()
     {
         return array(
-          array('this is english text français bisschen',
-                               array('sco', 'en', 'fr',  'de' ),
-                               array('fr',  'de', 'sco', 'en' ), ),
-          array('الاسم العلمي: Felis catu',
-                               array('ar', 'la', 'fa', 'fr' ),
-                               array('ar', 'fr', 'la', 'fa' ), ),
-          array('Кошка, или домашняя кошка A macska más néven házi macska',
-                               array('ru', 'uk', 'hu', 'fi' ),
-                               array('hu', 'ru', 'uk', 'fi' ), ),
-          array('Il gatto domestico Kucing disebut juga kucing domestik',
-                               array('id', 'it', 'pt', 'es' ),
-                               array('it', 'id', 'es', 'pt' ), ),
-          array('Domaća mačka Pisică de casă Hejma kato',
-                               array('hr', 'ro', 'eo', 'cs' ),
-                               array('hr', 'cs', 'ro', 'eo' ), ),
+          array( 'this is english text français bisschen',
+                               array( 'sco', 'en', 'fr',  'de' ),
+                               array( 'fr',  'de', 'sco', 'en' ), ),
+          array( 'الاسم العلمي: Felis catu',
+                               array( 'ar', 'la', 'fa', 'fr' ),
+                               array( 'ar', 'fr', 'la', 'fa' ), ),
+          array( 'Кошка, или домашняя кошка A macska más néven házi macska',
+                               array( 'ru', 'uk', 'hu', 'fi' ),
+                               array( 'hu', 'ru', 'uk', 'fi' ), ),
+          array( 'Il gatto domestico Kucing disebut juga kucing domestik',
+                               array( 'id', 'it', 'pt', 'es' ),
+                               array( 'it', 'id', 'es', 'pt' ), ),
+          array( 'Domaća mačka Pisică de casă Hejma kato',
+                               array( 'hr', 'ro', 'eo', 'cs' ),
+                               array( 'hr', 'cs', 'ro', 'eo' ), ),
         );
     }
 
@@ -165,13 +184,63 @@
                if ( !isset( $res ) ) {
                        $res = $lang;
                }
-               # should get results when min input len is 0
-               $minLength = $this->testcat->setMinInputLength(0);
+
+               // disable RR-based filtering
+               $this->testcat->setResultsRatio( 100 );
+
+               // should get results when min input len is 0
+               $this->testcat->setMinInputLength( 0 );
                 $this->assertEquals( array_keys( $this->testcat->classify( $testLine, $res ) ),
                                                          array_values( $res ) );
-        # should get no results when min input len is more than the length of the string
-        $minLength = $this->testcat->setMinInputLength(mb_strlen($testLine) + 1);
+               if ( !empty( $res ) ) {
+                       $this->assertEquals( $this->testcat->getResultStatus(), '' );
+               }
+
+        // should get no results when min input len is more than the length of the string
+        $this->testcat->setMinInputLength( mb_strlen( $testLine ) + 1 );
          $this->assertEquals( array_keys( $this->testcat->classify( $testLine, $res ) ),
                               array() );
+               $this->assertEquals( $this->testcat->getResultStatus(), constant( 'TextCat::statusTooShort' ) );
+
+               // reset to defaults
+               $this->testcat->setMinInputLength( 0 );
+               $this->testcat->setResultsRatio( 1.05 );
     }
+
+    public function ambiguityData()
+    {
+        return array(
+          array( 'espanol português', 1.05, 10, 3000, array( 'pt' ), '' ),
+          array( 'espanol português', 1.20, 10, 3000, array( 'pt', 'es' ), '' ),
+          array( 'espanol português', 1.20,  2, 3000, array( 'pt', 'es' ), '' ),
+          array( 'espanol português', 1.20,  1, 3000, array(), constant( 'TextCat::statusAmbiguous' ) ),
+          array( 'espanol português', 1.30, 10, 3000, array( 'pt', 'es', 'fr', 'it', 'en', 'pl' ), '' ),
+          array( 'espanol português', 1.30,  6, 3000, array( 'pt', 'es', 'fr', 'it', 'en', 'pl' ), '' ),
+          array( 'espanol português', 1.30,  5, 3000, array(), constant( 'TextCat::statusAmbiguous' ) ),
+          array( 'espanol português', 1.10, 20,  500,
+                       array( 'pt', 'es', 'it', 'fr', 'pl', 'cs', 'en', 'sv', 'de', 'id', 'nl' ), '' ),
+          array( 'espanol português', 1.10, 20,  700, array( 'pt', 'es', 'it', 'fr', 'en', 'de' ), '' ),
+          array( 'espanol português', 1.10, 20, 1000, array( 'pt', 'es', 'it', 'fr' ), '' ),
+          array( 'espanol português', 1.10, 20, 2000, array( 'pt', 'es' ), '' ),
+          array( 'espanol português', 1.10, 20, 3000, array( 'pt' ), '' ),
+        );
+    }
+
+    /**
+     * @dataProvider ambiguityData
+        * @param string $testLine
+        * @param array $lang
+        * @param array $res
+     */
+    public function testAmbiguity( $testLine, $resRatio, $maxRetLang, $modelSize, $results, $errMsg )
+    {
+               $this->ambiguouscat->setMaxNgrams( $modelSize );
+               $this->ambiguouscat->setResultsRatio( $resRatio );
+               $this->ambiguouscat->setMaxReturnedLanguages( $maxRetLang );
+
+               $this->assertEquals( array_keys( $this->ambiguouscat->classify( $testLine ) ),
+                                                        array_values( $results ) );
+               $this->assertEquals( $this->ambiguouscat->getResultStatus(), $errMsg );
+    }
+
 }

-- 
To view, visit https://gerrit.wikimedia.org/r/327364
To unsubscribe, visit https://gerrit.wikimedia.org/r/settings

Gerrit-MessageType: newchange
Gerrit-Change-Id: I8bc83ccd4bcf0f064f2de43ea0b6d732def9b53f
Gerrit-PatchSet: 1
Gerrit-Project: wikimedia/textcat
Gerrit-Branch: master
Gerrit-Owner: Tjones <[email protected]>
