Tjones has uploaded a new change for review. ( https://gerrit.wikimedia.org/r/327364 )
Change subject: Move ambiguity detection into main TextCat module.
......................................................................

Move ambiguity detection into main TextCat module.

Move ambiguity detection (using results ratio and max returned languages) into the main TextCat module. These are still set with -u and -a in the driver, catus.php

Moved status messages to string constants in TextCat.php, from catus.php, stored in resultStatus variable.

Added test cases for status messages.

Changed -t in catus.php to -m. The Perl original has two separate flags, -t and -m, to control model size; catus.php only has one; -m is more mnemonic for "model size".

General syntax tidying, esp. in TextCatTest.php.

Update README file.

Bug: T153105
Change-Id: I8bc83ccd4bcf0f064f2de43ea0b6d732def9b53f
---
M README.md
M TextCat.php
M catus.php
M tests/TextCatTest.php
4 files changed, 185 insertions(+), 56 deletions(-)

  git pull ssh://gerrit.wikimedia.org:29418/wikimedia/textcat refs/changes/64/327364/1

diff --git a/README.md b/README.md
index a8d8be5..d7954a8 100644
--- a/README.md
+++ b/README.md
@@ -6,7 +6,9 @@
 This is a PHP port of the TextCat language guesser utility.

-Please see http://odur.let.rug.nl/~vannoord/TextCat/ for the original one.
+Please see http://odur.let.rug.nl/~vannoord/TextCat/ for the original Perl version.
+
+Please see https://github.com/Trey314159/TextCat for an updated Perl version.

 ## Contents

@@ -47,9 +49,11 @@
 Model names use [Wikipedia language codes](https://en.wikipedia.org/wiki/List_of_Wikipedias), which are often but not guaranteed to be the same as [ISO 639 language codes](https://en.wikipedia.org/wiki/ISO_639).

-When detecting languages, you will generally get better results when you can limit the number of language models in use. For example, if there is virtually no chance that your text could be in Irish Gaelic, including the Irish Gaelic language model (`ga`) only increases the likelihood of mis-identification. This is particularly true for closely related languages (e.g., the Romance languages, or English/`en` and Scots/`sco`).
+When detecting languages, you will generally get better results when you can limit the number of language models in use, especially with very short texts. For example, if there is virtually no chance that your text could be in Irish Gaelic, including the Irish Gaelic language model (`ga`) only increases the likelihood of mis-identification. This is particularly true for closely related languages (e.g., the Romance languages, or English/`en` and Scots/`sco`).

 Limiting the number of language models used also generally improves performance. You can copy your desired language models into a new directory (and use `-d` with `catus.php`) or specify your desired languages on the command line (use `-c` with `catus.php`).
+
+You can also combine models in multiple directories (e.g., to use the query-based models with a fallback to Wiki-Text-based models) with a comma-separated list of directories (use `-d` with `catus.php`). Directories are scanned in order, and only the first model found with a particular name will be used.

 ### Wiki-Text models

@@ -59,7 +63,7 @@
 These models have not been tested and are provided as-is. We may add new models or remove poorly-performing models in the future.

-These models have 4000 ngrams. The best number of ngrams to use for language identification is application-dependent. For larger texts (e.g., containing hundreds of words per sample), significantly smaller ngram sets may be best. You can set the number to be used by changing `$maxNgrams` in `TextCat.php` or in `felis.php`, or using `-t` with `catus.php`.
+These models have 4000 ngrams. The best number of ngrams to use for language identification is application-dependent. For larger texts (e.g., containing hundreds of words per sample), significantly smaller ngram sets may be best. You can set the number to be used by changing `$maxNgrams` in `TextCat.php` or in `felis.php`, or using `-m` with `catus.php`.

 ### Wiki Query Models.

@@ -69,7 +73,7 @@
 The final set of models provided is based in part on their performance on English Wikipedia queries (the first target for language ID using TextCat). For more details see our [initial report](https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/Language_Detection_with_TextCat) on TextCat. More languages will be added in the future based on additional performance evaluations.

-These models have 5000 ngrams. The best number of ngrams to use for language identification is application-dependent. For larger texts (e.g., containing hundreds of words per sample), significantly smaller ngram sets may be best. For short query seen on English Wikipedia strings, a model size of 3000 ngrams has worked best. You can set the number to be used by changing `$maxNgrams` in `TextCat.php` or in `felis.php`, or using `-t` with `catus.php`.
+These models have 5000 ngrams. The best number of ngrams to use for language identification is application-dependent. For larger texts (e.g., containing hundreds of words per sample), significantly smaller ngram sets may be best. For short query seen on English Wikipedia strings, a model size of 3000 ngrams has worked best. You can set the number to be used by changing `$maxNgrams` in `TextCat.php` or in `felis.php`, or using `-m` with `catus.php`.

 [](https://travis-ci.org/smalyshev/textcat)
diff --git a/TextCat.php b/TextCat.php
index 1ede010..1783fe6 100644
--- a/TextCat.php
+++ b/TextCat.php
@@ -6,6 +6,17 @@
  */
 class TextCat {

+    const statusTooShort = 'Input is too short.';
+    const statusNoMatch = 'No match found.';
+    const statusAmbiguous = 'Cannot determine language.';
+
+    /**
+     * Minimum input length to be considered for
+     * classification
+     * @var string
+     */
+    private $resultStatus = '';
+
     /**
      * Number of ngrams to be used.
      * @var int
@@ -33,11 +44,34 @@
     private $langFiles = array();

     /**
-     * Minimum Input Length to be considered for
+     * Minimum input length to be considered for
      * classification
      * @var int
      */
     private $minInputLength = 0;
+
+    /**
+     * Maximum ratio of the score between a given
+     * candidate and the best candidate for the
+     * given candidate to be considered an alternative.
+     * @var float
+     */
+    private $resultsRatio = 1.05;
+
+    /**
+     * Maximum number of languages to return, within
+     * the resultsRatio. If there are more, the result
+     * is too ambiguous.
+     * @var int
+     */
+    private $maxReturnedLanguages = 10;
+
+    /**
+     * @param
+     */
+    public function getResultStatus() {
+        return $this->resultStatus;
+    }

     /**
      * @param int $maxNgrams
@@ -58,6 +92,20 @@
      */
     public function setMinInputLength( $minInputLength ) {
         $this->minInputLength = $minInputLength;
+    }
+
+    /**
+     * @param float $resultsRatio
+     */
+    public function setResultsRatio( $resultsRatio ) {
+        $this->resultsRatio = $resultsRatio;
+    }
+
+    /**
+     * @param int $maxReturnedLanguages
+     */
+    public function setMaxReturnedLanguages( $maxReturnedLanguages ) {
+        $this->maxReturnedLanguages = $maxReturnedLanguages;
     }

     /**
@@ -170,10 +218,12 @@
      */
     public function classify( $text, $candidates = null ) {
         $results = array();
+        $this->resultStatus = '';

         // strip non-word characters before checking for min length, don't assess empty strings
         $wordLength = mb_strlen( preg_replace( "/[{$this->wordSeparator}]+/", "", $text ) );
         if ( $wordLength < $this->minInputLength || $wordLength == 0 ) {
+            $this->resultStatus = constant( 'TextCat::statusTooShort' );
             return $results;
         }

@@ -197,7 +247,25 @@
             }
             $results[$language] = $p;
         }
+
         asort( $results );
+
+        // ignore any item that scores higher than best * resultsRatio
+        $max = reset( $results ) * $this->resultsRatio;
+        $results = array_filter( $results, function ( $res ) use ( $max ) { return $res <= $max;
+        } );
+
+        // if more than maxReturnedLanguages remain, the result is too ambiguous, so bail
+        if ( count( $results ) > $this->maxReturnedLanguages ) {
+            $this->resultStatus = constant( 'TextCat::statusAmbiguous' );
+            return array();
+        }
+
+        if ( count( $results ) == 0 ) {
+            $this->resultStatus = constant( 'TextCat::statusNoMatch' );
+            return $results;
+        }
+
         return $results;
     }
 }
diff --git a/catus.php b/catus.php
index 7be34ef..2a2cfa9 100644
--- a/catus.php
+++ b/catus.php
@@ -4,11 +4,11 @@
  */
 require_once __DIR__.'/TextCat.php';

-$options = getopt( 'a:c:d:f:j:l:t:u:h' );
+$options = getopt( 'a:c:d:f:j:l:m:u:h' );

 if ( isset( $options['h'] ) ) {
     $help = <<<HELP
-{$argv[0]} [-d Dir] [-c Lang] [-a Int] [-f Int] [-j Int] [-l Text] [-t Int] [-u Float]
+{$argv[0]} [-d Dir] [-c Lang] [-a Int] [-f Int] [-j Int] [-l Text] [-m Int] [-u Float]

     -a NUM  The program returns the best-scoring language together
             with all languages which are <N times worse (set by option -u).
@@ -32,7 +32,7 @@
     -l TEXT Indicates that input is given as an argument on the command line,
             e.g. {$argv[0]} -l "this is english text"
             If this option is not given, the input is stdin.
-    -t NUM  Indicates the topmost number of ngrams that should be used.
+    -m NUM  Indicates the topmost number of ngrams that should be used.
             Default: 3000
     -u NUM  Determines how much worse result must be in order not to be
             mentioned as an alternative. Typical value: 1.05 or 1.1.
@@ -51,8 +51,8 @@
 $cat = new TextCat( $dirs );

-if ( !empty( $options['t'] ) ) {
-    $cat->setMaxNgrams( intval( $options['t'] ) );
+if ( !empty( $options['m'] ) ) {
+    $cat->setMaxNgrams( intval( $options['m'] ) );
 }
 if ( !empty( $options['f'] ) ) {
     $cat->setMinFreq( intval( $options['f'] ) );
@@ -60,6 +60,13 @@
 if ( isset( $options['j'] ) ) {
     $cat->setMinInputLength( intval( $options['j'] ) );
 }
+if ( !empty( $options['u'] ) ) {
+    $cat->setResultsRatio( floatval( $options['u'] ) );
+}
+if ( isset( $options['a'] ) ) {
+    $cat->setMaxReturnedLanguages( intval( $options['a'] ) );
+}
+
 $input = isset( $options['l'] ) ? $options['l'] : file_get_contents( "php://stdin" );
 if ( !empty( $options['c'] ) ) {
@@ -69,28 +76,9 @@
 }

 if ( empty( $result ) ) {
-    echo "No match found.\n";
+    echo $cat->getResultStatus() . "\n";
     exit( 1 );
 }

-if ( !empty( $options['u'] ) ) {
-    $max = reset( $result ) * $options['u'];
-} else {
-    $max = reset( $result ) * 1.05;
-}
-
-if ( !empty( $options['a'] ) ) {
-    $top = $options['a'];
-} else {
-    $top = 10;
-}
-$result = array_filter( $result, function ( $res ) use( $max ) { return $res < $max;
-
-} );
-if ( $result && count( $result ) <= $top ) {
-    echo join( " OR ", array_keys( $result ) ) . "\n";
-    exit( 0 );
-} else {
-    echo "Cannot determine language.\n";
-    exit( 1 );
-}
+echo join( " OR ", array_keys( $result ) ) . "\n";
+exit( 0 );
diff --git a/tests/TextCatTest.php b/tests/TextCatTest.php
index c6ce0e7..24bd830 100644
--- a/tests/TextCatTest.php
+++ b/tests/TextCatTest.php
@@ -11,10 +11,20 @@
     public function setUp()
     {
-        // initialze testcat with a string, and multicats with arrays
+        // initialize testcat with a string
         $this->testcat = new TextCat( __DIR__."/data/Models" );
-        $this->multicat1 = new TextCat( array(__DIR__."/../LM", __DIR__."/../LM-query" ) );
-        $this->multicat2 = new TextCat( array(__DIR__."/../LM-query", __DIR__."/../LM" ) );
+
+        // initialize multicats with multi-element arrays
+        $this->multicat1 = new TextCat( array( __DIR__."/../LM", __DIR__."/../LM-query" ) );
+        $this->multicat2 = new TextCat( array( __DIR__."/../LM-query", __DIR__."/../LM" ) );
+
+        // effectively disable RR-based filtering for these cats
+        $this->multicat1->setResultsRatio( 100 );
+        $this->multicat2->setResultsRatio( 100 );
+
+        // initialize ambiguouscat with a one-element array
+        $this->ambiguouscat = new TextCat( array( __DIR__."/../LM-query" ) );
+
     }

     public function testCreateLM()
@@ -57,6 +67,15 @@
         $this->assertEquals( $result, $lm );
     }

+    public function testNoMatchFound()
+    {
+        # no xxx.lm model exists
+        $this->assertEquals( array_keys( $this->testcat->classify( "some string", array( "xxx" ) ) ),
+            array() );
+        $this->assertEquals( $this->testcat->getResultStatus(), constant( 'TextCat::statusNoMatch' ) );
+    }
+
+
     public function getTexts()
     {
         $indir = __DIR__."/data/ShortTexts";
@@ -66,7 +85,7 @@
             if ( !$file->isFile() || $file->getExtension() != "txt" ) {
                 continue;
             }
-            $data[] = array( $file->getPathname(), $outdir . "/" . $file->getBasename(".txt") . ".lm" );
+            $data[] = array( $file->getPathname(), $outdir . "/" . $file->getBasename( ".txt" ) . ".lm" );
         }
         return $data;
     }
@@ -81,7 +100,7 @@
         include $lmFile;
         $this->assertEquals(
             $ngrams,
-            $this->testcat->createLM( file_get_contents( $textFile ), 4000)
+            $this->testcat->createLM( file_get_contents( $textFile ), 4000 )
         );
     }

@@ -106,21 +125,21 @@
     public function multiCatData()
     {
         return array(
-            array('this is english text français bisschen',
-                array('sco', 'en', 'fr', 'de' ),
-                array('fr', 'de', 'sco', 'en' ), ),
-            array('الاسم العلمي: Felis catu',
-                array('ar', 'la', 'fa', 'fr' ),
-                array('ar', 'fr', 'la', 'fa' ), ),
-            array('Кошка, или домашняя кошка A macska más néven házi macska',
-                array('ru', 'uk', 'hu', 'fi' ),
-                array('hu', 'ru', 'uk', 'fi' ), ),
-            array('Il gatto domestico Kucing disebut juga kucing domestik',
-                array('id', 'it', 'pt', 'es' ),
-                array('it', 'id', 'es', 'pt' ), ),
-            array('Domaća mačka Pisică de casă Hejma kato',
-                array('hr', 'ro', 'eo', 'cs' ),
-                array('hr', 'cs', 'ro', 'eo' ), ),
+            array( 'this is english text français bisschen',
+                array( 'sco', 'en', 'fr', 'de' ),
+                array( 'fr', 'de', 'sco', 'en' ), ),
+            array( 'الاسم العلمي: Felis catu',
+                array( 'ar', 'la', 'fa', 'fr' ),
+                array( 'ar', 'fr', 'la', 'fa' ), ),
+            array( 'Кошка, или домашняя кошка A macska más néven házi macska',
+                array( 'ru', 'uk', 'hu', 'fi' ),
+                array( 'hu', 'ru', 'uk', 'fi' ), ),
+            array( 'Il gatto domestico Kucing disebut juga kucing domestik',
+                array( 'id', 'it', 'pt', 'es' ),
+                array( 'it', 'id', 'es', 'pt' ), ),
+            array( 'Domaća mačka Pisică de casă Hejma kato',
+                array( 'hr', 'ro', 'eo', 'cs' ),
+                array( 'hr', 'cs', 'ro', 'eo' ), ),
         );
     }

@@ -165,13 +184,63 @@
         if ( !isset( $res ) ) {
             $res = $lang;
         }
-        # should get results when min input len is 0
-        $minLength = $this->testcat->setMinInputLength(0);
+
+        // disable RR-based filtering
+        $this->testcat->setResultsRatio( 100 );
+
+        // should get results when min input len is 0
+        $this->testcat->setMinInputLength( 0 );
         $this->assertEquals( array_keys( $this->testcat->classify( $testLine, $res ) ),
             array_values( $res ) );
-        # should get no results when min input len is more than the length of the string
-        $minLength = $this->testcat->setMinInputLength(mb_strlen($testLine) + 1);
+        if ( !empty( $res ) ) {
+            $this->assertEquals( $this->testcat->getResultStatus(), '' );
+        }
+
+        // should get no results when min input len is more than the length of the string
+        $this->testcat->setMinInputLength( mb_strlen( $testLine ) + 1 );
         $this->assertEquals( array_keys( $this->testcat->classify( $testLine, $res ) ),
             array() );
+        $this->assertEquals( $this->testcat->getResultStatus(), constant( 'TextCat::statusTooShort' ) );
+
+        // reset to defaults
+        $this->testcat->setMinInputLength( 0 );
+        $this->testcat->setResultsRatio( 1.05 );
     }
+
+    public function ambiguityData()
+    {
+        return array(
+            array( 'espanol português', 1.05, 10, 3000, array( 'pt' ), '' ),
+            array( 'espanol português', 1.20, 10, 3000, array( 'pt', 'es' ), '' ),
+            array( 'espanol português', 1.20, 2, 3000, array( 'pt', 'es' ), '' ),
+            array( 'espanol português', 1.20, 1, 3000, array(), constant( 'TextCat::statusAmbiguous' ) ),
+            array( 'espanol português', 1.30, 10, 3000, array( 'pt', 'es', 'fr', 'it', 'en', 'pl' ), '' ),
+            array( 'espanol português', 1.30, 6, 3000, array( 'pt', 'es', 'fr', 'it', 'en', 'pl' ), '' ),
+            array( 'espanol português', 1.30, 5, 3000, array(), constant( 'TextCat::statusAmbiguous' ) ),
+            array( 'espanol português', 1.10, 20, 500,
+                array( 'pt', 'es', 'it', 'fr', 'pl', 'cs', 'en', 'sv', 'de', 'id', 'nl' ), '' ),
+            array( 'espanol português', 1.10, 20, 700, array( 'pt', 'es', 'it', 'fr', 'en', 'de' ), '' ),
+            array( 'espanol português', 1.10, 20, 1000, array( 'pt', 'es', 'it', 'fr' ), '' ),
+            array( 'espanol português', 1.10, 20, 2000, array( 'pt', 'es' ), '' ),
+            array( 'espanol português', 1.10, 20, 3000, array( 'pt' ), '' ),
+        );
+    }
+
+    /**
+     * @dataProvider ambiguityData
+     * @param string $testLine
+     * @param array $lang
+     * @param array $res
+     */
+    public function testAmbiguity( $testLine, $resRatio, $maxRetLang, $modelSize, $results, $errMsg )
+    {
+        $this->ambiguouscat->setMaxNgrams( $modelSize );
+        $this->ambiguouscat->setResultsRatio( $resRatio );
+        $this->ambiguouscat->setMaxReturnedLanguages( $maxRetLang );
+
+        $this->assertEquals( array_keys( $this->ambiguouscat->classify( $testLine ) ),
+            array_values( $results ) );
+        $this->assertEquals( $this->ambiguouscat->getResultStatus(), $errMsg );
+    }
+
 }

--
To view, visit https://gerrit.wikimedia.org/r/327364
To unsubscribe, visit https://gerrit.wikimedia.org/r/settings

Gerrit-MessageType: newchange
Gerrit-Change-Id: I8bc83ccd4bcf0f064f2de43ea0b6d732def9b53f
Gerrit-PatchSet: 1
Gerrit-Project: wikimedia/textcat
Gerrit-Branch: master
Gerrit-Owner: Tjones <[email protected]>

_______________________________________________
MediaWiki-commits mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/mediawiki-commits
