Tjones has uploaded a new change for review. ( https://gerrit.wikimedia.org/r/365251 )

Change subject: Configure Japanese Language Analysis with Kuromoji
......................................................................

Configure Japanese Language Analysis with Kuromoji

With community input, it was decided that the Kuromoji language analyzer
should not be deployed. However, if it ever were deployed, this is the
baseline configuration that I would recommend.

It works around two problems with Kuromoji:
 - inconsistent treatment of fullwidth numbers
 - many non-Japanese, non-Latin words are not indexed
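
As a rough illustration of the first fix, the sketch below emulates in plain
PHP what the fullwidthnumfix mapping char filter is meant to do before
tokenization. It is not part of this change: foldFullwidthDigits() is a
hypothetical helper, and it uses PHP 7's "\u{ff10}" escapes so that PHP itself
produces the fullwidth characters, whereas the patch passes "\uff10"-style
strings through to the mapping filter.

```php
<?php
// Hypothetical helper for illustration only; not part of CirrusSearch.
// Replaces FULLWIDTH DIGIT ZERO..NINE (U+FF10..U+FF19) with ASCII 0..9,
// mirroring the ten rules in the 'fullwidthnumfix' mappings.
function foldFullwidthDigits( string $text ): string {
    $map = [
        "\u{ff10}" => '0', "\u{ff11}" => '1', "\u{ff12}" => '2', "\u{ff13}" => '3',
        "\u{ff14}" => '4', "\u{ff15}" => '5', "\u{ff16}" => '6', "\u{ff17}" => '7',
        "\u{ff18}" => '8', "\u{ff19}" => '9',
    ];
    // strtr() with an array replaces each key wherever it occurs; the keys
    // are multi-byte UTF-8 sequences, so other characters are left untouched.
    return strtr( $text, $map );
}

echo foldFullwidthDigits( "２０１７年7月" ), "\n"; // prints "2017年7月"
```

With the digits folded up front, fullwidth and ASCII forms of the same number
reach the tokenizer as identical input.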

Incidentally, reformat italian_elision so that it takes up less vertical
space.

Bug: T166731
Change-Id: I133cdc9affa3ed308a46a87892e069cd7461848e
---
M includes/Maintenance/AnalysisConfigBuilder.php
1 file changed, 32 insertions(+), 22 deletions(-)


  git pull ssh://gerrit.wikimedia.org:29418/mediawiki/extensions/CirrusSearch refs/changes/51/365251/1

diff --git a/includes/Maintenance/AnalysisConfigBuilder.php b/includes/Maintenance/AnalysisConfigBuilder.php
index 86bbf83..43519bb 100644
--- a/includes/Maintenance/AnalysisConfigBuilder.php
+++ b/includes/Maintenance/AnalysisConfigBuilder.php
@@ -716,27 +716,9 @@
                        $config[ 'filter' ][ 'italian_elision' ] = [
                                'type' => 'elision',
                                'articles' => [
-                                       'c',
-                                       'l',
-                                       'all',
-                                       'dall',
-                                       'dell',
-                                       'nell',
-                                       'sull',
-                                       'coll',
-                                       'pell',
-                                       'gl',
-                                       'agl',
-                                       'dagl',
-                                       'degl',
-                                       'negl',
-                                       'sugl',
-                                       'un',
-                                       'm',
-                                       't',
-                                       's',
-                                       'v',
-                                       'd'
+                                       'c', 'l', 'all', 'dall', 'dell', 'nell', 'sull',
+                                       'coll', 'pell', 'gl', 'agl', 'dagl', 'degl', 'negl',
+                                       'sugl', 'un', 'm', 't', 's', 'v', 'd'
                                ],
                        ];
                        $config[ 'filter' ][ 'italian_stop' ] = [
@@ -768,6 +750,34 @@
                        $config[ 'analyzer' ][ 'lowercase_keyword' ][ 'filter' ][] = 'asciifolding_preserve';
 
                        // In Italian text_search is just a copy of text
+                       $config[ 'analyzer' ][ 'text_search' ] = $config[ 'analyzer' ][ 'text' ];
+                       break;
+               case 'japanese':
+                       // See https://www.mediawiki.org/wiki/User:TJones_(WMF)/T166731
+                       $config[ 'char_filter' ][ 'fullwidthnumfix' ] = [
+                               // pre-convert fullwidth numbers because Kuromoji tokenizer treats them weirdly
+                               'type' => 'mapping',
+                               'mappings' => [
+                                       "\uff10=>0", "\uff11=>1", "\uff12=>2", "\uff13=>3",
+                                       "\uff14=>4", "\uff15=>5", "\uff16=>6", "\uff17=>7",
+                                       "\uff18=>8", "\uff19=>9",
+                               ],
+                       ];
+
+                       $config[ 'analyzer' ][ 'text' ] = [
+                               'type' => 'custom',
+                               'char_filter' => [ 'fullwidthnumfix' ],
+                               'tokenizer' => 'kuromoji_tokenizer',
+                       ];
+
+                       $filters = [];
+                       $filters[] = 'kuromoji_baseform';
+                       $filters[] = 'cjk_width';
+                       $filters[] = 'ja_stop';
+                       $filters[] = 'kuromoji_stemmer';
+                       $filters[] = 'lowercase';
+                       $config[ 'analyzer' ][ 'text' ][ 'filter' ] = $filters;
+
                        $config[ 'analyzer' ][ 'text_search' ] = $config[ 'analyzer' ][ 'text' ];
                        break;
                case 'russian':
@@ -1050,7 +1060,7 @@
                // For Hebrew, see https://www.mediawiki.org/wiki/User:TJones_(WMF)/T162741
 
                'analysis-stempel' => [ 'pl' => 'polish' ],
-               'analysis-kuromoji' => [ 'ja' => 'kuromoji' ],
+               'analysis-kuromoji' => [ 'ja' => 'japanese' ],
                'analysis-stconvert,analysis-smartcn' => [ 'zh' => 'chinese' ],
                'analysis-hebrew' => [ 'he' => 'hebrew' ],
                'analysis-ukrainian' => [ 'uk' => 'ukrainian' ],

-- 
To view, visit https://gerrit.wikimedia.org/r/365251
To unsubscribe, visit https://gerrit.wikimedia.org/r/settings

Gerrit-MessageType: newchange
Gerrit-Change-Id: I133cdc9affa3ed308a46a87892e069cd7461848e
Gerrit-PatchSet: 1
Gerrit-Project: mediawiki/extensions/CirrusSearch
Gerrit-Branch: master
Gerrit-Owner: Tjones <[email protected]>

_______________________________________________
MediaWiki-commits mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/mediawiki-commits
