The current ordering of JapaneseAnalyser's token filters is as follows:

1. JapaneseBaseFormFilter
2. JapanesePartOfSpeechStopFilter
3. CJKWidthFilter (similar to NormaliseFilter)
4. StopFilter
5. JapaneseKatakanaStemFilter
6. LowerCaseFilter
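For reference, that chain corresponds to a createComponents() along these lines. This is a from-memory sketch of what the analyser builds, not the actual JapaneseAnalyser source, and constructor signatures and package names shift a bit between Lucene versions:

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.cjk.CJKWidthFilter;
import org.apache.lucene.analysis.core.LowerCaseFilter;  // package differs in newer Lucene
import org.apache.lucene.analysis.core.StopFilter;
import org.apache.lucene.analysis.ja.*;

Analyzer current = new Analyzer() {
    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        Tokenizer tok = new JapaneseTokenizer(null, true, JapaneseTokenizer.Mode.SEARCH);
        TokenStream ts = new JapaneseBaseFormFilter(tok);                              // 1. base-form stemming
        ts = new JapanesePartOfSpeechStopFilter(ts, JapaneseAnalyzer.getDefaultStopTags()); // 2. POS stop filter
        ts = new CJKWidthFilter(ts);                                                   // 3. width normalisation
        ts = new StopFilter(ts, JapaneseAnalyzer.getDefaultStopSet());                 // 4. stop words
        ts = new JapaneseKatakanaStemFilter(ts);                                       // 5. katakana stemming
        ts = new LowerCaseFilter(ts);                                                  // 6. lower-casing
        return new TokenStreamComponents(tok, ts);
    }
};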
Our existing support for English applies token filters in the following order:

1. Various tokenisation hacks, which we use to avoid having to fix the tokeniser itself.
2. Normalisation
   2.1. NormaliseFilter
   2.2. LowerCaseFilter
3. StopFilter
4. PorterStemFilter

I'm wondering a couple of things.

1) Is it {right/intentional/sane} that in JapaneseAnalyser, one stemming filter (JapaneseBaseFormFilter) comes before the normalisation and another (JapaneseKatakanaStemFilter) comes after it?

2) How much leeway do I have with changing the order? Ideally, I would like to line up the pipeline something like this (a rough sketch is in the P.S. below):

1. Tokenisation
   - If English, StandardTokeniser plus all our hacks.
   - If Japanese, JapaneseTokeniser.
2. Normalisation
   - NormaliseFilter
   - LowerCaseFilter
3. Stop words (user can opt out of this feature)
   - If Japanese, JapanesePartOfSpeechStopFilter
   - StopFilter (list of stop words differs per language)
4. Stemming (user can opt out of this feature)
   - If English, PorterStemFilter
   - If Japanese, JapaneseBaseFormFilter
   - If Japanese, JapaneseKatakanaStemFilter

Stop words and stemming could be swapped, but the main thing is that the user-setting-dependent parts would be grouped together into a fairly logical arrangement, instead of the method becoming a spaghetti mess of different option checks.

Daniel
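P.S. To make the grouping concrete, here is roughly what I have in mind. The class name and the japanese/stopWordsEnabled/stemmingEnabled flags are made-up stand-ins for our own settings, CJKWidthFilter stands in for our NormaliseFilter, and the constructors are from memory, so treat it as a sketch rather than working code:

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.CharArraySet;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.cjk.CJKWidthFilter;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.core.StopFilter;
import org.apache.lucene.analysis.en.PorterStemFilter;
import org.apache.lucene.analysis.ja.*;
import org.apache.lucene.analysis.standard.StandardTokenizer;

class UnifiedAnalyser extends Analyzer {
    private final boolean japanese;          // per-index language setting (hypothetical)
    private final boolean stopWordsEnabled;  // user opt-out (hypothetical)
    private final boolean stemmingEnabled;   // user opt-out (hypothetical)
    private final CharArraySet stopWords;    // stop word list, different per language

    UnifiedAnalyser(boolean japanese, boolean stopWordsEnabled, boolean stemmingEnabled,
                    CharArraySet stopWords) {
        this.japanese = japanese;
        this.stopWordsEnabled = stopWordsEnabled;
        this.stemmingEnabled = stemmingEnabled;
        this.stopWords = stopWords;
    }

    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        // 1. Tokenisation (our English tokenisation hacks would be layered on here)
        Tokenizer tok = japanese
            ? new JapaneseTokenizer(null, true, JapaneseTokenizer.Mode.SEARCH)
            : new StandardTokenizer();

        // 2. Normalisation (CJKWidthFilter standing in for NormaliseFilter)
        TokenStream ts = new CJKWidthFilter(tok);
        ts = new LowerCaseFilter(ts);

        // 3. Stop words
        if (stopWordsEnabled) {
            if (japanese) {
                ts = new JapanesePartOfSpeechStopFilter(ts, JapaneseAnalyzer.getDefaultStopTags());
            }
            ts = new StopFilter(ts, stopWords);
        }

        // 4. Stemming
        if (stemmingEnabled) {
            if (japanese) {
                ts = new JapaneseBaseFormFilter(ts);
                ts = new JapaneseKatakanaStemFilter(ts);
            } else {
                ts = new PorterStemFilter(ts);
            }
        }
        return new TokenStreamComponents(tok, ts);
    }
}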