The current ordering of JapaneseAnalyser's token filters is as follows
(roughly sketched in code after the list):
    1. JapaneseBaseFormFilter
    2. JapanesePartOfSpeechStopFilter
    3. CJKWidthFilter (similar to our NormaliseFilter)
    4. StopFilter
    5. JapaneseKatakanaStemFilter
    6. LowerCaseFilter
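For concreteness, that chain comes out roughly as follows. This is a sketch
only: I'm using Lucene 5.x/6.x-style constructors and the analyser's default
stop word and POS tag sets, and the exact package paths and constructor
arguments move around between releases.

    import java.util.Set;

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.Tokenizer;
    import org.apache.lucene.analysis.cjk.CJKWidthFilter;
    import org.apache.lucene.analysis.core.LowerCaseFilter;
    import org.apache.lucene.analysis.core.StopFilter;
    import org.apache.lucene.analysis.ja.JapaneseAnalyzer;
    import org.apache.lucene.analysis.ja.JapaneseBaseFormFilter;
    import org.apache.lucene.analysis.ja.JapaneseKatakanaStemFilter;
    import org.apache.lucene.analysis.ja.JapanesePartOfSpeechStopFilter;
    import org.apache.lucene.analysis.ja.JapaneseTokenizer;
    import org.apache.lucene.analysis.util.CharArraySet;

    // Sketch of the current Japanese chain, in the order listed above.
    public class CurrentJapaneseChain extends Analyzer {
        private final CharArraySet stopWords = JapaneseAnalyzer.getDefaultStopSet();
        private final Set<String> stopTags = JapaneseAnalyzer.getDefaultStopTags();

        @Override
        protected TokenStreamComponents createComponents(String fieldName) {
            Tokenizer tokenizer = new JapaneseTokenizer(null, true, JapaneseTokenizer.DEFAULT_MODE);
            TokenStream stream = new JapaneseBaseFormFilter(tokenizer);    // 1. base-form "stemming"
            stream = new JapanesePartOfSpeechStopFilter(stream, stopTags); // 2. POS-based stop filter
            stream = new CJKWidthFilter(stream);                           // 3. width normalisation
            stream = new StopFilter(stream, stopWords);                    // 4. stop words
            stream = new JapaneseKatakanaStemFilter(stream);               // 5. katakana stemming
            stream = new LowerCaseFilter(stream);                          // 6. lower-casing
            return new TokenStreamComponents(tokenizer, stream);
        }
    }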

Our existing support for English applies token filters in the following
order (again, a code sketch follows the list):
    1. Various tokenisation hacks which we use to avoid having to fix
the tokeniser itself.
    2. Normalisation
    2.1. NormaliseFilter
    2.2. LowerCaseFilter
    3. StopFilter
    4. PorterStemFilter
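The English side, again only as a sketch: the tokenisation hacks and
NormaliseFilter are our own in-house code, so I've left them out here and
used Lucene's default English stop set as a stand-in for our list.

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.Tokenizer;
    import org.apache.lucene.analysis.core.LowerCaseFilter;
    import org.apache.lucene.analysis.core.StopFilter;
    import org.apache.lucene.analysis.en.EnglishAnalyzer;
    import org.apache.lucene.analysis.en.PorterStemFilter;
    import org.apache.lucene.analysis.standard.StandardTokenizer;
    import org.apache.lucene.analysis.util.CharArraySet;

    // Sketch of the current English chain, in the order listed above.
    public class CurrentEnglishChain extends Analyzer {
        private final CharArraySet stopWords = EnglishAnalyzer.getDefaultStopSet();

        @Override
        protected TokenStreamComponents createComponents(String fieldName) {
            Tokenizer tokenizer = new StandardTokenizer();        // 1. plus our tokenisation hacks
            TokenStream stream = new LowerCaseFilter(tokenizer);  // 2. normalisation (NormaliseFilter omitted)
            stream = new StopFilter(stream, stopWords);           // 3. stop words
            stream = new PorterStemFilter(stream);                // 4. stemming
            return new TokenStreamComponents(tokenizer, stream);
        }
    }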

I'm wondering a couple of things.

1) Is it right (or at least intentional) that in JapaneseAnalyser one
stemming filter (JapaneseBaseFormFilter) comes before the normalisation
and another (JapaneseKatakanaStemFilter) comes after it?

2) How much leeway do I have with changing the order? Ideally, I would
like to line up the pipeline something like this:

    1. Tokenisation
        If English, StandardTokeniser plus all our hacks.
        If Japanese, JapaneseTokeniser.

    2. Normalisation
        NormaliseFilter
        LowerCaseFilter

    3. Stop words (user can opt out of this feature)
        If Japanese, JapanesePartOfSpeechStopFilter
        StopFilter (list of stop words different per language)

    4. Stemming (user can opt out of this feature)
        If English, PorterStemFilter
        If Japanese, JapaneseBaseFormFilter
        If Japanese, JapaneseKatakanaStemFilter

Stop words and stemming could be swapped, but the main thing is that the
user-setting-dependent parts would be grouped together into a fairly
logical arrangement (see the sketch below), instead of the method turning
into a spaghetti mess of different option checks.
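A rough sketch of what I have in mind, mainly to show how the option checks
would group by stage. The flags here are stand-ins for whatever our settings
object actually exposes, CJKWidthFilter stands in for our NormaliseFilter,
and constructor details are elided as before.

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.Tokenizer;
    import org.apache.lucene.analysis.cjk.CJKWidthFilter;
    import org.apache.lucene.analysis.core.LowerCaseFilter;
    import org.apache.lucene.analysis.core.StopFilter;
    import org.apache.lucene.analysis.en.EnglishAnalyzer;
    import org.apache.lucene.analysis.en.PorterStemFilter;
    import org.apache.lucene.analysis.ja.JapaneseAnalyzer;
    import org.apache.lucene.analysis.ja.JapaneseBaseFormFilter;
    import org.apache.lucene.analysis.ja.JapaneseKatakanaStemFilter;
    import org.apache.lucene.analysis.ja.JapanesePartOfSpeechStopFilter;
    import org.apache.lucene.analysis.ja.JapaneseTokenizer;
    import org.apache.lucene.analysis.standard.StandardTokenizer;
    import org.apache.lucene.analysis.util.CharArraySet;

    // Proposed ordering: tokenise, normalise, then the two opt-out stages.
    public class ProposedAnalyser extends Analyzer {
        private final boolean japanese;
        private final boolean stopWordsEnabled;
        private final boolean stemmingEnabled;

        public ProposedAnalyser(boolean japanese, boolean stopWordsEnabled, boolean stemmingEnabled) {
            this.japanese = japanese;
            this.stopWordsEnabled = stopWordsEnabled;
            this.stemmingEnabled = stemmingEnabled;
        }

        @Override
        protected TokenStreamComponents createComponents(String fieldName) {
            // 1. Tokenisation (the English side would also get our tokenisation hacks)
            Tokenizer tokenizer = japanese
                    ? new JapaneseTokenizer(null, true, JapaneseTokenizer.DEFAULT_MODE)
                    : new StandardTokenizer();
            TokenStream stream = tokenizer;

            // 2. Normalisation
            stream = new CJKWidthFilter(stream);
            stream = new LowerCaseFilter(stream);

            // 3. Stop words (user can opt out)
            if (stopWordsEnabled) {
                if (japanese) {
                    stream = new JapanesePartOfSpeechStopFilter(stream, JapaneseAnalyzer.getDefaultStopTags());
                }
                CharArraySet stopWords = japanese
                        ? JapaneseAnalyzer.getDefaultStopSet()
                        : EnglishAnalyzer.getDefaultStopSet();   // per-language stop word list
                stream = new StopFilter(stream, stopWords);
            }

            // 4. Stemming (user can opt out)
            if (stemmingEnabled) {
                if (japanese) {
                    stream = new JapaneseBaseFormFilter(stream);
                    stream = new JapaneseKatakanaStemFilter(stream);
                } else {
                    stream = new PorterStemFilter(stream);
                }
            }

            return new TokenStreamComponents(tokenizer, stream);
        }
    }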

Daniel
