Re: Japanese analyzer

Jerome Lanneluc Fri, 18 Jan 2013 07:18:04 -0800

Thanks Dawid, that was it. I'm now using an empty stoptags set and I'm 
seeing all the expected tokens.


Jerome



From:   Dawid Weiss <[email protected]>
To:     [email protected], 
Date:   01/18/2013 02:52 PM
Subject:        Re: Japanese analyzer



Jerome,

Some of the tokens are probably removed because their part-of-speech
tags are in the stoptags file. That's my guess at least -- you can
always copy/paste JapaneseAnalyzer and change the token stream
components:

  protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
    Tokenizer tokenizer = new JapaneseTokenizer(reader, userDict, true, mode);
    TokenStream stream = new JapaneseBaseFormFilter(tokenizer);
    // << this is the thing I was talking about.
    stream = new JapanesePartOfSpeechStopFilter(true, stream, stoptags);
    stream = new CJKWidthFilter(stream);
    stream = new StopFilter(matchVersion, stream, stopwords);
    stream = new JapaneseKatakanaStemFilter(stream);
    stream = new LowerCaseFilter(matchVersion, stream);
    return new TokenStreamComponents(tokenizer, stream);
  }
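
To illustrate the effect of a part-of-speech stop filter, here is a toy sketch in plain Java. It does not use Lucene: the class PosStopFilterSketch, the token texts, and the tags "NOUN" and "PARTICLE" are all made up for illustration and are not the tags from the real stoptags file. It only shows the idea that a token is dropped when its tag is in the stop-tag set, and that an empty set lets every token through.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class PosStopFilterSketch {

    /** A token paired with a part-of-speech tag (both hypothetical). */
    static final class TaggedToken {
        final String text;
        final String posTag;

        TaggedToken(String text, String posTag) {
            this.text = text;
            this.posTag = posTag;
        }
    }

    /** Returns the texts of all tokens whose tag is NOT in stopTags. */
    static List<String> filter(List<TaggedToken> tokens, Set<String> stopTags) {
        List<String> kept = new ArrayList<String>();
        for (TaggedToken token : tokens) {
            if (!stopTags.contains(token.posTag)) {
                kept.add(token.text);
            }
        }
        return kept;
    }

    /** Made-up sample input: three tokens, one of them tagged PARTICLE. */
    static List<TaggedToken> sampleTokens() {
        return Arrays.asList(
            new TaggedToken("alpha", "NOUN"),
            new TaggedToken("beta", "PARTICLE"),
            new TaggedToken("gamma", "NOUN"));
    }

    public static void main(String[] args) {
        List<TaggedToken> tokens = sampleTokens();

        // With PARTICLE in the stop-tag set, "beta" is silently dropped.
        Set<String> stopTags = new HashSet<String>(Arrays.asList("PARTICLE"));
        System.out.println(filter(tokens, stopTags)); // prints [alpha, gamma]

        // With an empty stop-tag set, every token survives.
        Set<String> noStopTags = Collections.<String>emptySet();
        System.out.println(filter(tokens, noStopTags)); // prints [alpha, beta, gamma]
    }
}
```

If tokens seem to vanish from the analyzer output, comparing the result with the default stop tags against the result with an empty set is a quick way to check whether this filter is the cause.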

Dawid

On Fri, Jan 18, 2013 at 2:46 PM, Jerome Lanneluc
<[email protected]> wrote:
> Thanks for your answer.
>
> No, those words are not part of the stop word file (I'm using the one
> that comes with the Japanese analyzer in lucene-kuromoji-3.6.1.jar).
>
> My Japanese contact told me that the first sentence means "I am
> Japanese" and the second one is a unit of length.
>
> Jerome
>
>
>
> From:   Swapnil Patil <[email protected]>
> To:     [email protected],
> Date:   01/18/2013 02:33 PM
> Subject:        Re: Japanese analyzer
>
>
>
> Hi,
>
> I just translated these words using Google Translate; they look like
> "Japanese I [".
> Can you check whether these words are in your stopword file?
> If these words exist in your stop word file, then you will not get
> them in the token stream.
>
> -Swapnil
>
> On Fri, Jan 18, 2013 at 6:58 PM, Jerome Lanneluc
> <[email protected]> wrote:
>
>> [私 日本人
>
>
>
> Unless stated otherwise above:
> Compagnie IBM France
> Registered office: 17 avenue de l'Europe, 92275 Bois-Colombes Cedex
> RCS Nanterre 552 118 465
> Legal form: S.A.S.
> Share capital: 653.242.306,20 €
> SIREN/SIRET: 552 118 465 03644 - Code NAF 6202A





