Thanks Dawid, that was it. I'm now using an empty stoptags set and I'm seeing all the expected tokens.
Jerome

From: Dawid Weiss <[email protected]>
To: [email protected],
Date: 01/18/2013 02:52 PM
Subject: Re: Japanese analyzer

Jerome,

Some of the tokens are removed because their part of speech tags are in
the stoptags file? That's my guess at least -- you can always try to
copy/paste Japanese analyzer and change the token stream components:

  protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
    Tokenizer tokenizer = new JapaneseTokenizer(reader, userDict, true, mode);
    TokenStream stream = new JapaneseBaseFormFilter(tokenizer);
    // << this is the thing I was talking about.
    stream = new JapanesePartOfSpeechStopFilter(true, stream, stoptags);
    stream = new CJKWidthFilter(stream);
    stream = new StopFilter(matchVersion, stream, stopwords);
    stream = new JapaneseKatakanaStemFilter(stream);
    stream = new LowerCaseFilter(matchVersion, stream);
    return new TokenStreamComponents(tokenizer, stream);
  }

Dawid

On Fri, Jan 18, 2013 at 2:46 PM, Jerome Lanneluc <[email protected]> wrote:
> Thanks for your answer.
>
> No, those words are not part of the stop word file (I'm using the one that
> comes with the Japanese analyzer in lucene-kuromoji-3.6.1.jar).
>
> My Japanese contact told me that the first sentence means "I am Japanese"
> and the second one is a unit of length.
>
> Jerome
>
>
>
> From: Swapnil Patil <[email protected]>
> To: [email protected],
> Date: 01/18/2013 02:33 PM
> Subject: Re: Japanese analyzer
>
>
>
> Hi,
>
> I just translated these words, using google translate look like Japanese
> I [
> Can you check if these words are in your stopword file?
> If these words exist in your stop word file then you will not get them in
> token stream.
>
> -Swapnil
>
> On Fri, Jan 18, 2013 at 6:58 PM, Jerome Lanneluc <[email protected]> wrote:
>
>> [私 日本人
>
>
>
> Sauf indication contraire ci-dessus:/ Unless stated otherwise above:
> Compagnie IBM France
> Siège Social : 17 avenue de l'Europe, 92275 Bois-Colombes Cedex
> RCS Nanterre 552 118 465
> Forme Sociale : S.A.S.
> Capital Social : 653.242.306,20 €
> SIREN/SIRET : 552 118 465 03644 - Code NAF 6202A
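
For anyone finding this thread later: the effect discussed above (tokens disappearing because of their part-of-speech tag, and reappearing once the stoptags set is empty) can be illustrated with a small stand-alone sketch. This is a toy model in plain Java, not Lucene code. The class PosStopFilterSketch, the Token type, the sample tokens and the tags assigned to them are all made up for illustration; the real filtering is done by JapanesePartOfSpeechStopFilter using whatever stoptags set the analyzer is given.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy model only: mimics the idea of dropping tokens by part-of-speech tag.
// None of these names are Lucene APIs.
public class PosStopFilterSketch {

    /** A token reduced to its surface form and a part-of-speech tag. */
    public static final class Token {
        final String surface;
        final String posTag;

        public Token(String surface, String posTag) {
            this.surface = surface;
            this.posTag = posTag;
        }
    }

    /** Returns the surface forms of all tokens whose tag is NOT in stopTags. */
    public static List<String> filter(List<Token> tokens, Set<String> stopTags) {
        List<String> kept = new ArrayList<String>();
        for (Token t : tokens) {
            if (!stopTags.contains(t.posTag)) {
                kept.add(t.surface);
            }
        }
        return kept;
    }

    /** Hypothetical tokenization of a short sentence; tags are illustrative. */
    public static List<Token> sampleTokens() {
        return Arrays.asList(
            new Token("私", "名詞-代名詞-一般"),
            new Token("は", "助詞-係助詞"),
            new Token("日本人", "名詞-一般"),
            new Token("です", "助動詞"));
    }

    public static void main(String[] args) {
        List<Token> tokens = sampleTokens();

        // With some tags listed as stop tags, the particle and the
        // auxiliary verb are dropped from the output.
        Set<String> stopTags =
            new HashSet<String>(Arrays.asList("助詞-係助詞", "助動詞"));
        System.out.println(filter(tokens, stopTags));

        // With an empty stop-tag set, every token passes through.
        System.out.println(filter(tokens, Collections.<String>emptySet()));
    }
}
```

The point of the sketch is only that the filter decides by tag, not by the word itself, which is why checking the stopword file alone does not explain the missing tokens.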
