Hi Chris,

Thank you for your info. With CJKAnalyzer, the diagnosis is as follows:

        pos   start   end
        Inc   Ofst    Ofst
[Aa]    1     0       2
[aa]    1     1       3
[aB]    1     2       4
[BC]    1     3       5
[Cc]    1     4       6
[cD]    1     5       7
[Dd]    1     6       8
[dE]    1     7       9
[EF]    1     8       10
[FG]    1     9       11
[Gg]    1     10      12
[gH]    1     11      13
[Hh]    1     12      14
[hI]    1     13      15
[Ii]    1     14      16
[iJ]    1     15      17
[JK]    1     16      18
[Kk]    1     17      19
[kL]    1     18      20
[LM]    1     19      21
[Mm]    1     20      22
[mN]    1     21      23

<B>AaaBCcDdEFGgHhIiJKkLMmN</B>

CJKAnalyzer produces a TokenStream in which all tokens overlap,
as Mark pointed out. But JapaneseAnalyzer produces a stream of tokens
that do not overlap, as I showed in my previous mail.

BTW, I couldn't find CJKHighlighter and CJKHighlighterAnalyzer
in the sandbox...

Koji

> -----Original Message-----
> From: Chris Lu [mailto:[EMAIL PROTECTED]
> Sent: Tuesday, September 06, 2005 3:53 PM
> To: java-user@lucene.apache.org
> Subject: Re: Highlighter apply to Japanese
>
>
> Hi, Koji,
>
> I had the same problem as you. This is because CJK's n-gram analysis
> is different from single character's.
>
> My get around is to use CJKHighlighter and
> CJKHighlightAnalyzer in sandbox.
>
> --
> Chris Lu
> ------------
> Lucene Search RAD on Any Database
> http://www.dbsight.net
>
>
> On 9/5/05, Koji Sekiguchi <[EMAIL PROTECTED]> wrote:
> > Hi again,
> >
> > I'm using highlighter to highlight terms in Japanese text,
> > but I cannot get preferable output.
> >
> > If I use StandardAnalyzer or SnowballAnalyzer w/ English,
> > getBestFragment() returns preferable outputs:
> >
> > Sample: (SnowballAnalyzer)
> > Text: A meeting will be held in the City Hall
> > TokenStream:
> > [a][meet][will][be][held][in][the][citi][hall]
> > Query Text: meet
> > Output: A <B>meeting</B> will be held in the City Hall
> >
> > But if I use JapaneseAnalyzer, which is most popular Analyzer
> > in Japan to get TokenStream from Japanese text, to highlight
> > Japanese text with Highlighter, whole text is highlighted:
> >
> > Sample: (JapaneseAnalyzer)
> > Text: AMeetingWillBeHeldInTheCityHall
> > TokenStream:
> > [A][Meeting][Will][Be][Held][In][The][City][Hall]
> > Query Text: Meeting
> > Output: <B>AMeetingWillBeHeldInTheCityHall</B>
> >
> > Please note that I use alphabet to show the Text at second sample
> > because most users in this mailing list can read it, but in reality,
> > I used Japanese characters for the Text. And you'll see that
> > JapaneseAnalyzer,
> > which uses Japanese dictionary on background to extract tokens
> > from text stream, can recognize tokens and produce TokenStream.
> > But highlighter.getBestFragment() highlighted whole text.
> >
> > Do I need to implement Fragmenter to highlight tokens correctly
> > for Japanese text?
> >
> > Thanks in advance,
> >
> > Koji
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: [EMAIL PROTECTED]
> > For additional commands, e-mail: [EMAIL PROTECTED]
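
For reference, here is a minimal sketch of the difference between the two
token streams discussed in this thread. It does not use Lucene at all:
the class BigramOverlapDemo, the Tok type, and the methods bigrams(),
adjacent() and anyOverlap() are hypothetical names made up for
illustration. bigrams() only mimics the offsets shown in the CJKAnalyzer
diagnosis table, and adjacent() mimics a dictionary-style stream such as
[A][Meeting][Will]... where each token starts where the previous one ends.

```java
import java.util.ArrayList;
import java.util.List;

public class BigramOverlapDemo {

    /** Hypothetical token holder (not Lucene's Token): text plus start/end offsets. */
    static final class Tok {
        final String text;
        final int start;
        final int end;

        Tok(String text, int start, int end) {
            this.text = text;
            this.start = start;
            this.end = end;
        }
    }

    /** Two-character sliding window, giving offsets 0-2, 1-3, 2-4, ... */
    static List<Tok> bigrams(String s) {
        List<Tok> out = new ArrayList<>();
        for (int i = 0; i + 2 <= s.length(); i++) {
            out.add(new Tok(s.substring(i, i + 2), i, i + 2));
        }
        return out;
    }

    /** Words laid end to end: each token starts exactly where the previous one ends. */
    static List<Tok> adjacent(String... words) {
        List<Tok> out = new ArrayList<>();
        int pos = 0;
        for (String w : words) {
            out.add(new Tok(w, pos, pos + w.length()));
            pos += w.length();
        }
        return out;
    }

    /** True if any token starts before the previous token has ended. */
    static boolean anyOverlap(List<Tok> toks) {
        for (int i = 1; i < toks.size(); i++) {
            if (toks.get(i).start < toks.get(i - 1).end) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<Tok> cjkLike = bigrams("AaaBCcDdEFGgHhIiJKkLMmN");
        for (Tok t : cjkLike) {
            System.out.println("[" + t.text + "] " + t.start + " " + t.end);
        }
        System.out.println("bigram stream overlaps: " + anyOverlap(cjkLike));

        List<Tok> wordLike = adjacent("A", "Meeting", "Will", "Be", "Held",
                                      "In", "The", "City", "Hall");
        System.out.println("word stream overlaps: " + anyOverlap(wordLike));
    }
}
```

The bigram stream reports overlaps and the word stream does not, which
matches the two diagnoses above. This only illustrates the offsets; it
does not explain why the highlighter marks the whole text in either case.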