Hi Chris,

Thank you for the info.
With CJKAnalyzer, the diagnostics are as follows:

        pos     start   end
        Inc     Ofst    Ofst
[Aa]    1       0       2
[aa]    1       1       3
[aB]    1       2       4
[BC]    1       3       5
[Cc]    1       4       6
[cD]    1       5       7
[Dd]    1       6       8
[dE]    1       7       9
[EF]    1       8       10
[FG]    1       9       11
[Gg]    1       10      12
[gH]    1       11      13
[Hh]    1       12      14
[hI]    1       13      15
[Ii]    1       14      16
[iJ]    1       15      17
[JK]    1       16      18
[Kk]    1       17      19
[kL]    1       18      20
[LM]    1       19      21
[Mm]    1       20      22
[mN]    1       21      23

<B>AaaBCcDdEFGgHhIiJKkLMmN</B>

CJKAnalyzer produces a TokenStream in which all tokens overlap,
as Mark pointed out.
But JapaneseAnalyzer produces a stream of tokens that
do not overlap, as I showed in my previous mail.
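
To illustrate why a fully overlapping stream could end up highlighted as one
block, here is a minimal sketch in plain Java. It does not use Lucene. The
class OverlapGroupingSketch and its methods bigrams() and groupOverlapping()
are hypothetical names made up for this illustration, and the grouping rule
(a token joins the current group when it starts before the group's end
offset) is an assumption about how a highlighter might treat overlapping
tokens, not a copy of the Highlighter's code.

```java
import java.util.ArrayList;
import java.util.List;

public class OverlapGroupingSketch {

    /** A token's text plus its [start, end) character offsets in the source text. */
    static final class Token {
        final String text;
        final int start;
        final int end;

        Token(String text, int start, int end) {
            this.text = text;
            this.start = start;
            this.end = end;
        }
    }

    /** Emits overlapping character bigrams with the same offsets as the table above. */
    static List<Token> bigrams(String s) {
        List<Token> out = new ArrayList<>();
        for (int i = 0; i + 2 <= s.length(); i++) {
            out.add(new Token(s.substring(i, i + 2), i, i + 2));
        }
        return out;
    }

    /**
     * Assumed grouping rule: a token that starts before the current group's
     * end offset is merged into that group; otherwise it opens a new group.
     * Each returned array is {startOffset, endOffset} of one group.
     */
    static List<int[]> groupOverlapping(List<Token> tokens) {
        List<int[]> groups = new ArrayList<>();
        int[] current = null;
        for (Token t : tokens) {
            if (current != null && t.start < current[1]) {
                current[1] = Math.max(current[1], t.end);
            } else {
                current = new int[] { t.start, t.end };
                groups.add(current);
            }
        }
        return groups;
    }

    public static void main(String[] args) {
        String text = "AaaBCcDdEFGgHhIiJKkLMmN";
        for (int[] g : groupOverlapping(bigrams(text))) {
            System.out.println(g[0] + "-" + g[1]);
        }
    }
}
```

Under this assumed rule, the 22 bigrams collapse into a single group covering
offsets 0-23, so marking any one of them as a hit would mark the whole text.
Tokens that merely touch (for example 0-1 followed by 1-8) stay in separate
groups.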

BTW, I couldn't find CJKHighlighter and CJKHighlighterAnalyzer in
sandbox...

Koji

> -----Original Message-----
> From: Chris Lu [mailto:[EMAIL PROTECTED] 
> Sent: Tuesday, September 06, 2005 3:53 PM
> To: java-user@lucene.apache.org
> Subject: Re: Highlighter apply to Japanese
> 
> 
> Hi, Koji,
> 
> I had the same problem as you. This is because CJK n-gram analysis
> differs from single-character analysis.
> 
> My workaround is to use CJKHighlighter and
> CJKHighlightAnalyzer in the sandbox.
> 
> -- 
> Chris Lu
> ------------
> Lucene Search RAD on Any Database
> http://www.dbsight.net
> 
> 
> On 9/5/05, Koji Sekiguchi <[EMAIL PROTECTED]> wrote:
> > Hi again,
> > 
> > I'm using the highlighter to highlight terms in Japanese text,
> > but I cannot get the desired output.
> > 
> > If I use StandardAnalyzer or SnowballAnalyzer with English,
> > getBestFragment() returns the desired output:
> > 
> > Sample: (SnowballAnalyzer)
> > Text: A meeting will be held in the City Hall
> > TokenStream:
> > [a][meet][will][be][held][in][the][citi][hall]
> > Query Text: meet
> > Output: A <B>meeting</B> will be held in the City Hall
> > 
> > But if I use JapaneseAnalyzer, which is the most popular Analyzer
> > in Japan for getting a TokenStream from Japanese text, to highlight
> > Japanese text with Highlighter, the whole text is highlighted:
> > 
> > Sample: (JapaneseAnalyzer)
> > Text: AMeetingWillBeHeldInTheCityHall
> > TokenStream:
> > [A][Meeting][Will][Be][Held][In][The][City][Hall]
> > Query Text: Meeting
> > Output: <B>AMeetingWillBeHeldInTheCityHall</B>
> > 
> > Please note that I use the alphabet to show the Text in the second sample
> > because most users on this mailing list can read it, but in reality
> > I used Japanese characters for the Text. As you can see,
> > JapaneseAnalyzer,
> > which uses a Japanese dictionary in the background to extract tokens
> > from the text stream, can recognize tokens and produce a TokenStream.
> > But highlighter.getBestFragment() highlighted the whole text.
> > 
> > Do I need to implement a Fragmenter to highlight tokens correctly
> > for Japanese text?
> > 
> > Thanks in advance,
> > 
> > Koji
> > 
> > 
> > 
> > 
> > 
> ---------------------------------------------------------------------
> > To unsubscribe, e-mail: [EMAIL PROTECTED]
> > For additional commands, e-mail: [EMAIL PROTECTED]
> > 
> >
> 
> 
> 


