I admit not understanding sigram/CJK issues fully, but I trust Doug does, so I'm +0.
Otis

--- Doug Cutting <[EMAIL PROTECTED]> wrote:
> +1
>
> I'm willing to include this patch in 1.3 final. Are there any who see
> problems with it or otherwise oppose it?
>
> Doug
>
> John McNally wrote:
> > I'd certainly like to see a resolution to my sigram/cjk
> > question/proposal a few days ago. It might not be a high priority
> > issue, but I think if there is agreement it is a simple fix.
> >
> > I'm sure I'm discussing stuff that is well known in this community, but
> > will give some background just in case. There are three main ways to
> > create tokens out of text: character, multi-character (n-gram), and
> > word. Words are generally considered the best, though for CJK languages
> > using words means using a dictionary, since delimiters such as
> > whitespace are not usually used, which increases complexity quite a bit.
> >
> > An n-gram index usually has better precision than a character based
> > index but a much larger index size. There is a bigram analyzer posted
> > as an enhancement in bugzilla.
> >
> > A character based index leads to long lists for each key, but despite
> > that inefficiency, such indexes are easy to implement and have been
> > shown to be useful for CJK; one can even use phrase matching to get
> > word matches. There was a patch made which uses the term sigram, which
> > I interpret to mean character based indexing. It, however, appears
> > flawed. It is treating all consecutive CJK characters as a token, which
> > in the case where there are no non-CJK characters in the text is the
> > same as whole document matching. As this is almost the same behavior
> > that was available prior to the patch, I think I am right in thinking
> > there is a bug.
> >
> > The patch could be small:
> > --- StandardTokenizer.jj-orig 2003-12-19 16:56:31.000000000 -0800
> > +++ StandardTokenizer.jj 2003-12-19 16:54:43.000000000 -0800
> > @@ -125,7 +125,7 @@
> >  (<LETTER>|<DIGIT>)*
> >  >
> >
> > -| < SIGRAM: (<CJK>)+ >
> > +| < SIGRAM: (<CJK>) >
> >  | < #ALPHA: (<LETTER>)+>
> >  | < #LETTER: // unicode letters
> >  [
> >
> >
> > I would think that removing SIGRAM and only using CJK as the token would
> > be better, but I don't have a setup to test these changes.
> >
> > Any chance this can be addressed?
> >
> > john mcnally
> >
> >
> > On Fri, 2003-12-19 at 13:31, Doug Cutting wrote:
> >
> >>I'm thinking of making a 1.3 final release in the next few days.
> >>
> >>Any objections?
> >>
> >>Doug
> >>
> >>
> >>---------------------------------------------------------------------
> >>To unsubscribe, e-mail: [EMAIL PROTECTED]
> >>For additional commands, e-mail: [EMAIL PROTECTED]
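
For anyone following the thread without a Lucene build at hand, the difference between the two SIGRAM rules in John's diff can be sketched outside the grammar. This is only a rough Python illustration, not the JavaCC tokenizer or any Lucene API. The names CJK, WORD, RUN_RULE, CHAR_RULE, tokenize and bigrams are hypothetical, and the character ranges are a simplified assumption that does not reproduce the grammar's actual <CJK> definition.

```python
import re

# Assumption: a simplified stand-in for the grammar's <CJK> class
# (hiragana, katakana and the CJK Unified Ideographs block only).
CJK = r"[\u3040-\u30ff\u4e00-\u9fff]"
# Assumption: a crude ASCII-only stand-in for the alphanumeric rules.
WORD = r"[A-Za-z0-9]+"

# Rule before the proposed change, (<CJK>)+ : one token per CJK run.
RUN_RULE = re.compile(rf"{WORD}|{CJK}+")
# Rule after the proposed change, (<CJK>) : one token per CJK character.
CHAR_RULE = re.compile(rf"{WORD}|{CJK}")


def tokenize(text, rule):
    """Return every non-overlapping match of `rule` in `text`, in order."""
    return rule.findall(text)


def bigrams(text):
    """Overlapping two-character tokens over each run of CJK characters.

    A run of length one yields no bigram in this sketch.
    """
    out = []
    for run in re.findall(rf"{CJK}+", text):
        out.extend(run[i:i + 2] for i in range(len(run) - 1))
    return out


if __name__ == "__main__":
    sample = "Lucene \u5168\u6587\u691c\u7d22"
    print(tokenize(sample, RUN_RULE))   # whole CJK run becomes one token
    print(tokenize(sample, CHAR_RULE))  # each CJK character is a token
    print(bigrams(sample))              # the n-gram alternative, n = 2
```

With the run rule, a text made up entirely of CJK characters collapses into a single token, which is the whole-document matching John describes. With the per-character rule, a word can still be found by searching for its characters as a phrase, at the cost of longer posting lists per term.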
