RE: Re(2): Re: [Lucene-dev] Katakana characters in queries (a bug?)

Halácsy Péter Tue, 30 Oct 2001 23:53:59 -0800

Hello,

> -----Original Message-----
> From: Doug Cutting [mailto:[EMAIL PROTECTED]]
> >
> > I think IDENTIFIER_CHAR doesn't need to be the first char so my
> > proposal is:
> > <TERM: ( ~["\"", " ", "\t", "(", ")", ":", "&", "|", "^", "*", "?",
> >          "~", "{", "}", "[", "]" ] )+ >
>
> That looks like the right approach to me.
>
> > On the other hand, the IDENTIFIER, ALPHA_CHAR and ALPHANUM_CHAR tokens are
> > defined but not used.
>
> So let's remove them!
I also removed _NEWLINE, _QCHAR and _RESTOFLINE. They weren't used.
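
To illustrate what the proposed TERM definition accepts, here is a minimal Java sketch that mirrors the excluded-character set as a regular expression. The names TermCharSketch and isTerm are hypothetical and not part of Lucene. The real lexer is generated by JavaCC from the grammar, so this only approximates the character class, not the tokenizer as a whole.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Lucene: approximates the proposed TERM token.
final class TermCharSketch {
    // One or more characters, none of which is in the excluded set of the
    // proposed TERM definition: " space tab ( ) : & | ^ * ? ~ { } [ ]
    private static final Pattern TERM =
        Pattern.compile("[^\" \\t():&|^*?~{}\\[\\]]+");

    // True if the whole string consists only of allowed term characters.
    static boolean isTerm(String s) {
        return TERM.matcher(s).matches();
    }

    public static void main(String[] args) {
        // Katakana word, written with Unicode escapes to keep the source ASCII.
        System.out.println(isTerm("\u30AB\u30BF\u30AB\u30CA"));
        // Contains ':', which is excluded from term characters.
        System.out.println(isTerm("title:foo"));
    }
}
```

Because the class is defined by exclusion, any character outside the listed punctuation is accepted, including Katakana and Hungarian letters. Whether the generated lexer can actually read such characters depends on the character stream it is built with.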

> > These changes yield the following token definitions in QueryParser.jj:
>
> <*> TOKEN : {
>   <#_NUM_CHAR:   ["0"-"9"] >
> | <#_TERM_CHAR: ~["\"", " ", "\t", "(", ")", ":", "&", "|",
>                   "^", "*", "?", "~", "{", "}", "[", "]" ] >
> | <#_NEWLINE:    ( "\r\n" | "\r" | "\n" ) >
> | <#_WHITESPACE: ( " " | "\t" ) >
> | <#_QCHAR:      ( "\\" (<_NEWLINE> | ~["a"-"z", "A"-"Z",
>                                         "0"-"9"] ) ) >
> | <#_RESTOFLINE: (~["\r", "\n"])* >
> }
>
> <DEFAULT> TOKEN : {
>   <AND:       ("AND" | "&&") >
> | <OR:        ("OR" | "||") >
> | <NOT:       ("NOT" | "!") >
> | <PLUS:      "+" >
> | <MINUS:     "-" >
> | <LPAREN:    "(" >
> | <RPAREN:    ")" >
> | <COLON:     ":" >
> | <CARAT:     "^" >
> | <STAR:      "*" >
> | <QUOTED:     "\"" (~["\""])+ "\"">
> | <NUMBER:    (["+","-"])? (<_NUM_CHAR>)+ "." (<_NUM_CHAR>)+ >
> | <TERM:      (<_TERM_CHAR>)+ >
> | <FUZZY:     "~" >
> | <WILDTERM: <_TERM_CHAR>
>               ( ~["\"", " ", "\t", "(", ")", ":", "&", "|", "^", "~", "{",
>                   "}", "[", "]" ] )+ <_TERM_CHAR>>
> | <RANGEIN:   "[" (~["]"])+ "]">
> | <RANGEEX:   "{" (~["}"])+ "}">
> }
>
> <DEFAULT> SKIP : {
>   <<_WHITESPACE>>
> }
>
> Can folks try these and tell me if it solves the problem?
>

I tried them, but they do not solve every problem, because the generated parser cannot handle characters outside ISO-LATIN1. Some accented characters are defined in ISO-LATIN1, but two Hungarian characters, for example, exist only in ISO-LATIN2. (I don't know about Katakana characters.)

The QueryParser.jj file in CVS uses ASCII_CharStream (Latin1):
"ASCII_CharStream generated when neither of the two options - UNICODE_INPUT or JAVA_UNICODE_ESCAPE is set.
This class treats the input as a stream of 1-byte (ISO-LATIN1) characters. Note that this class can also be used to parse binary files. It just reads a byte and returns it as a 16 bit quantity to the lexical analyzer. So any character returned by this class will be in the range '\u0000'-'\u00ff'. " (source: http://www.webgain.com/products/java_cc/charstream.html)

I prefer Unicode, since QueryParser is commonly used through its String constructor, and that string is fed to a StringReader object (which can return non-ASCII characters). That is why I added the UNICODE_INPUT=true option.

I tested with some Unicode-specific characters (for example Unicode 337) and got good results.
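
As a rough illustration of the mismatch, the Java sketch below checks whether a character fits in the '\u0000'-'\u00ff' range that ISO-LATIN1 covers, and shows that a StringReader hands back the full 16-bit character. It assumes that "Unicode 337" means decimal code point 337 (U+0151). The class and method names are hypothetical, and the sketch does not use any JavaCC-generated code.

```java
import java.io.IOException;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; it does not touch QueryParser or JavaCC classes.
final class Latin1RangeSketch {
    // True if the character can be encoded in ISO-8859-1 (ISO-LATIN1).
    static boolean fitsLatin1(char c) {
        return StandardCharsets.ISO_8859_1.newEncoder().canEncode(c);
    }

    // Returns the first character a StringReader delivers, as an int.
    static int firstCharFromReader(String s) {
        try (StringReader reader = new StringReader(s)) {
            return reader.read();
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(fitsLatin1('\u00E9')); // e with acute, in ISO-LATIN1
        System.out.println(fitsLatin1('\u0151')); // o with double acute, ISO-LATIN2 only
        System.out.println(fitsLatin1('\u30AB')); // a Katakana character
        // Assumption: decimal 337 == U+0151; the reader returns it unchanged.
        System.out.println(firstCharFromReader("\u0151"));
    }
}
```

The point of the sketch is only that a Reader built from a Java String can return values above 0x00ff, which a lexer limited to 1-byte characters has no way to classify.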

I have attached the modified .jj file; perhaps it will help.

> Ideally we should add some cases for this to the junit tests, but I can't
> get junit to work at all right now... Have the junit tests ever run
> correctly from ant since the move to Jakarta? Can someone more familiar
> with junit have a look at this?
>
> Doug
>

peter

QueryParser.jj
