Hello,
> -----Original Message-----
> From: Doug Cutting [mailto:[EMAIL PROTECTED]]
> >
> > I think IDENTIFIER_CHAR doesn't need to be the first char so my
> > proposal is:
> > <TERM: ( ~["\"", " ", "\t", "(", ")", ":", "&", "|", "^", "*", "?",
> > "~", "{", "}", "[", "]" ] )+ >
>
> That looks like the right approach to me.
>
> > On the other hand IDENTIFIER, ALPHA_CHAR, ALPHANUM_CHAR tokens are
> > defined but are not used.
>
> So let's remove them!
I also removed _NEWLINE, _QCHAR and _RESTOFLINE. They weren't used.
> > These changes yield the following token definitions in QueryParser.jj:
>
> <*> TOKEN : {
> <#_NUM_CHAR: ["0"-"9"] >
> | <#_TERM_CHAR: ~["\"", " ", "\t", "(", ")", ":", "&", "|",
> "^", "*", "?", "~", "{", "}", "[", "]" ] >
> | <#_NEWLINE: ( "\r\n" | "\r" | "\n" ) >
> | <#_WHITESPACE: ( " " | "\t" ) >
> | <#_QCHAR: ( "\\" (<_NEWLINE> | ~["a"-"z", "A"-"Z",
> "0"-"9"] ) ) >
> | <#_RESTOFLINE: (~["\r", "\n"])* >
> }
>
> <DEFAULT> TOKEN : {
> <AND: ("AND" | "&&") >
> | <OR: ("OR" | "||") >
> | <NOT: ("NOT" | "!") >
> | <PLUS: "+" >
> | <MINUS: "-" >
> | <LPAREN: "(" >
> | <RPAREN: ")" >
> | <COLON: ":" >
> | <CARAT: "^" >
> | <STAR: "*" >
> | <QUOTED: "\"" (~["\""])+ "\"">
> | <NUMBER: (["+","-"])? (<_NUM_CHAR>)+ "." (<_NUM_CHAR>)+ >
> | <TERM: (<_TERM_CHAR>)+ >
> | <FUZZY: "~" >
> | <WILDTERM: <_TERM_CHAR>
> ( ~["\"", " ", "\t", "(", ")", ":", "&", "|",
> "^", "~", "{",
> "}", "[", "]" ] )+ <_TERM_CHAR>>
> | <RANGEIN: "[" (~["]"])+ "]">
> | <RANGEEX: "{" (~["}"])+ "}">
> }
>
> <DEFAULT> SKIP : {
> <<_WHITESPACE>>
> }
>
> Can folks try these and tell me if it solves the problem?
>
I tried it, but it didn't solve all the problems, because the generated parser can't handle characters that are _not_ in ISO-LATIN1! Some accented characters are defined in ISO-LATIN1, but two Hungarian characters, for example, are only in ISO-LATIN2. (I don't know about Katakana characters.)
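To illustrate the coverage gap, here is a minimal Java sketch that asks each charset whether it can encode a given character. The class and method names are hypothetical (they are not part of Lucene), and I picked U+0151 (o with double acute) as one example of a Hungarian letter that is missing from ISO-LATIN1. It also assumes the JVM ships the optional ISO-8859-2 charset; if it doesn't, the helper simply reports false.

```java
import java.nio.charset.Charset;

public class LatinCoverage {
    /**
     * Returns true if the named charset is available on this JVM
     * and is able to encode the character c.
     */
    public static boolean canEncode(String charsetName, char c) {
        if (!Charset.isSupported(charsetName)) {
            return false; // optional charset not installed on this JVM
        }
        return Charset.forName(charsetName).newEncoder().canEncode(c);
    }

    public static void main(String[] args) {
        char eAcute = '\u00e9';       // e with acute accent
        char oDoubleAcute = '\u0151'; // o with double acute (Hungarian)

        for (String name : new String[] {"ISO-8859-1", "ISO-8859-2"}) {
            System.out.println(name
                + " e-acute=" + canEncode(name, eAcute)
                + " o-double-acute=" + canEncode(name, oDoubleAcute));
        }
    }
}
```

A character that ISO-8859-1 cannot encode has no value in the '\u0000'-'\u00ff' range, so a lexer limited to that range cannot match it.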
The QueryParser.jj file in CVS uses ASCII_CharStream (Latin1):
"ASCII_CharStream generated when neither of the two options - UNICODE_INPUT or JAVA_UNICODE_ESCAPE is set.
This class treats the input as a stream of 1-byte (ISO-LATIN1) characters. Note that this class can also be used to parse binary files. It just reads a byte and returns it as a 16 bit quantity to the lexical analyzer. So any character returned by this class will be in the range '\u0000'-'\u00ff'. " (source: http://www.webgain.com/products/java_cc/charstream.html)
I prefer Unicode, since QueryParser is commonly used through its String constructor, and this string is fed to a StringReader object (which can return non-ASCII characters). That's why I added the UNICODE_INPUT=true option.
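A small Java sketch of why the StringReader matters: it hands back full 16-bit char values, so a query string can contain values above 0xff. The class and method names here are made up for illustration, and the sketch does not touch the generated parser itself.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

public class ReaderRangeDemo {
    /**
     * Reads s through a StringReader and returns the largest char value
     * seen, or -1 if the string is empty.
     */
    public static int maxCharValue(String s) {
        int max = -1;
        try (Reader reader = new StringReader(s)) {
            int c;
            while ((c = reader.read()) != -1) {
                max = Math.max(max, c);
            }
        } catch (IOException e) {
            // Not expected for an in-memory reader; rethrow unchecked.
            throw new UncheckedIOException(e);
        }
        return max;
    }

    public static void main(String[] args) {
        int max = maxCharValue("a\u0151"); // 'a' followed by U+0151
        System.out.println("largest char value: " + max);
        System.out.println("fits in 0x00-0xff: " + (max <= 0xff));
    }
}
```

Any value above 0xff coming out of the reader is outside the range that the documentation quoted above gives for ASCII_CharStream.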
I tested with some Unicode-specific characters (for example, Unicode 337) and got good results.
I attached the modified .jj file; perhaps it can help.
> Ideally we should add some cases for this to the junit tests, but I can't
> get junit to work at all right now... Have the junit tests ever run
> correctly from ant since the move to Jakarta? Can someone more familiar
> with junit have a look at this?
>
> Doug
>
peter
QueryParser.jj