Olivier,

I'm no expert on this by any means, but I poked around in the sources this morning 
trying to understand where this problem may be occurring as I'm trying to get familiar 
with any internationalization problems I'm going to run into with Lucene. This message 
rambles on a bit, but follows my train of thought as I looked at this problem.

It doesn't look to me like the analyzer has anything to do with it. The problem 
occurs somewhere inside the lexical analysis generated from 
org.apache.lucene.queryParser.QueryParser.jj, so I looked there to get a better 
understanding of how that works.

To process your query, QueryParser reads it through a StringReader. So, my first 
question to you is: did the query make it out of your UTF-16 query file and get 
transcoded properly to a Java String (for instance, using an InputStreamReader 
constructed with the encoding set to UTF-16, or a String constructor where you 
supplied a byte[] and a charset)? One thing that makes me suspect transcoding: your 
stack trace encountered "\u0018" (24) where you expected \u0418, and dropping the 
high byte like that is exactly what decoding with the wrong charset can do.
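Here's a minimal sketch of what I mean by reading the file with an explicit charset 
(the class and file names are mine, just for the demo; it round-trips a UTF-16 temp 
file rather than your actual query file):

```java
import java.io.*;

public class ReadUtf16Query {
    public static void main(String[] args) throws IOException {
        // Write a small UTF-16 file standing in for the query file.
        File f = File.createTempFile("query", ".txt");
        f.deleteOnExit();
        Writer w = new OutputStreamWriter(new FileOutputStream(f), "UTF-16");
        w.write("\u0418ndex"); // Cyrillic capital I followed by "ndex"
        w.close();

        // Read it back, naming the charset explicitly; omitting it falls
        // back to the platform default encoding, which mangles the bytes.
        Reader r = new InputStreamReader(new FileInputStream(f), "UTF-16");
        StringBuilder sb = new StringBuilder();
        int c;
        while ((c = r.read()) != -1) sb.append((char) c);
        r.close();

        String query = sb.toString();
        // Prints 1048 (0x0418), not 24 (0x0018)
        System.out.println((int) query.charAt(0));
    }
}
```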

Assuming that part was handled properly, we next need to look at the query parser's 
grammar for problems. Perhaps the character you wish to use is not part of the 
grammar for a token, and since the term start characters give the same error, we 
should look here too:

<*> TOKEN : {
  <#_NUM_CHAR:   ["0"-"9"] >
| <#_ESCAPED_CHAR: "\\" [ "\\", "+", "-", "!", "(", ")", ":", "^",
                          "[", "]", "\"", "{", "}", "~", "*", "?" ] >
| <#_TERM_START_CHAR: ( ~[ " ", "\t", "+", "-", "!", "(", ")", ":", "^",
                           "[", "]", "\"", "{", "}", "~", "*", "?" ]
                       | <_ESCAPED_CHAR> ) >
| <#_TERM_CHAR: ( <_TERM_START_CHAR> | <_ESCAPED_CHAR> ) >
| <#_WHITESPACE: ( " " | "\t" ) >
}

<DEFAULT, RangeIn, RangeEx> SKIP : {
  <<_WHITESPACE>>
}

<DEFAULT> TOKEN : {
  <AND:       ("AND" | "&&") >
| <OR:        ("OR" | "||") >
| <NOT:       ("NOT" | "!") >
| <PLUS:      "+" >
| <MINUS:     "-" >
| <LPAREN:    "(" >
| <RPAREN:    ")" >
| <COLON:     ":" >
| <CARAT:     "^" > : Boost
| <QUOTED:     "\"" (~["\""])+ "\"">
| <TERM:      <_TERM_START_CHAR> (<_TERM_CHAR>)*  >
| <FUZZY:     "~" >
| <SLOP:      "~" (<_NUM_CHAR>)+ >
| <PREFIXTERM:  <_TERM_START_CHAR> (<_TERM_CHAR>)* "*" >
| <WILDTERM:  <_TERM_START_CHAR>
              (<_TERM_CHAR> | ( [ "*", "?" ] ))* >
| <RANGEIN_START: "[" > : RangeIn
| <RANGEEX_START: "{" > : RangeEx
}

... there is a bit more to this, see the org.apache.lucene.queryParser.QueryParser.jj 
file for the rest of the details ...

So, TERM is a _TERM_START_CHAR followed optionally by a series of _TERM_CHARs; 
_TERM_CHAR is either a _TERM_START_CHAR or an _ESCAPED_CHAR; and _TERM_START_CHAR 
is the complement of several significant query characters, or an _ESCAPED_CHAR.
Hmm... since _TERM_START_CHAR already includes _ESCAPED_CHAR, why do we need a 
separate definition of _TERM_CHAR at all?
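On paper, that complement should admit any character outside the small reserved 
set, non-ASCII included. A quick sketch of the exclusion list from the grammar 
above (the class name is mine, just for illustration):

```java
public class TermStartCheck {
    // The characters _TERM_START_CHAR excludes, per the grammar quoted above
    // (space, tab, and the reserved query syntax characters).
    static final String EXCLUDED = " \t+-!():^[]\"{}~*?";

    static boolean isTermStartChar(char c) {
        return EXCLUDED.indexOf(c) < 0;
    }

    public static void main(String[] args) {
        System.out.println(isTermStartChar('\u0418')); // Cyrillic И: true
        System.out.println(isTermStartChar('*'));      // reserved wildcard: false
    }
}
```

So if the generated token manager rejects \u0418, it is not because the grammar as 
written excludes it.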

Assuming your string was read with the right character set, this leaves me 
wondering whether the complement operator in the JavaCC grammar did the right 
thing, or maybe it is still something with the reader. Note also that the QUOTED 
production occurs before TERM but also uses the complement operator, so you may be 
able to work around the term-start problem you were having by quoting your terms. 
Just guessing here; I'm new to JavaCC.

The QueryParser.jj file sets a few JavaCC options, and the <javacc> task in 
build.xml has the opportunity to set others but doesn't:

from QueryParser.jj:
options {
  STATIC=false;
  JAVA_UNICODE_ESCAPE=true;
  USER_CHAR_STREAM=true;
}

from build.xml:
    <javacc
      target="${src.dir}/org/apache/lucene/queryParser/QueryParser.jj"
      javacchome="${javacc.zip.dir}"
      outputdirectory="${build.src}/org/apache/lucene/queryParser"
    />

The <javacc> Ant task has an optional parameter unicodeinput 
(http://jakarta.apache.org/ant/manual/OptionalTasks/javacc.html) that I got curious 
about, so I went and read the documentation on the JavaCC options (note this is 
from the WebGain site and is not the version used by Lucene, though I'd expect 
these options to have the same behavior). 
http://www.webgain.com/products/java_cc/javaccgrm.html#prod2 states:

STATIC: This is a boolean option whose default value is true. If true, all methods and 
class variables are specified as static in the generated parser and token manager. 
This allows only one parser object to be present, but it improves the performance of 
the parser. To perform multiple parses during one run of your Java program, you will 
have to call the ReInit() method to reinitialize your parser if it is static. If the 
parser is non-static, you may use the "new" operator to construct as many parsers as 
you wish. These can all be used simultaneously from different threads. 

...

JAVA_UNICODE_ESCAPE: This is a boolean option whose default value is false. When set 
to true, the generated parser uses an input stream object that processes Java Unicode 
escapes (\u...) before sending characters to the token manager. By default, Java 
Unicode escapes are not processed. 
This option is ignored if either of options USER_TOKEN_MANAGER, USER_CHAR_STREAM is 
set to true. 

UNICODE_INPUT: This is a boolean option whose default value is false. When set to 
true, the generated parser uses an input stream object that reads Unicode files. 
By default, ASCII files are assumed. 
This option is ignored if either of options USER_TOKEN_MANAGER, USER_CHAR_STREAM is 
set to true. 

...

USER_CHAR_STREAM: This is a boolean option whose default value is false. The default 
action is to generate a character stream reader as specified by the options 
JAVA_UNICODE_ESCAPE and UNICODE_INPUT. The generated token manager receives characters 
from this stream reader. If this option is set to true, then the token manager is 
generated to read characters from any character stream reader of type 
"CharStream.java". This file is generated into the generated parser directory. 
This option is ignored if USER_TOKEN_MANAGER is set to true. 

So, the JAVA_UNICODE_ESCAPE option that is set is ignored, assuming the JavaCC 2.0 
and 2.1 behavior is the same. (On a side note, this option seems to me to imply 
that a Java Unicode escape sequence in the input would be read as a single 
character by the token manager. Is that what it means, and is that what we really 
want? It seems to me that UNICODE_INPUT is what we should be setting to true, 
though per the documentation above it too is ignored while USER_CHAR_STREAM is 
true.) And we do have USER_CHAR_STREAM set to true, which I think is where the 
FastCharStream implemented in Lucene comes into play...

  public Query parse(String query) throws ParseException, TokenMgrError {
    ReInit(new FastCharStream(new StringReader(query)));
    return Query(field);
  }
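To convince myself that this layer passes characters through untouched, here's a 
toy sketch (ToyCharStream is my own stand-in, not Lucene's FastCharStream, which 
adds buffering and token-image bookkeeping on top of this):

```java
import java.io.*;

public class ToyCharStream {
    private final Reader in;

    ToyCharStream(Reader in) { this.in = in; }

    // Like a JavaCC CharStream, we just pull chars from the Reader;
    // no charset decoding happens at this layer, so a correctly
    // decoded String survives intact.
    int readChar() throws IOException { return in.read(); }

    public static void main(String[] args) throws IOException {
        ToyCharStream cs = new ToyCharStream(new StringReader("\u0418ndex"));
        System.out.println(cs.readChar()); // 1048 (0x0418) survives
    }
}
```

In other words, by the time characters reach the token manager, any corruption 
must already have happened in the String itself.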

Nothing in there seems to point at a potential problem, so I started looking at the 
generated token manager. Here is where I got very lost... hard to follow the 
generated code since I'm not familiar with how JavaCC works in general :(

It uses a bunch of switch statements across several methods on the current 
character in order to parse out the tokens. I wasn't really able to follow it 
closely, and figured I'd stop here and wait on your response to the first part 
about transcoding. If you had handled that correctly, I was wondering if anyone 
else might shed some light on this. I'm just wondering if the UNICODE_INPUT option 
might make JavaCC's complement for _TERM_START_CHAR match the characters outside 
of ASCII in the token manager (it might be doing this already, I'm just not savvy 
enough to realize it).
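In the meantime, a quick way to test the transcoding hypothesis is to dump the 
code points of the query string right before handing it to parse(); the class and 
method names here are mine, just for the demo:

```java
public class DumpQuery {
    // Prints each char of a string as a \uXXXX escape, which makes a
    // bad decode easy to spot at a glance.
    static String codePoints(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            sb.append(String.format("\\u%04x ", (int) s.charAt(i)));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        // What the query should look like after a correct decode...
        System.out.println(codePoints("\u0418ndex"));
        // ...versus a decode that lost the high byte, as in the stack trace.
        System.out.println(codePoints("\u0018ndex"));
    }
}
```

If the dump of your real query shows \u0018 rather than \u0418, the problem is in 
the reading of the file, not in QueryParser at all.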

Anyway, that's the end of my rambling for now... even if I'm off the mark, I hope 
it was useful.

Eric
--
Eric D. Isakson        SAS Institute Inc.
Application Developer  SAS Campus Drive
XML Technologies       Cary, NC 27513
(919) 531-3639         http://www.sas.com


-----Original Message-----
From: PERRIN GOURON Olivier [mailto:olivier.perrin@;xml-ais.com]
Sent: Monday, November 04, 2002 9:34 AM
To: '[EMAIL PROTECTED]'
Subject: problem with non latin characters in the query



Hello,

I am using Lucene to index UTF-16 and UTF-8 files. Those files are trans-encoded 
to the right format so that they can be indexed with Lucene. The index is searched 
with queries read from a UTF-16 file. Everything works fine as long as my query 
file contains Latin characters (even French-specific chars such as ����oe...). 
Problems occur when the UTF-16 query file contains non-Latin characters. I have 
tried Russian characters, such as И, which is \u0418, but Lucene sends me this 
error:

        Exception in thread "main"
org.apache.lucene.queryParser.TokenMgrError: Lexical
        error at line 1, column 8.  Encountered: "\u0018" (24), after : ""
        at
org.apache.lucene.queryParser.QueryParserTokenManager.getNextToken(Unknown
Source)
        at org.apache.lucene.queryParser.QueryParser.jj_scan_token(Unknown
Source)
        at org.apache.lucene.queryParser.QueryParser.jj_3_1(Unknown Source)
        at org.apache.lucene.queryParser.QueryParser.jj_2_1(Unknown Source)
        at org.apache.lucene.queryParser.QueryParser.Clause(Unknown Source)
        at org.apache.lucene.queryParser.QueryParser.Query(Unknown Source)
        at org.apache.lucene.queryParser.QueryParser.parse(Unknown Source)
        at org.apache.lucene.queryParser.QueryParser.parse(Unknown Source)
        at org.apache.lucene.CherchLeTex.main(CherchLeTex.java:51)

It seems that the query parser doesn't use the right code for И... I have tried 
Greek, and it does the same.
Is it due to the analyzer? I don't think so, since I changed my StandardAnalyzer 
for the FrenchAnalyzer and the same behavior still reaches the query parser...

Another problem that gives exactly the same error message occurs when a word in my 
query starts with an accented character (����oe...). This is weird, since such 
characters do not trigger errors when they are in the middle of the word.

Have you ever met this problem? I would appreciate your help and advice.

Thanks for your consideration

Olivier Perrin-Gouron
AIS Berger-Levrault


--
To unsubscribe, e-mail:   <mailto:lucene-dev-unsubscribe@;jakarta.apache.org>
For additional commands, e-mail: <mailto:lucene-dev-help@;jakarta.apache.org>
