Olivier,
I'm no expert on this by any means, but I poked around in the sources this morning
trying to understand where this problem may be occurring as I'm trying to get familiar
with any internationalization problems I'm going to run into with Lucene. This message
rambles on a bit, but follows my train of thought as I looked at this problem.
It doesn't look to me like the analyzer has anything to do with it. The problem
occurs somewhere inside the lexical analysis defined by
org.apache.lucene.queryParser.QueryParser.jj, so I looked there to get a better
understanding of how that works.
To process your query, QueryParser reads it through a StringReader. So, my first
question to you is: did the query make it out of your UTF-16 query file and get
transcoded properly into a Java String (for instance, using an InputStreamReader with a
constructor that sets the encoding to UTF-16, or a String constructor where you
supply a byte[] and a charset)? Notably, the error reports \u0018, which is the low
byte of \u0418; that is just what you would see if the UTF-16 bytes were being read
one byte at a time without transcoding.
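Here is a minimal sketch of my own (not Lucene code) of transcoding UTF-16 bytes
into a Java String with the charset named explicitly; I use UTF-16BE to avoid
worrying about a byte-order mark:

```java
public class TranscodeCheck {
    public static void main(String[] args) throws Exception {
        // UTF-16BE bytes for \u0418 followed by ASCII 'a'
        byte[] utf16 = { 0x04, 0x18, 0x00, 0x61 };

        // Name the charset explicitly; a default single-byte reader would
        // see these as four separate characters and mangle them.
        String query = new String(utf16, "UTF-16BE");

        System.out.println((int) query.charAt(0)); // 1048, i.e. 0x0418
        System.out.println(query.length());        // 2
    }
}
```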
Assuming that part was handled properly, we next need to look at the query parser's
grammar for problems. Perhaps the character you wish to use is not part of the grammar
for a token, and since the start characters give the same error, we should look here
too:
<*> TOKEN : {
  <#_NUM_CHAR: ["0"-"9"] >
| <#_ESCAPED_CHAR: "\\" [ "\\", "+", "-", "!", "(", ")", ":", "^",
                          "[", "]", "\"", "{", "}", "~", "*", "?" ] >
| <#_TERM_START_CHAR: ( ~[ " ", "\t", "+", "-", "!", "(", ")", ":", "^",
                           "[", "]", "\"", "{", "}", "~", "*", "?" ]
                        | <_ESCAPED_CHAR> ) >
| <#_TERM_CHAR: ( <_TERM_START_CHAR> | <_ESCAPED_CHAR> ) >
| <#_WHITESPACE: ( " " | "\t" ) >
}
<DEFAULT, RangeIn, RangeEx> SKIP : {
<<_WHITESPACE>>
}
<DEFAULT> TOKEN : {
<AND: ("AND" | "&&") >
| <OR: ("OR" | "||") >
| <NOT: ("NOT" | "!") >
| <PLUS: "+" >
| <MINUS: "-" >
| <LPAREN: "(" >
| <RPAREN: ")" >
| <COLON: ":" >
| <CARAT: "^" > : Boost
| <QUOTED: "\"" (~["\""])+ "\"">
| <TERM: <_TERM_START_CHAR> (<_TERM_CHAR>)* >
| <FUZZY: "~" >
| <SLOP: "~" (<_NUM_CHAR>)+ >
| <PREFIXTERM: <_TERM_START_CHAR> (<_TERM_CHAR>)* "*" >
| <WILDTERM: <_TERM_START_CHAR>
(<_TERM_CHAR> | ( [ "*", "?" ] ))* >
| <RANGEIN_START: "[" > : RangeIn
| <RANGEEX_START: "{" > : RangeEx
}
... there is a bit more to this, see the org.apache.lucene.queryParser.QueryParser.jj
file for the rest of the details ...
So, TERM is a _TERM_START_CHAR followed optionally by a series of _TERM_CHARs;
_TERM_CHAR is either a _TERM_START_CHAR or an _ESCAPED_CHAR;
and _TERM_START_CHAR is the complement of several significant query characters, or an
_ESCAPED_CHAR.
Hmm... since _TERM_START_CHAR already includes _ESCAPED_CHAR, why do we need separate
definitions of _TERM_START_CHAR and _TERM_CHAR?
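Setting that question aside, a quick sanity check on the grammar itself (my own
sketch, not Lucene code): the set _TERM_START_CHAR excludes contains only ASCII
punctuation and whitespace, so a complement operator that ranges over the full
16-bit char space should accept \u0418 as a term start.

```java
public class TermStartCheck {
    public static void main(String[] args) {
        // The characters _TERM_START_CHAR excludes, per the grammar above
        String excluded = " \t+-!():^[]\"{}~*?";
        char c = '\u0418'; // Cyrillic capital I

        // If the complement covers the full char range, any character
        // outside the excluded set is a legal term start character.
        System.out.println(excluded.indexOf(c) == -1); // true
    }
}
```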
Assuming your string was read with the right character set, this leaves me wondering
whether the complement operator in the JavaCC grammar did the right thing, or maybe it
is still something with the reader. Note also that the QUOTED production occurs before
TERM but also uses the complement operator, so you may be able to work around the
term-start problem you were having by quoting your terms. Just guessing here; I'm new
to JavaCC.
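A small illustration of that workaround (my own sketch, not Lucene code): QUOTED
accepts a quote, one or more non-quote characters, and a closing quote, which can be
modeled with an equivalent java.util.regex pattern.

```java
import java.util.regex.Pattern;

public class QuoteWorkaround {
    public static void main(String[] args) {
        // A term that starts with a non-Latin character
        String term = "\u0418ndex";

        // Wrapping the term in quotes routes it through the QUOTED
        // production instead of TERM, sidestepping the start-character rule
        String quoted = "\"" + term + "\"";

        // Regex equivalent of QUOTED: "\"" (~["\""])+ "\""
        System.out.println(Pattern.matches("\"[^\"]+\"", quoted)); // true
    }
}
```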
The QueryParser.jj file sets a few javacc options and the javacc task in build.xml has
the opportunity to set others but doesn't:
from QueryParser.jj:
options {
  STATIC=false;
  JAVA_UNICODE_ESCAPE=true;
  USER_CHAR_STREAM=true;
}
from build.xml:
<javacc
  target="${src.dir}/org/apache/lucene/queryParser/QueryParser.jj"
  javacchome="${javacc.zip.dir}"
  outputdirectory="${build.src}/org/apache/lucene/queryParser"
/>
The javacc task has an optional parameter, unicodeinput
(http://jakarta.apache.org/ant/manual/OptionalTasks/javacc.html), that I got curious
about, so I went and read the doc on the JavaCC options. (Note this is from the WebGain
site and is not the version used by Lucene, though I'd expect these options to have
the same behavior.) http://www.webgain.com/products/java_cc/javaccgrm.html#prod2 states:
STATIC: This is a boolean option whose default value is true. If true, all methods and
class variables are specified as static in the generated parser and token manager.
This allows only one parser object to be present, but it improves the performance of
the parser. To perform multiple parses during one run of your Java program, you will
have to call the ReInit() method to reinitialize your parser if it is static. If the
parser is non-static, you may use the "new" operator to construct as many parsers as
you wish. These can all be used simultaneously from different threads.
...
JAVA_UNICODE_ESCAPE: This is a boolean option whose default value is false. When set
to true, the generated parser uses an input stream object that processes Java Unicode
escapes (\u...) before sending characters to the token manager. By default, Java
Unicode escapes are not processed.
This option is ignored if either of options USER_TOKEN_MANAGER, USER_CHAR_STREAM is
set to true.
UNICODE_INPUT: This is a boolean option whose default value is false. When set to
true, the generated parser uses an input stream object that reads Unicode files.
By default, ASCII files are assumed.
This option is ignored if either of options USER_TOKEN_MANAGER, USER_CHAR_STREAM is
set to true.
...
USER_CHAR_STREAM: This is a boolean option whose default value is false. The default
action is to generate a character stream reader as specified by the options
JAVA_UNICODE_ESCAPE and UNICODE_INPUT. The generated token manager receives characters
from this stream reader. If this option is set to true, then the token manager is
generated to read characters from any character stream reader of type
"CharStream.java". This file is generated into the generated parser directory.
This option is ignored if USER_TOKEN_MANAGER is set to true.
So, the JAVA_UNICODE_ESCAPE option that is set is ignored, assuming the JavaCC 2.0 and
2.1 behavior is the same. (On a side note, this option seems to imply that a Java
Unicode escape sequence in the input would be read as a single character by the token
manager. Is that what it means, and is that what we really want? It seems to me that
UNICODE_INPUT is what we should be setting to true.) And we have USER_CHAR_STREAM set
to true, which I think is where the FastCharStream implemented in Lucene comes into
play...
public Query parse(String query) throws ParseException, TokenMgrError {
  ReInit(new FastCharStream(new StringReader(query)));
  return Query(field);
}
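As a sanity check on the reader side (again my own sketch): a StringReader hands back
full 16-bit char values, so if the String was built correctly, \u0418 survives all the
way into the char stream that FastCharStream wraps.

```java
import java.io.Reader;
import java.io.StringReader;

public class ReaderCheck {
    public static void main(String[] args) throws Exception {
        Reader r = new StringReader("\u0418");
        int c = r.read();
        // StringReader returns whole chars, so we get back 0x418 here,
        // not the truncated 0x18 that appears in the error message
        System.out.println(Integer.toHexString(c)); // 418
    }
}
```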
Nothing in there seems to point at a potential problem, so I started looking at the
generated token manager. Here is where I got very lost... it is hard to follow the
generated code since I'm not familiar with how JavaCC works in general :(
It uses a bunch of switch statements across several methods on the current character
in order to parse out the tokens. I wasn't really able to follow it closely, and
figured I'd stop here and wait on your response to the first part about transcoding. If
you had done that, I was wondering if anyone else might shed some light on this. I'm
just wondering if the UNICODE_INPUT option might make JavaCC's complement for
_TERM_START_CHAR match characters outside of ASCII in the token manager (it might
be doing this already; I'm just not savvy enough to realize it).
Anyway, that's the end of my rambling for now... even if I'm off the mark, I hope it
was useful to hear.
Eric
--
Eric D. Isakson SAS Institute Inc.
Application Developer SAS Campus Drive
XML Technologies Cary, NC 27513
(919) 531-3639 http://www.sas.com
-----Original Message-----
From: PERRIN GOURON Olivier [mailto:olivier.perrin@;xml-ais.com]
Sent: Monday, November 04, 2002 9:34 AM
To: '[EMAIL PROTECTED]'
Subject: problem with non latin characters in the query
Hello,
I am using Lucene to index UTF-16 and UTF-8 files. Those files are
trans-encoded to the right format so that they can be indexed with Lucene.
The index is searched with queries made from a UTF-16 file.
Everything works fine as long as my query file contains Latin characters (even
specific French characters such as accented letters and oe ligatures...).
Problems occur when the UTF-16 query file contains non-Latin characters. I
have tried Russian characters, such as И, which is \u0418, but Lucene sends
me this error:
Exception in thread "main" org.apache.lucene.queryParser.TokenMgrError:
Lexical error at line 1, column 8.  Encountered: "\u0018" (24), after : ""
        at org.apache.lucene.queryParser.QueryParserTokenManager.getNextToken(Unknown Source)
        at org.apache.lucene.queryParser.QueryParser.jj_scan_token(Unknown Source)
        at org.apache.lucene.queryParser.QueryParser.jj_3_1(Unknown Source)
        at org.apache.lucene.queryParser.QueryParser.jj_2_1(Unknown Source)
        at org.apache.lucene.queryParser.QueryParser.Clause(Unknown Source)
        at org.apache.lucene.queryParser.QueryParser.Query(Unknown Source)
        at org.apache.lucene.queryParser.QueryParser.parse(Unknown Source)
        at org.apache.lucene.queryParser.QueryParser.parse(Unknown Source)
        at org.apache.lucene.CherchLeTex.main(CherchLeTex.java:51)
It seems that the query parser doesn't use the right code for И... I have tried
Greek, and it does the same.
Is it due to the analyzer? I don't think so, since I changed my
StandardAnalyzer for the FrenchAnalyzer and it still behaves the same. Could it
be due to the query parser?...
Another problem that gives exactly the same error message occurs when a
word in my query starts with a local character (accented letters, oe ligatures...).
This is weird, since local characters do not trigger errors when they are in the
middle of the word.
Have you ever met this problem? I would appreciate your help and advice.
Thanks for your consideration.
Olivier Perrin-Gouron
AIS Berger-Levrault
--
To unsubscribe, e-mail: <mailto:lucene-dev-unsubscribe@;jakarta.apache.org>
For additional commands, e-mail: <mailto:lucene-dev-help@;jakarta.apache.org>