DO NOT REPLY TO THIS EMAIL, BUT PLEASE POST YOUR BUG RELATED COMMENTS THROUGH THE WEB INTERFACE AVAILABLE AT <http://issues.apache.org/bugzilla/show_bug.cgi?id=28827>. ANY REPLY MADE TO THIS MESSAGE WILL NOT BE COLLECTED AND INSERTED IN THE BUG DATABASE.
http://issues.apache.org/bugzilla/show_bug.cgi?id=28827

QueryParser treats CJK and English query strings differently

           Summary: QueryParser treats CJK and English query strings differently
           Product: Lucene
           Version: unspecified
          Platform: PC
        OS/Version: Windows NT/2K
            Status: NEW
          Severity: Major
          Priority: Other
         Component: QueryParser
        AssignedTo: [EMAIL PROTECTED]
        ReportedBy: [EMAIL PROTECTED]

Since 1.3 final, the StandardAnalyzer returns each character in a string of CJK characters as a separate token. However, the generated QueryParser has its own grammar, which doesn't take this into account. So we get the following behaviour:

    parse("one two three", "content", new StandardAnalyzer())

returns 'content:one content:two content:three', searching for each term individually.

    parse("\"one two three\"", "content", new StandardAnalyzer())

returns 'content:"one two three"', searching for the phrase.

    parse("C1C2C3", "content", new StandardAnalyzer())

where Cn is a Chinese character, returns 'content:"C1 C2 C3"', when it should really be 'content:C1 content:C2 content:C3'. This is inconsistent with the English case.

    parse("\"C1C2C3\"", "content", new StandardAnalyzer())

also returns 'content:"C1 C2 C3"', identical to the previous case.

Although the string is split into separate CJK tokens (indicated by the spaces between them), the query parser builds a phrase search from them rather than individual term searches. To get the desired query, the user has to enter 'C1 C2 C3' as the query string, with spaces between the characters (or I have to pre-process the query string in my code to add the spaces), which is non-intuitive.
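
For illustration, the pre-processing workaround mentioned in the report (adding spaces to the query string before it is parsed) might look roughly like the sketch below. It uses only the Java standard library and does not call Lucene. The class name CjkQueryPreprocessor, the method spaceOutCjk, and the set of scripts treated as CJK are all assumptions made for this example, not part of Lucene. Quoted phrases are copied through unchanged; field prefixes, escaped quotes and other query syntax are not handled.

```java
/**
 * Hypothetical helper (not part of Lucene): inserts a space at every
 * letter/digit boundary that touches a CJK character, so that each CJK
 * character reaches the query parser as its own whitespace-separated term.
 */
final class CjkQueryPreprocessor {

    private CjkQueryPreprocessor() {}

    /** Assumption: these four scripts approximate what the analyzer treats as CJK. */
    static boolean isCjk(int codePoint) {
        Character.UnicodeScript script = Character.UnicodeScript.of(codePoint);
        return script == Character.UnicodeScript.HAN
                || script == Character.UnicodeScript.HIRAGANA
                || script == Character.UnicodeScript.KATAKANA
                || script == Character.UnicodeScript.HANGUL;
    }

    /** Returns the query with spaces added around unquoted CJK characters. */
    static String spaceOutCjk(String query) {
        StringBuilder out = new StringBuilder(query.length() * 2);
        boolean inQuotes = false;
        int prev = -1; // previous code point; -1 before the first one
        for (int i = 0; i < query.length(); ) {
            int cp = query.codePointAt(i);
            i += Character.charCount(cp);
            if (cp == '"') {
                inQuotes = !inQuotes; // text inside double quotes is left as typed
            }
            // A boundary needs a space when both neighbours are letters or
            // digits and at least one of them is a CJK character.
            boolean boundary = !inQuotes
                    && prev >= 0
                    && Character.isLetterOrDigit(prev)
                    && Character.isLetterOrDigit(cp)
                    && (isCjk(prev) || isCjk(cp));
            if (boundary) {
                out.append(' ');
            }
            out.appendCodePoint(cp);
            prev = cp;
        }
        return out.toString();
    }
}
```

With this sketch, a query string of three adjacent Chinese characters becomes the same three characters separated by spaces, which is the form the report says has to be entered by hand. English-only queries and quoted phrases are returned unchanged. Whether the resulting string then parses into individual term queries depends on the QueryParser version in use.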