QueryParser should ignore double-quotes if mid-word
---------------------------------------------------
Key: LUCENE-2465
URL: https://issues.apache.org/jira/browse/LUCENE-2465
Project: Lucene - Java
Issue Type: Bug
Components: QueryParser
Affects Versions: 3.0.1, 3.0, 2.9.2, 2.9.1, 2.9, 2.4.1, 2.4, 2.3.2, 2.3.1,
2.3, 2.2, 2.1, 2.0.0, 1.9, 2.3.3, 2.4.2, 2.9.3, Flex Branch, 3.0.2, 3.1, 4.0
Reporter: Itamar Syn-Hershko
The current implementation of Lucene's QueryParser starts a phrase whenever it
encounters a double-quote character, even mid-word. For example, the string
' Foo"bar test" ' produces a BooleanQuery holding one term and one
PhraseQuery ("bar test"). This behavior is flawed: a phrase is a group
of words surrounded by double quotes, as defined by
http://lucene.apache.org/java/2_4_0/queryparsersyntax.html, but nowhere does
that page say that double quotes also tokenize the input. Arguably, a phrase
should be identified as such only when it is also surrounded by whitespace.
Besides being logically incorrect, this behavior makes parsing Hebrew
acronyms impossible. Hebrew acronyms contain one double-quote character in the
middle of a word (for example, MNK"L). The QP splits such an acronym in two at
the quote and then throws a syntax exception, because it expects a second
double quote to close the phrase query.
The solution is fairly simple: change the JavaCC grammar to check that
whitespace precedes the double quote when a phrase opening is expected, and to
peek ahead to check that whitespace follows the double quote when a phrase
closing is expected.
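The proposed rule can be sketched outside the grammar as follows. This is
only an illustration in plain Java, not the JavaCC change itself: the class
MidWordQuotes and its methods are hypothetical and not part of Lucene, and
the sketch assumes that the start and end of the input count as word
boundaries in the same way whitespace does.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Lucene. It splits a query string into
// plain terms and quoted phrases, treating a double quote as a phrase
// delimiter only when it sits at a word boundary.
final class MidWordQuotes {

    // A quote opens a phrase only at the start of input or after whitespace.
    static boolean opensPhrase(String s, int i) {
        return s.charAt(i) == '"'
                && (i == 0 || Character.isWhitespace(s.charAt(i - 1)));
    }

    // A quote closes a phrase only at the end of input or before whitespace.
    static boolean closesPhrase(String s, int i) {
        return s.charAt(i) == '"'
                && (i == s.length() - 1 || Character.isWhitespace(s.charAt(i + 1)));
    }

    static List<String> split(String query) {
        List<String> out = new ArrayList<>();
        int n = query.length();
        int i = 0;
        while (i < n) {
            if (Character.isWhitespace(query.charAt(i))) {
                i++;
                continue;
            }
            int start = i;
            if (opensPhrase(query, i)) {
                // Scan for a closing quote that sits at a word boundary.
                int j = i + 1;
                while (j < n && !closesPhrase(query, j)) {
                    j++;
                }
                if (j < n) {
                    out.add(query.substring(start, j + 1)); // keep the quotes
                    i = j + 1;
                    continue;
                }
                // No closing quote found: fall through, treat as a plain term.
            }
            // Plain term: runs to the next whitespace, quotes included.
            while (i < n && !Character.isWhitespace(query.charAt(i))) {
                i++;
            }
            out.add(query.substring(start, i));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(split("MNK\"L \"bar test\""));
        System.out.println(split(" Foo\"bar test\" "));
    }
}
```

Under this rule the acronym MNK"L stays a single term, and in
' Foo"bar test" ' the mid-word quote no longer opens a phrase. The sketch
treats an unmatched quote as part of a plain term instead of raising an
error; whether the real parser should do the same is a separate decision.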
This would both eliminate a logically incorrect behavior that shouldn't be
relied on anyway and allow Hebrew queries containing acronyms to be parsed
correctly.