[
https://issues.apache.org/jira/browse/LUCENE-2465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12868000#action_12868000
]
Shai Erera commented on LUCENE-2465:
------------------------------------
Actually, I think this is a bug, or at least inconsistent behavior. QP
returns c:foo+bar for the query "foo+bar" (no quotes), when the default field
is "c". Yet for the query foo"bar it throws an exception ...
We handle such quotes properly, following rules like the ones Mike suggests.
We don't declare invalid characters, but rather "valid syntax". So a " starts
a phrase only if it follows whitespace, +/-, field:, ( etc. In all other cases,
it is just returned with the word/token. If the app does anything with it, then
good; otherwise it won't find matches, and one can decide whether that was a
mistake in the query, or whether there simply isn't such a token (like in
Hebrew, where " is permitted mid-word).
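
To make the rule concrete, here is a minimal sketch in Java. It is not the
QueryParser code and not the JavaCC grammar change; the class QuoteRule and
the method opensPhrase are hypothetical names used only for illustration, and
the "etc." in the rule above is read narrowly as exactly whitespace, +, -,
: and (, plus the start of the query.

```java
// Hypothetical helper, not part of Lucene: decides whether a double quote
// at a given position should be treated as the start of a phrase.
final class QuoteRule {
    private QuoteRule() {}

    /**
     * Returns true if the character at index i is a double quote that
     * should open a phrase, i.e. it is at the start of the query or
     * follows whitespace, '+', '-', ':' (end of "field:") or '('.
     * Any other double quote is treated as part of the surrounding token.
     */
    static boolean opensPhrase(String query, int i) {
        if (i < 0 || i >= query.length() || query.charAt(i) != '"') {
            return false;
        }
        if (i == 0) {
            return true; // nothing precedes the quote
        }
        char prev = query.charAt(i - 1);
        return Character.isWhitespace(prev)
                || prev == '+' || prev == '-'
                || prev == ':' || prev == '(';
    }
}
```

With this check, the quote in foo"bar or MNK"L stays part of the token, while
the quote in c:"bar test" opens a phrase. The sketch only covers the opening
decision; recognizing the closing quote would need a matching rule.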
Also, GERSHAYIM is simply not a valid argument - users cannot type Unicode,
they type text. It's like asking someone to distinguish between ' and ` - they
are visually the same ... and if a character cannot be typed ...
So I think this should be fixed, and the above test case be added to QP.
> QueryParser should ignore double-quotes if mid-word
> ---------------------------------------------------
>
> Key: LUCENE-2465
> URL: https://issues.apache.org/jira/browse/LUCENE-2465
> Project: Lucene - Java
> Issue Type: Bug
> Components: QueryParser
> Affects Versions: 1.9, 2.0.0, 2.1, 2.2, 2.3, 2.3.1, 2.3.2, 2.3.3, 2.4,
> 2.4.1, 2.4.2, 2.9, 2.9.1, 2.9.2, 2.9.3, 3.0, Flex Branch, 3.0.1, 3.0.2, 3.1,
> 4.0
> Reporter: Itamar Syn-Hershko
>
> The current implementation of Lucene's QueryParser identifies a phrase in
> the query when it hits a double-quote character, even if it is mid-word. For
> example, the string ' Foo"bar test" ' will produce a BooleanQuery, holding
> one term and one PhraseQuery ("bar test"). This behavior is somewhat flawed;
> a Phrase is a group of words surrounded by double quotes as defined by
> http://lucene.apache.org/java/2_4_0/queryparsersyntax.html, but nowhere does
> it say double quotes will also tokenize the input. Arguably, a phrase should
> only be identified as such when it is also surrounded by whitespace.
> Besides being logically incorrect, this behavior makes parsing of Hebrew
> acronyms impossible. Hebrew acronyms contain one double-quote character in
> the middle of a word (for example, MNK"L), which causes the QP to throw a
> syntax exception, since it expects another double quote to complete a phrase
> query, essentially splitting the acronym into two.
> The solution to this is pretty simple - changing the JavaCC syntax to check
> whether whitespace precedes the double quote when a phrase opening is
> expected, or to peek and see whether whitespace follows the double quote
> when a phrase closing is expected.
> This will both eliminate a logically incorrect behavior, which shouldn't be
> relied on anyway, and allow Hebrew queries to be parsed correctly even when
> they contain acronyms.