[
https://issues.apache.org/jira/browse/LUCENE-2465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12868017#action_12868017
]
Robert Muir commented on LUCENE-2465:
-------------------------------------
{quote}
It's not that I disagree with what you say Robert, but I think we're arguing
two different things. Please correct me if I'm wrong, but ঃ does not denote a
field:value delimiter by the QueryParser, right? I've tried the following query
"fooঃbar" and it was still parsed to c:fooঃbar.
{quote}
Yes, you are wrong: the problem is that the colon : is often substituted for
this character. So must we change the queryparser syntax to try to disambiguate
when : is really visarga versus when : is a field-name delimiter? No, we
shouldn't, just as we shouldn't change the query parser to try to disambiguate
when " is really gershayim.
It's not just Hebrew and Bengali either; the problem exists in other languages.
If you try, you can probably find some natural use of a queryparser operator in
some language. It's just an example to show that the problem is not unique to
Hebrew, and that the disambiguation/charset conversion doesn't belong in the
queryparser, but instead is up to you.
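As a rough illustration of doing that disambiguation on the application side,
here is a minimal sketch that backslash-escapes a double quote only when it
sits between two letters or digits, before the string is handed to the
QueryParser. The class and method names are hypothetical and not part of
Lucene; the sketch assumes the backslash escaping described in the query
parser syntax documentation, and it treats "between two letters or digits" as
a heuristic for "mid-word", which your application may need to refine.

```java
public final class MidWordQuoteEscaper {

    private MidWordQuoteEscaper() {
    }

    /**
     * Returns the query with a backslash inserted before every double quote
     * that has a letter or digit immediately before and after it.
     * Quotes at word boundaries (phrase delimiters) are left untouched.
     * Hypothetical helper: the mid-word test is an assumption, not Lucene's.
     */
    public static String escapeMidWordQuotes(String query) {
        StringBuilder out = new StringBuilder(query.length() + 4);
        for (int i = 0; i < query.length(); i++) {
            char c = query.charAt(i);
            boolean midWord = c == '"'
                    && i > 0
                    && i + 1 < query.length()
                    && Character.isLetterOrDigit(query.charAt(i - 1))
                    && Character.isLetterOrDigit(query.charAt(i + 1));
            if (midWord) {
                out.append('\\'); // escape so the parser sees a literal quote
            }
            out.append(c);
        }
        return out.toString();
    }
}
```

With this, MNK"L becomes MNK\"L, while "bar test" is passed through unchanged.
An ambiguous input such as Foo"bar test" would have its first quote escaped and
be left with an unbalanced closing quote, so an application has to decide for
itself how to treat that case.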
> QueryParser should ignore double-quotes if mid-word
> ---------------------------------------------------
>
> Key: LUCENE-2465
> URL: https://issues.apache.org/jira/browse/LUCENE-2465
> Project: Lucene - Java
> Issue Type: Bug
> Components: QueryParser
> Affects Versions: 1.9, 2.0.0, 2.1, 2.2, 2.3, 2.3.1, 2.3.2, 2.3.3, 2.4,
> 2.4.1, 2.4.2, 2.9, 2.9.1, 2.9.2, 2.9.3, 3.0, Flex Branch, 3.0.1, 3.0.2, 3.1,
> 4.0
> Reporter: Itamar Syn-Hershko
>
> The current implementation of Lucene's QueryParser identifies a phrase in the
> query when it hits a double-quote character, even mid-word. For example,
> the string ' Foo"bar test" ' will produce a BooleanQuery holding one term
> and one PhraseQuery ("bar test"). This behavior is somewhat flawed; a phrase
> is a group of words surrounded by double quotes, as defined by
> http://lucene.apache.org/java/2_4_0/queryparsersyntax.html, but nowhere does
> it say that double quotes will also tokenize the input. Arguably, a phrase
> should only be identified as such when it is also surrounded by whitespace.
> Besides being logically incorrect, this behavior makes parsing Hebrew
> acronyms impossible. Hebrew acronyms contain one double-quote character in the
> middle of a word (for example, MNK"L), which causes the QP to throw a syntax
> exception: it expects a second double quote to close a phrase query, and so
> essentially splits the acronym in two.
> The solution to this is pretty simple: change the JavaCC syntax to check
> whether whitespace precedes the double quote when a phrase opening is
> expected, or peek to see whether whitespace follows the double quote when a
> phrase closing is expected.
> This will both eliminate a logically incorrect behavior that shouldn't be
> relied on anyway, and allow Hebrew queries containing acronyms to be parsed
> correctly.