[jira] Commented: (LUCENE-2465) QueryParser should ignore double-quotes if mid-word

Itamar Syn-Hershko (JIRA) Mon, 17 May 2010 05:02:12 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-2465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12868183#action_12868183
 ]


Itamar Syn-Hershko commented on LUCENE-2465:
--------------------------------------------

bq. This is why i say, the only solution is to follow unicode. Adding hacks 
like this will only break other languages.

The problem is that Hebrew parsing has been broken for a long time and still 
needs fixing. I don't think you should force extra pre-handling on Hebrew 
or Bengali (or other) queries just to keep CJK parsing working out of the box. 
Having the caller escape those cases is a much more complex operation than the 
normal escaping you'd do on your queries.
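
To illustrate why this is more involved than a normal escape, here is a minimal 
sketch of the kind of pre-pass a caller would have to run before handing a 
query to the QP. The class and method names are hypothetical and nothing here 
is part of Lucene; it only relies on the documented backslash escape for 
special characters. The "letter or digit on both sides" test is an assumption 
about what counts as mid-word.

```java
// Hypothetical caller-side pre-pass; not part of Lucene.
final class MidWordQuoteEscaper {
    private MidWordQuoteEscaper() {}

    /** Backslash-escapes each double quote that has a letter or digit on both sides. */
    static String escapeMidWordQuotes(String query) {
        StringBuilder out = new StringBuilder(query.length() + 4);
        for (int i = 0; i < query.length(); i++) {
            char c = query.charAt(i);
            // Unlike a normal escape, the decision depends on both neighbours.
            boolean midWord = c == '"'
                    && i > 0
                    && i + 1 < query.length()
                    && Character.isLetterOrDigit(query.charAt(i - 1))
                    && Character.isLetterOrDigit(query.charAt(i + 1));
            if (midWord) {
                out.append('\\');
            }
            out.append(c);
        }
        return out.toString();
    }
}
```

A quote that opens or closes a whitespace-delimited phrase is left alone, while 
the quote in MNK"L gets escaped. The same heuristic would also escape a phrase 
that opens directly after a word with no whitespace in between, which is the 
CJK objection, so the caller would additionally need to know the language of 
the query.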

For languages where a colon is used as a character, if the use case is indeed 
the same as mid-word gershayim (i.e. there is no key for that letter and it is 
more of a letter than a punctuation char), the issue with the QP is the same.

If the solution I initially proposed did not cause other issues with CJK 
phrases, I'd insist on it. You are obviously right that this change would 
break functionality for those languages, but you are wrong to claim that it is 
not up to the query parser to resolve. As Shai has already pointed out, the QP 
should parse based on syntax, with the smallest hassle to the consumer.

Obviously, a solution has to be provided, and it is agreed that it should not 
affect the variety of supported languages. How about creating this 
functionality and leaving it optional? For CJK you'd leave it off, while for 
all other languages (English and European) you could turn it on and, in the 
worst case, notice no difference.

Or you could make this setting accessible from your Analyzer. Analyzers define 
the core's behavior per language, so it would make sense for the QP to check 
with the analyzer which cases are a syntax error and which aren't.
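
As a rough sketch of what such an optional, per-language switch could look 
like: the interface and class names below are invented for illustration, and 
Lucene's Analyzer exposes no such hook today. The idea is only that the QP asks 
a policy object whether a given double quote is a phrase delimiter, instead of 
hard-coding the answer.

```java
// Everything here is hypothetical; Lucene's Analyzer has no such method.
interface QuoteSyntaxPolicy {
    /** True if a double quote between two word characters belongs to the word. */
    boolean midWordQuoteIsLiteral();
}

final class QuotePolicies {
    private QuotePolicies() {}

    /** Existing behaviour: every double quote delimits a phrase (keep for CJK). */
    static final QuoteSyntaxPolicy STRICT = () -> false;

    /** Opt-in behaviour: a mid-word double quote is an ordinary character. */
    static final QuoteSyntaxPolicy LENIENT = () -> true;

    /** Decides whether the character at index i should start or end a phrase. */
    static boolean isPhraseDelimiter(QuoteSyntaxPolicy policy, String q, int i) {
        if (q.charAt(i) != '"') {
            return false;
        }
        if (!policy.midWordQuoteIsLiteral()) {
            return true;
        }
        boolean wordBefore = i > 0 && Character.isLetterOrDigit(q.charAt(i - 1));
        boolean wordAfter = i + 1 < q.length() && Character.isLetterOrDigit(q.charAt(i + 1));
        return !(wordBefore && wordAfter);
    }
}
```

With STRICT the quote in MNK"L is still treated as a phrase delimiter, as it is 
now; with LENIENT it is not, while quotes at the edges of a phrase keep their 
meaning under both policies.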

> QueryParser should ignore double-quotes if mid-word
> ---------------------------------------------------
>
>                 Key: LUCENE-2465
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2465
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: QueryParser
>    Affects Versions: 1.9, 2.0.0, 2.1, 2.2, 2.3, 2.3.1, 2.3.2, 2.3.3, 2.4, 
> 2.4.1, 2.4.2, 2.9, 2.9.1, 2.9.2, 2.9.3, 3.0, Flex Branch, 3.0.1, 3.0.2, 3.1, 
> 4.0
>            Reporter: Itamar Syn-Hershko
>
> Current implementation of Lucene's QueryParser identifies a phrase in the 
> query when hitting a double-quotes char, even if it is mid-word. For example, 
> the string ' Foo"bar test" ' will produce a BooleanQuery, holding one term 
> and one PhraseQuery ("bar test"). This behavior is somewhat flawed; a Phrase 
> is a group of words surrounded by double quotes as defined by 
> http://lucene.apache.org/java/2_4_0/queryparsersyntax.html, but nowhere does 
> it say that double quotes also tokenize the input. Arguably, a phrase should 
> only be identified as such when it is also surrounded by whitespace.
> Besides being logically incorrect, this makes parsing Hebrew acronyms 
> impossible. Hebrew acronyms contain one double-quote char in the middle of a 
> word (for example, MNK"L), which causes the QP to throw a syntax exception: 
> it expects a second double quote to close a phrase query, essentially 
> splitting the acronym in two.
> The solution to this is pretty simple - changing the JavaCC syntax to check 
> if a whitespace precedes the double-quote when a phrase opening is expected, 
> or peek to see if a whitespace follows the double-quotes if a phrase closing 
> is expected.
> This will both eliminate a logically incorrect behavior which shouldn't be 
> relied on anyway, and allow Hebrew queries to be correctly parsed also when 
> containing acronyms.
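
The whitespace rule from the issue description quoted above can be sketched 
outside JavaCC as a plain Java splitter. This is only an illustration of the 
rule, not the actual grammar change; the class and method names are made up, 
and operators, fields and escapes are ignored entirely.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: mimics the proposed rule, not Lucene's real grammar.
final class WhitespaceAwarePhraseSplitter {
    private WhitespaceAwarePhraseSplitter() {}

    /** A quote opens a phrase only at the start of input or after whitespace. */
    static boolean opensPhrase(String q, int i) {
        return q.charAt(i) == '"'
                && (i == 0 || Character.isWhitespace(q.charAt(i - 1)));
    }

    /** A quote closes a phrase only at the end of input or before whitespace. */
    static boolean closesPhrase(String q, int i) {
        return q.charAt(i) == '"'
                && (i + 1 == q.length() || Character.isWhitespace(q.charAt(i + 1)));
    }

    /** Splits a query into plain terms and quoted phrases (quotes kept). */
    static List<String> split(String q) {
        List<String> clauses = new ArrayList<>();
        int i = 0;
        while (i < q.length()) {
            if (Character.isWhitespace(q.charAt(i))) {
                i++;
                continue;
            }
            int start = i;
            if (opensPhrase(q, i)) {
                int j = i + 1;
                while (j < q.length() && !closesPhrase(q, j)) {
                    j++;
                }
                if (j < q.length()) {
                    // Found a valid closing quote: emit the whole phrase.
                    clauses.add(q.substring(start, j + 1));
                    i = j + 1;
                    continue;
                }
                // No closing quote: fall through and read it as a plain term.
            }
            while (i < q.length() && !Character.isWhitespace(q.charAt(i))) {
                i++;
            }
            clauses.add(q.substring(start, i));
        }
        return clauses;
    }
}
```

Under this rule the string ' Foo"bar test" ' splits into the two plain terms 
Foo"bar and test", and MNK"L stays a single term, while a phrase surrounded by 
whitespace is still recognized as a phrase.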

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


