[jira] Commented: (LUCENE-2465) QueryParser should ignore double-quotes if mid-word

Shai Erera (JIRA) Sun, 16 May 2010 10:32:04 -0700

    [ https://issues.apache.org/jira/browse/LUCENE-2465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12868013#action_12868013 ]


Shai Erera commented on LUCENE-2465:
------------------------------------

It's not that I disagree with what you say, Robert, but I think we're arguing 
two different things. Please correct me if I'm wrong, but ঃ is not treated as a 
field:value delimiter by the QueryParser, right? I tried the query "fooঃbar" 
and it was still parsed to c:fooঃbar.

So my point is - the query syntax declares some characters that, if present in 
the query string, are parsed specially, but only when they appear in a certain 
position (e.g. foo+bar does not tokenize on '+'). And we don't declare a 
Unicode standard for the query syntax - only \u0022 is considered a quote for a 
phrase, and not all the other double-quote forms, like Gershayim.

Therefore, I don't think the examples of Chinese et al. are relevant because it 
is not the job of the QP to parse them properly, but the job of the Analyzer. 
The QP needs to tokenize the terms properly, and needs to build the query tree 
properly. For that, it declares several characters, ' ' (space) is one of them, 
and if the user wants to write proper queries, he should use them. The QP 
itself does not follow any Unicode standard, right?

A Hebrew example of the quotes problem (I write it in English 'cause you don't 
read Hebrew ... yet :)) is US"A, the acronym for the United States of America. 
That's a valid word. The user, following the current guidelines, should write 
it as US\"A if he wants the quote to be retained, and the question is whether 
we can do something to relax that requirement. The same would follow for any 
other reserved query syntax character that is used outside its proper 
syntactic position ...
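
To make the escaping requirement above concrete, here is a minimal sketch of a 
client-side workaround that adds the backslash for the user before the string 
reaches the QueryParser. The class MidWordQuoteEscaper and its method are 
hypothetical names, not part of Lucene, and treating a quote as "mid-word" only 
when it has a letter or digit on both sides is an assumption of this sketch, 
not something the query syntax defines.

```java
// Hypothetical helper, not part of Lucene: escapes double quotes that sit
// between two letters/digits so that acronyms such as US"A survive parsing.
final class MidWordQuoteEscaper {
    private MidWordQuoteEscaper() {}

    /** Returns the query with a backslash inserted before each mid-word quote. */
    static String escapeMidWordQuotes(String query) {
        StringBuilder out = new StringBuilder(query.length() + 4);
        for (int i = 0; i < query.length(); i++) {
            char c = query.charAt(i);
            if (c == '"' && isMidWord(query, i)) {
                out.append('\\');
            }
            out.append(c);
        }
        return out.toString();
    }

    // Assumption: "mid-word" means a letter or digit on both sides. A quote
    // after ':' or '\' or before '^' or '~' is therefore left untouched.
    private static boolean isMidWord(String s, int i) {
        if (i == 0 || i == s.length() - 1) {
            return false;
        }
        return Character.isLetterOrDigit(s.charAt(i - 1))
                && Character.isLetterOrDigit(s.charAt(i + 1));
    }
}
```

With this heuristic, US"A becomes US\"A, while a regular phrase such as 
"bar test" or a fielded phrase such as title:"bar test" is passed through 
unchanged. An input like "foo"bar would also get its middle quote escaped, 
which may or may not be what the user meant.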

I do agree though that such a change is problematic for backwards 
compatibility, b/c previously failing queries may suddenly succeed. 
Specifically, the parser won't throw ParseException if the user makes a 
mistake, in languages in which " is not a valid mid-word character. But I also 
feel that the current behavior is wrong ... or at least too restrictive. Also, 
the reserved characters behave inconsistently. BTW, FWIW "foo:" (w/ : and no 
value) also throws ParseException ...

> QueryParser should ignore double-quotes if mid-word
> ---------------------------------------------------
>
>                 Key: LUCENE-2465
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2465
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: QueryParser
>    Affects Versions: 1.9, 2.0.0, 2.1, 2.2, 2.3, 2.3.1, 2.3.2, 2.3.3, 2.4, 
> 2.4.1, 2.4.2, 2.9, 2.9.1, 2.9.2, 2.9.3, 3.0, Flex Branch, 3.0.1, 3.0.2, 3.1, 
> 4.0
>            Reporter: Itamar Syn-Hershko
>
> Current implementation of Lucene's QueryParser identifies a phrase in the 
> query when hitting a double-quotes char, even if it is mid-word. For example, 
> the string ' Foo"bar test" ' will produce a BooleanQuery, holding one term 
> and one PhraseQuery ("bar test"). This behavior is somewhat flawed; a Phrase 
> is a group of words surrounded by double quotes as defined by 
> http://lucene.apache.org/java/2_4_0/queryparsersyntax.html, but nowhere does 
> it say double-quotes will also tokenize the input. Arguably, a phrase should 
> only be identified as such when it is also surrounded by whitespace.
> Other than a logically incorrect behavior, this makes parsing of Hebrew 
> acronyms impossible. Hebrew acronyms contain one double-quotes char in the 
> middle of a word (for example, MNK"L), hence causing the QP to throw a syntax 
> exception, since it is expecting another double-quotes to create a phrase 
> query, essentially splitting the acronym into two.
> The solution to this is pretty simple - changing the JavaCC syntax to check 
> if a whitespace precedes the double-quote when a phrase opening is expected, 
> or peek to see if a whitespace follows the double-quotes if a phrase closing 
> is expected.
> This will both eliminate a logically incorrect behavior which shouldn't be 
> relied on anyway, and allow Hebrew queries to be correctly parsed also when 
> containing acronyms.
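
The whitespace rule proposed in the description can be illustrated outside the 
parser. The following is only a sketch under stated assumptions: it is a 
plain-Java splitter, not the JavaCC grammar change itself, and the class name 
WhitespaceAwarePhraseSplitter is hypothetical. A quote opens a phrase only at 
the start of the input or after whitespace, and closes it only when followed by 
whitespace or the end of the input. Escapes, fields, and operators are ignored.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of the proposed rule; not Lucene code.
final class WhitespaceAwarePhraseSplitter {
    private WhitespaceAwarePhraseSplitter() {}

    /** Splits a query into plain terms and quoted phrases (quotes retained). */
    static List<String> split(String query) {
        List<String> tokens = new ArrayList<>();
        int n = query.length();
        int i = 0;
        while (i < n) {
            char c = query.charAt(i);
            if (Character.isWhitespace(c)) {
                i++;
                continue;
            }
            // A token always starts at index 0 or right after whitespace,
            // so a quote seen here is a legal phrase opener.
            if (c == '"') {
                int close = findClosingQuote(query, i + 1);
                if (close >= 0) {
                    tokens.add(query.substring(i, close + 1));
                    i = close + 1;
                    continue;
                }
            }
            // Otherwise read a plain term; quotes inside it are kept as-is.
            int start = i;
            while (i < n && !Character.isWhitespace(query.charAt(i))) {
                i++;
            }
            tokens.add(query.substring(start, i));
        }
        return tokens;
    }

    // A closing quote must be followed by whitespace or the end of input.
    private static int findClosingQuote(String s, int from) {
        for (int j = from; j < s.length(); j++) {
            if (s.charAt(j) == '"'
                    && (j + 1 == s.length() || Character.isWhitespace(s.charAt(j + 1)))) {
                return j;
            }
        }
        return -1;
    }
}
```

Under this rule, MNK"L "bar test" splits into the term MNK"L and the phrase 
"bar test", while Foo"bar test" splits into two plain terms, Foo"bar and 
test", rather than a term plus a phrase. That second case is exactly the kind 
of behavior change that affects backwards compatibility.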

