[ 
https://issues.apache.org/jira/browse/LUCENE-2465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12868001#action_12868001
 ] 

Robert Muir commented on LUCENE-2465:
-------------------------------------

bq. Also, GERSHAYIM is simply not a valid argument - users cannot type Unicode, 
they type text.

I am suggesting we follow the rules of Unicode, for a few reasons.
# This is not unique to the Hebrew gershayim. The same problem is found in 
numerous other languages, where query parser syntax overlaps with "incorrect 
Unicode" text in those languages. I have this same issue with the conflation 
of : and Bengali ঃ, and in some other charsets there is only one glyph for both.
# A heuristic that does not obey the rules of Unicode may seem perfectly 
harmless, but it risks breaking other languages, much like what happens to 
Chinese text today.
# Disambiguating when a " should be a gershayim is really app-dependent, just 
like disambiguating when : should be ঃ. It's a subproblem of character set 
conversion (which is not always lossless and exact), and charset conversion 
doesn't belong in the query parser.
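
To make the last point concrete, here is a minimal sketch of what such an app-side conversion could look like, run before the text ever reaches the query parser. The class and method names are hypothetical and are not part of Lucene, and the rule itself (an ASCII double quote directly between two Hebrew letters becomes U+05F4) is only an assumption about what one particular application might want.

```java
public final class GershayimNormalizer {
    private static final char ASCII_QUOTE = '"';     // U+0022 QUOTATION MARK
    private static final char GERSHAYIM = '\u05F4';  // HEBREW PUNCTUATION GERSHAYIM

    private GershayimNormalizer() {}

    // True for letters whose Unicode script is Hebrew (all in the BMP).
    private static boolean isHebrewLetter(char c) {
        return Character.isLetter(c)
                && Character.UnicodeScript.of(c) == Character.UnicodeScript.HEBREW;
    }

    /**
     * Hypothetical app-level rule: an ASCII double quote that sits directly
     * between two Hebrew letters is rewritten to U+05F4. All other quotes,
     * including those at the start or end of the input, are left untouched.
     */
    public static String normalize(String userInput) {
        StringBuilder out = new StringBuilder(userInput);
        for (int i = 1; i < out.length() - 1; i++) {
            if (out.charAt(i) == ASCII_QUOTE
                    && isHebrewLetter(out.charAt(i - 1))
                    && isHebrewLetter(out.charAt(i + 1))) {
                out.setCharAt(i, GERSHAYIM);
            }
        }
        return out.toString();
    }
}
```

An application would call `GershayimNormalizer.normalize(input)` and hand the result to the query parser. The same rule also rewrites the opening quote of a Hebrew phrase typed without a preceding space, which is why the choice belongs to the application, which knows its users and input methods, and not to the parser.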

So, adding some of the heuristics I see here will change phrase queries, for 
example, for languages that don't use spaces between words, like Thai. Basing 
it on Unicode properties is very risky; ultimately it will probably break 
some language, because words aren't just sequences of letters separated by 
whitespace in all languages.
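
As a rough illustration, the whitespace rule proposed in the issue description can be sketched in a few lines. This is not the JavaCC grammar and not Lucene code; the class and method are hypothetical and only model the "a phrase opens only after whitespace" check.

```java
public final class WhitespaceQuoteRule {
    private WhitespaceQuoteRule() {}

    /**
     * Hypothetical version of the proposed rule: the double quote at
     * quoteIndex opens a phrase only if it is the first character of the
     * query or is directly preceded by whitespace.
     */
    public static boolean opensPhrase(String query, int quoteIndex) {
        if (query.charAt(quoteIndex) != '"') {
            return false;
        }
        return quoteIndex == 0
                || Character.isWhitespace(query.charAt(quoteIndex - 1));
    }
}
```

Under this rule the quote in `MNK"L` no longer opens a phrase, which is the intended effect. But in a query such as `ภาษา"ไทย"`, written without a space before the quote, the phrase stops being recognized as well, so existing phrase queries in that style would silently change meaning.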

Furthermore, by following Unicode, we keep QP simpler, and it won't 
unintentionally or unknowingly break for any existing or future languages (such 
as ones not even in Unicode yet).


> QueryParser should ignore double-quotes if mid-word
> ---------------------------------------------------
>
>                 Key: LUCENE-2465
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2465
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: QueryParser
>    Affects Versions: 1.9, 2.0.0, 2.1, 2.2, 2.3, 2.3.1, 2.3.2, 2.3.3, 2.4, 
> 2.4.1, 2.4.2, 2.9, 2.9.1, 2.9.2, 2.9.3, 3.0, Flex Branch, 3.0.1, 3.0.2, 3.1, 
> 4.0
>            Reporter: Itamar Syn-Hershko
>
> Current implementation of Lucene's QueryParser identifies a phrase in the 
> query when hitting a double-quotes char, even if it is mid-word. For example, 
> the string ' Foo"bar test" ' will produce a BooleanQuery, holding one term 
> and one PhraseQuery ("bar test"). This behavior is somewhat flawed; a Phrase 
> is a group of words surrounded by double quotes as defined by 
> http://lucene.apache.org/java/2_4_0/queryparsersyntax.html, but nowhere does 
> it say double-quotes will also tokenize the input. Arguably, a phrase should 
> only be identified as such when it is also surrounded by whitespace.
> Other than a logically incorrect behavior, this makes parsing of Hebrew 
> acronyms impossible. Hebrew acronyms contain one double-quotes char in the 
> middle of a word (for example, MNK"L), hence causing the QP to throw a syntax 
> exception, since it is expecting another double-quotes to create a phrase 
> query, essentially splitting the acronym into two.
> The solution to this is pretty simple - changing the JavaCC syntax to check 
> if a whitespace precedes the double-quote when a phrase opening is expected, 
> or peek to see if a whitespace follows the double-quotes if a phrase closing 
> is expected.
> This will both eliminate a logically incorrect behavior which shouldn't be 
> relied on anyway, and allow Hebrew queries to be correctly parsed also when 
> containing acronyms.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

