[ 
https://issues.apache.org/jira/browse/LUCENE-2465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12868006#action_12868006
 ] 

Robert Muir commented on LUCENE-2465:
-------------------------------------

Also, by the way there are foldings and processing in contrib/ICU to help with 
this use case specifically.

# ICUTokenizer contains a tailored Hebrew grammar, that *only* applies to 
Hebrew script. It adds the double quote " to MidLetter and the single quote ' 
to Extend only for Hebrew context. Of course, geresh and gershayim work the 
same way.
# ICUFoldingFilter then contains logic to cause geresh and ', gershayim and " 
to conflate.

So, I recommend this approach:

All you have to do is disambiguate at query-time, if the double quote should be 
a phrase query, or should be part of an acronym, by mapping it to the correct 
unicode character (gershayim) for the latter case. You can use a regex or 
dictionary-based approach, or user-feedback or whatever approach you feel is 
best.

At index and query-time, if you use these components (or use similar 
mappings/handling in your own tokenstreams), all forms will conflate to the 
same terms and match during your search. You are just disambiguating to the 
correct unicode characters to cause the parse that you want.



> QueryParser should ignore double-quotes if mid-word
> ---------------------------------------------------
>
>                 Key: LUCENE-2465
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2465
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: QueryParser
>    Affects Versions: 1.9, 2.0.0, 2.1, 2.2, 2.3, 2.3.1, 2.3.2, 2.3.3, 2.4, 
> 2.4.1, 2.4.2, 2.9, 2.9.1, 2.9.2, 2.9.3, 3.0, Flex Branch, 3.0.1, 3.0.2, 3.1, 
> 4.0
>            Reporter: Itamar Syn-Hershko
>
> Current implementation of Lucene's QueryParser identifies a phrase in the 
> query when hitting a double-quotes char, even if it is mid-word. For example, 
> the string ' Foo"bar test" ' will produce a BooleanQuery, holding one term 
> and one PhraseQuery ("bar test"). This behavior is somewhat flawed; a Phrase 
> is a group of words surrounded by double quotes as defined by 
> http://lucene.apache.org/java/2_4_0/queryparsersyntax.html, but no-where does 
> it say double-quotes will also tokenize the input. Arguably, a phrase should 
> only be identified as such when it is also surrounded by whitespaces.
> Other than a logically incorrect behavior, this makes parsing of Hebrew 
> acronyms impossible. Hebrew acronyms contain one double-quotes char in the 
> middle of a word (for example, MNK"L), hence causing the QP to throw a syntax 
> exception, since it is expecting another double-quotes to create a phrase 
> query, essentially splitting the acronym into two.
> The solution to this is pretty simple - changing the JavaCC syntax to check 
> if a whitespace precedes the double-quote when a phrase opening is expected, 
> or peek to see if a whitespace follows the double-quotes if a phrase closing 
> is expected.
> This will both eliminate a logically incorrect behavior which shouldn't be 
> relied on anyway, and allow Hebrew queries to be correctly parsed also when 
> containing acronyms.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to