[ https://issues.apache.org/jira/browse/SOLR-7154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14335381#comment-14335381 ]
Toke Eskildsen commented on SOLR-7154:
--------------------------------------
To quote Steve Jobs: "You are holding it wrong": Your Beyoncé (plain e
followed by U+0301) should be Beyoncé (the single character U+00E9).
The difference between the two e's is that the first one has the accent added
with the Unicode "Combining acute accent"
(http://www.fileformat.info/info/unicode/char/0301/index.htm), while the second
one is a "Latin small letter e with acute"
(http://www.fileformat.info/info/unicode/char/e9/index.htm).
A proper normalizer would convert both forms of é into the same character, but
you are using a raw string field (solr.StrField), which is indexed without any
analysis, so you do not have that luxury. If you use a text field, you can
avoid this by normalising into letters with built-in diacritics (as opposed to
letters followed by combining diacritics). Unfortunately that does not work
well if the user query contains a truncation (wildcard) with combining
diacritics, as truncated queries are not normalized (which I think they should
be, but that is a matter for another JIRA).
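
To illustrate the normalisation step outside Solr, here is a minimal Python
sketch. It assumes normalising to NFC (letters with built-in diacritics) is the
goal; the variable names and the helper to_nfc are made up for this example and
are not part of Solr.

```python
import unicodedata

# Both strings render the same but are built from different code points.
decomposed = "beyonce\u0301"   # plain 'e' followed by U+0301 COMBINING ACUTE ACCENT
precomposed = "beyonc\u00e9"   # U+00E9 LATIN SMALL LETTER E WITH ACUTE


def to_nfc(text: str) -> str:
    """Compose base letters and combining marks into single characters (NFC)."""
    return unicodedata.normalize("NFC", text)


print(decomposed == precomposed)                  # False: raw strings differ
print(to_nfc(decomposed) == to_nfc(precomposed))  # True: identical after NFC
print(len(decomposed), len(precomposed))          # 8 7
```

Applying the same normalisation to both the indexed value and the query string
before they reach a string field would sidestep the mismatch, at the cost of
doing the work in the client.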
> Wildcard query matches special characters
> -----------------------------------------
>
> Key: SOLR-7154
> URL: https://issues.apache.org/jira/browse/SOLR-7154
> Project: Solr
> Issue Type: Bug
> Reporter: Arun Rangarajan
> Priority: Minor
>
> I have a string field raw_name defined like this:
> {code}
> <fieldType name="string" class="solr.StrField" sortMissingLast="true"
> omitNorms="true"/>
> ...
> <field name="raw_name" type="string" indexed="true" stored="true" />
> {code}
> I have a document like this:
> {code}
> {raw_name: beyoncé}
> {code}
> Notice that the last character is a special character (accented e).
> When I issue this wildcard query:
> {code}
> q=raw_name:beyonce*
> {code}
> i.e. with the last character simply being the ASCII 'e', Solr returns me the
> above document.
> Exact query:
> {code}
> /select?q=raw_name:beyonce*&wt=json&fl=raw_name
> {code}
> Response:
> {code}
> {
> "responseHeader": {
> "status": 0,
> "QTime": 0,
> "params": {
> "fl": "raw_name",
> "q": "raw_name:beyonce*",
> "wt": "json"
> }
> },
> "response": {
> "numFound": 2,
> "start": 0,
> "docs": [
> {
> "raw_name": "beyoncé"
> },
> {
> "raw_name": "beyoncé"
> }
> ]
> }
> }
> {code}
> I used the analysis tool in Solr admin (with Jetty). The raw bytes look like
> this:
> Raw bytes for beyonce: [62 65 79 6f 6e 63 65]
> Raw bytes for beyoncé: [62 65 79 6f 6e 63 65 cc 81]
> So when you look at the bytes, it seems to explain why beyonce* might match
> beyoncé.
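
The byte-level observation above can be reproduced with a short Python sketch.
It only assumes that the terms are compared as UTF-8 bytes; the names used here
are hypothetical and chosen for the example.

```python
# UTF-8 encodings of the query prefix and the two spellings of the name.
query_prefix = "beyonce".encode("utf-8")
decomposed_bytes = "beyonce\u0301".encode("utf-8")   # 'e' + combining acute accent
precomposed_bytes = "beyonc\u00e9".encode("utf-8")   # single precomposed character

print(decomposed_bytes.hex(" "))    # 62 65 79 6f 6e 63 65 cc 81
print(precomposed_bytes.hex(" "))   # 62 65 79 6f 6e 63 c3 a9

# A prefix (wildcard) match on bytes succeeds only for the decomposed form,
# because its first seven bytes are exactly the ASCII bytes of "beyonce".
print(decomposed_bytes.startswith(query_prefix))    # True
print(precomposed_bytes.startswith(query_prefix))   # False
```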
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)