[jira] [Commented] (SOLR-7154) Wildcard query matches special characters

Toke Eskildsen (JIRA) Tue, 24 Feb 2015 12:23:35 -0800

    [ https://issues.apache.org/jira/browse/SOLR-7154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14335381#comment-14335381 ]


Toke Eskildsen commented on SOLR-7154:
--------------------------------------

To quote Steve Jobs: "You are holding it wrong": Your Beyoncé should be 
Beyoncé.

The difference between the two e's is that the first one has the accent added 
with the Unicode "Combining acute accent" (U+0301, 
http://www.fileformat.info/info/unicode/char/0301/index.htm), while the second 
one is a single "Latin small letter e with acute" (U+00E9, 
http://www.fileformat.info/info/unicode/char/e9/index.htm).
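
As a rough illustration of the two forms (plain Python, not Solr code; the 
variable and function names are made up for this sketch), the code points and 
UTF-8 bytes can be inspected like this:

```python
import unicodedata

# Both strings render as "beyoncé" but are built from different code points.
decomposed = "beyonce\u0301"   # plain 'e' followed by a combining acute accent
precomposed = "beyonc\u00e9"   # single precomposed letter


def code_points(text):
    """List each character as 'U+XXXX NAME'."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in text]


def utf8_hex(text):
    """Return the UTF-8 encoding as space-separated hex bytes."""
    return " ".join(f"{b:02x}" for b in text.encode("utf-8"))


print(code_points(decomposed[6:]))   # trailing 'e' + U+0301
print(code_points(precomposed[6:]))  # trailing U+00E9
print(utf8_hex(decomposed))          # 62 65 79 6f 6e 63 65 cc 81
print(utf8_hex(precomposed))         # 62 65 79 6f 6e 63 c3 a9

# A raw prefix comparison only matches the decomposed form.
print(decomposed.startswith("beyonce"), precomposed.startswith("beyonce"))  # True False
```

The decomposed string literally starts with the seven ASCII characters of 
"beyonce", which is why an un-normalized prefix query can match it.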

A proper normalizer would convert both forms of é into the same character, but 
you are using the raw string, so you do not have that luxury. If you use a text 
field, you can avoid this by normalizing into letters with built-in diacritics 
(as opposed to letters followed by combining diacritics). Unfortunately that 
does not work well if the user query contains a truncation with combining 
diacritics, as truncated queries are not normalized (which I think they should 
be, but that is a matter for another JIRA).
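
A minimal sketch of the idea, using Python's unicodedata module as a stand-in 
for a normalizer (this only models the behaviour and is not Solr's analysis 
chain; nfc and prefix_match are hypothetical helpers):

```python
import unicodedata


def nfc(text):
    """Compose letter + combining mark into a single precomposed letter."""
    return unicodedata.normalize("NFC", text)


def prefix_match(prefix, terms):
    """Naive stand-in for a truncated (prefix) query against indexed terms."""
    return [term for term in terms if term.startswith(prefix)]


# Index-time normalization: both spellings end up as the same term.
indexed = [nfc("beyonce\u0301"), nfc("beyonc\u00e9")]
assert indexed[0] == indexed[1] == "beyonc\u00e9"

# The ASCII prefix no longer matches the accented terms.
print(prefix_match("beyonce", indexed))             # []

# A truncated query typed with a combining accent misses unless it is
# normalized as well, which is the limitation described above.
print(len(prefix_match("beyonce\u0301", indexed)))       # 0
print(len(prefix_match(nfc("beyonce\u0301"), indexed)))  # 2
```

The last two calls show why normalizing only at index time is not enough: the 
query side has to go through the same normalization for prefixes to line up.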

> Wildcard query matches special characters
> -----------------------------------------
>
>                 Key: SOLR-7154
>                 URL: https://issues.apache.org/jira/browse/SOLR-7154
>             Project: Solr
>          Issue Type: Bug
>            Reporter: Arun Rangarajan
>            Priority: Minor
>
> I have a string field raw_name defined like this:
> {code}
> <fieldType name="string" class="solr.StrField" sortMissingLast="true" omitNorms="true"/>
> ...
> <field name="raw_name" type="string" indexed="true" stored="true" />
> {code}
> I have a document like this:
> {code}
> {raw_name: beyoncé}
> {code}
> Notice that the last character is a special character (accented e).
> When I issue this wildcard query:
> {code}
> q=raw_name:beyonce*
> {code}
> i.e. with the last character simply being the ASCII 'e', Solr returns me the 
> above document.
> Exact query:
> {code}
> /select?q=raw_name:beyonce*&wt=json&fl=raw_name
> {code}
> Response:
> {code}
> {
>   "responseHeader": {
>     "status": 0,
>     "QTime": 0,
>     "params": {
>       "fl": "raw_name",
>       "q": "raw_name:beyonce*",
>       "wt": "json"
>     }
>   },
>   "response": {
>     "numFound": 2,
>     "start": 0,
>     "docs": [
>       {
>         "raw_name": "beyoncé"
>       },
>       {
>         "raw_name": "beyoncé"
>       }
>     ]
>   }
> }
> {code}
> I used the analysis tool in Solr admin (with Jetty). The raw bytes look like 
> this:
> Raw bytes for beyonce: [62 65 79 6f 6e 63 65]
> Raw bytes for beyoncé: [62 65 79 6f 6e 63 65 cc 81]
> So when you look at the bytes, it seems to explain why beyonce* might match 
> beyoncé.


