[ 
https://issues.apache.org/jira/browse/SOLR-9250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15517479#comment-15517479
 ] 

Steve Rowe commented on SOLR-9250:
----------------------------------

StandardTokenizer currently implements the word break rules from Unicode 6.3.0, 
so the references below are to resources from that version.

In the rules and test cases below, the {{÷}} symbol means a break is required, 
and the {{×}} symbol means a break is disallowed.

The dollar sign (U+0024) and euro symbol (U+20AC) are both in the same class 
for the purposes of UAX#29's word break rules: "Other" (that is, not in any of 
the designated word break character classes; see the full list of word break 
properties here 
[http://www.unicode.org/Public/6.3.0/ucd/auxiliary/WordBreakProperty.txt]).  
For characters in the "Other" class, the word break rule WB14 applies. The test 
cases given below label this rule "999.0"; its definition is (from 
[http://www.unicode.org/reports/tr29/tr29-23.html#WB14]):

{quote}
Otherwise, break everywhere (including around ideographs).

WB14.   Any     ÷       Any
{quote}
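
To make the effect of WB14 concrete, here is a minimal Python sketch of a toy segmenter. It is not StandardTokenizer's implementation: the {{word_break_class}} function is a hypothetical stand-in for the real data in WordBreakProperty.txt, and only simplified versions of WB5 and WB8 through WB12 are modeled, with everything else falling through to WB14.

```python
import unicodedata

def word_break_class(ch):
    # Hypothetical approximation of the Word_Break property; the real
    # assignments live in WordBreakProperty.txt.
    if ch.isalpha():
        return "ALetter"
    if ch.isdigit():
        return "Numeric"
    if ch in ",;":
        return "MidNum"
    if ch == ".":
        return "MidNumLet"
    return "Other"  # includes "$" (U+0024) and "€" (U+20AC)

def no_break(classes, i):
    """True if a simplified rule forbids a break between i-1 and i."""
    prev, nxt = classes[i - 1], classes[i]
    word_like = ("ALetter", "Numeric")
    mid = ("MidNum", "MidNumLet")
    if prev in word_like and nxt in word_like:
        return True  # simplified WB5, WB8, WB9, WB10
    if prev in mid and nxt == "Numeric" and i >= 2 and classes[i - 2] == "Numeric":
        return True  # simplified WB11
    if prev == "Numeric" and nxt in mid and i + 1 < len(classes) and classes[i + 1] == "Numeric":
        return True  # simplified WB12
    return False     # WB14: otherwise, break everywhere

def segment(text):
    text = unicodedata.normalize("NFC", text)
    if not text:
        return []
    classes = [word_break_class(c) for c in text]
    pieces, start = [], 0
    for i in range(1, len(text)):
        if not no_break(classes, i):
            pieces.append(text[start:i])
            start = i
    pieces.append(text[start:])
    return pieces

print(segment("Tùûüÿ€àâæçéèêëïîôœm"))
```

Under these assumptions, a character in the "Other" class is always isolated from its neighbors, so the reporter's test string comes out as three segments: the letters before {{€}}, the {{€}} itself, and the letters after it.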

Unicode supplies a set of test cases for word break rules 
([http://www.unicode.org/Public/6.3.0/ucd/auxiliary/WordBreakTest.txt]).  
U+20AC doesn't appear in these test cases, but U+0024 does, in two of them. The 
character names and classes are given in the info after the {{#}} character on 
each line:

{noformat}
÷ 0061 ÷ 0024 ÷ 002D ÷ 0033 × 0034 × 002C × 0035 × 0036 × 0037 × 002E × 0031 × 
0034 ÷ 0025 ÷ 0062 ÷     #  ÷ [0.2] LATIN SMALL LETTER A (ALetter) ÷ [999.0] 
DOLLAR SIGN (Other) ÷ [999.0] HYPHEN-MINUS (Other) ÷ [999.0] DIGIT THREE 
(Numeric) × [8.0] DIGIT FOUR (Numeric) × [12.0] COMMA (MidNum) × [11.0] DIGIT 
FIVE (Numeric) × [8.0] DIGIT SIX (Numeric) × [8.0] DIGIT SEVEN (Numeric) × 
[12.0] FULL STOP (MidNumLet) × [11.0] DIGIT ONE (Numeric) × [8.0] DIGIT FOUR 
(Numeric) ÷ [999.0] PERCENT SIGN (Other) ÷ [999.0] LATIN SMALL LETTER B 
(ALetter) ÷ [0.3]
[...]
÷ 2060 ÷ 0061 × 2060 ÷ 0024 × 2060 ÷ 002D × 2060 ÷ 0033 × 2060 × 0034 × 2060 × 
002C × 2060 × 0035 × 2060 × 0036 × 2060 × 0037 × 2060 × 002E × 2060 × 0031 × 
2060 × 0034 × 2060 ÷ 0025 × 2060 ÷ 0062 × 2060 × 2060 ÷     #  ÷ [0.2] WORD 
JOINER (Format_FE) ÷ [999.0] LATIN SMALL LETTER A (ALetter) × [4.0] WORD JOINER 
(Format_FE) ÷ [999.0] DOLLAR SIGN (Other) × [4.0] WORD JOINER (Format_FE) 
{noformat}
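
The first test case above can be read mechanically. The following is a rough Python sketch, where {{segments_from_test_line}} is a hypothetical helper name and not part of Lucene or the Unicode tooling. It takes the part of a test line before the {{#}} and returns the segments the line prescribes.

```python
def segments_from_test_line(line):
    """Return the text segments prescribed by one WordBreakTest.txt line."""
    spec = line.split("#", 1)[0].split()  # drop the trailing comment
    segments, current = [], ""
    for token in spec:
        if token == "÷":          # break required: close the current segment
            if current:
                segments.append(current)
                current = ""
        elif token == "×":        # break disallowed: keep accumulating
            continue
        else:                     # a code point in hexadecimal
            current += chr(int(token, 16))
    if current:
        segments.append(current)
    return segments

TEST_LINE = ("÷ 0061 ÷ 0024 ÷ 002D ÷ 0033 × 0034 × 002C × 0035 × 0036 × 0037 "
             "× 002E × 0031 × 0034 ÷ 0025 ÷ 0062 ÷")

print(segments_from_test_line(TEST_LINE))
# prints ['a', '$', '-', '34,567.14', '%', 'b']
```

The dollar sign ends up as a segment of its own, separated from both the preceding letter and the following hyphen-minus.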


> Search breaks with EU symbol € and wildcard *
> ---------------------------------------------
>
>                 Key: SOLR-9250
>                 URL: https://issues.apache.org/jira/browse/SOLR-9250
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public) 
>          Components: Server
>    Affects Versions: 5.3.1
>            Reporter: Tim Nolan
>         Attachments: contact-name-analyze.png, contact-name-field-type.png
>
>
> While testing UTF-8 character searches, which worked, we noticed a 
> combination that fails. Testing with the data {{Tùûüÿ€àâæçéèêëïîôœm}}, we 
> found the search worked, but after adding a wildcard (e.g. 
> {{Tùûüÿ€àâæçéèêëïîôœm*}}), the search fails. Adding the wildcard before the 
> {{€}} symbol worked (i.e. {{Tùûüÿ*}}).
> The logs for these queries:
> {noformat:title=Full text without wildcard, hit=1}
> 2016-06-25 13:16:34.361 [qtp237852351-21] INFO  
> org.apache.solr.core.SolrCore.Request  – [core-name] webapp=/solr 
> path=/select 
> params={q=Tùûüÿ€àâæçéèêëïîôœm&indent=true&fq=type:CONTACT&rows=12&wt=json&_=1466860594348}
>  hits=1 status=0 QTime=0 
> {noformat}
> {noformat:title=Full text with wildcard, hit=0}
> 2016-06-25 13:16:41.172 [qtp237852351-16] INFO  
> org.apache.solr.core.SolrCore.Request  – [core-name] webapp=/solr 
> path=/select 
> params={q=Tùûüÿ€àâæçéèêëïîôœm*&indent=true&fq=type:CONTACT&rows=12&wt=json&_=1466860601160}
>  hits=0 status=0 QTime=0 
> {noformat}
> {noformat:title=Partial text before € with wildcard, hit=1}
> 2016-06-25 13:16:52.135 [qtp237852351-18] INFO  
> org.apache.solr.core.SolrCore.Request  – [core-name] webapp=/solr 
> path=/select 
> params={q=Tùûüÿ*&indent=true&fq=type:CONTACT&rows=12&wt=json&_=1466860612125} 
> hits=1 status=0 QTime=2 
> {noformat}


