[
https://issues.apache.org/jira/browse/SOLR-9250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15517479#comment-15517479
]
Steve Rowe commented on SOLR-9250:
----------------------------------
StandardTokenizer currently implements the word break rules from Unicode 6.3.0,
so the references below are to resources from that version.
In rules and test cases below, the {{÷}} symbol means a break is required, and
the {{×}} symbol means a break is disallowed.
The dollar sign (U+0024) and euro symbol (U+20AC) are both in the same class
for the purposes of UAX#29's word break rules: "Other" (that is, not assigned
to any of the designated word break character classes; see the full list of
word break properties at
[http://www.unicode.org/Public/6.3.0/ucd/auxiliary/WordBreakProperty.txt]).
For characters in the "Other" class (referred to as "999.0" in the test cases
given below), the word break rule WB14 applies (from
[http://www.unicode.org/reports/tr29/tr29-23.html#WB14]):
{quote}
Otherwise, break everywhere (including around ideographs).
WB14. Any ÷ Any
{quote}
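As a rough illustration of what WB14 implies for the string in the issue
description, here is a minimal Python sketch. It is not StandardTokenizer and
does not implement UAX#29: the {{OTHER}} set and the {{split_around_other}}
function are hypothetical names, and the set contains only the two characters
discussed here. Every other character is assumed to join with its neighbours,
which holds for the Latin letters in the reporter's data but not in general.

```python
# Hypothetical, simplified model of WB14 ("Any ÷ Any") for "Other" characters.
# Assumption: only U+0024 and U+20AC are treated as "Other"; all remaining
# characters are assumed to stay joined to their neighbours.
OTHER = {"\u0024", "\u20ac"}


def split_around_other(text):
    """Break before and after every character in OTHER."""
    segments = []
    current = ""
    for ch in text:
        if ch in OTHER:
            # Close the segment collected so far, then emit the "Other"
            # character as a segment of its own.
            if current:
                segments.append(current)
                current = ""
            segments.append(ch)
        else:
            current += ch
    if current:
        segments.append(current)
    return segments


print(split_around_other("Tùûüÿ€àâæçéèêëïîôœm"))
```

Under these assumptions the string is split into three segments, with {{€}}
standing alone; the first segment is the prefix used in the reporter's working
query {{Tùûüÿ*}}.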
Unicode supplies a set of test cases for word break rules
([http://www.unicode.org/Public/6.3.0/ucd/auxiliary/WordBreakTest.txt]).
U+20AC doesn't appear in these test cases, but U+0024 appears in two of them.
The character names and classes are given after the {{#}} character on each
line:
{noformat}
÷ 0061 ÷ 0024 ÷ 002D ÷ 0033 × 0034 × 002C × 0035 × 0036 × 0037 × 002E × 0031 ×
0034 ÷ 0025 ÷ 0062 ÷ # ÷ [0.2] LATIN SMALL LETTER A (ALetter) ÷ [999.0]
DOLLAR SIGN (Other) ÷ [999.0] HYPHEN-MINUS (Other) ÷ [999.0] DIGIT THREE
(Numeric) × [8.0] DIGIT FOUR (Numeric) × [12.0] COMMA (MidNum) × [11.0] DIGIT
FIVE (Numeric) × [8.0] DIGIT SIX (Numeric) × [8.0] DIGIT SEVEN (Numeric) ×
[12.0] FULL STOP (MidNumLet) × [11.0] DIGIT ONE (Numeric) × [8.0] DIGIT FOUR
(Numeric) ÷ [999.0] PERCENT SIGN (Other) ÷ [999.0] LATIN SMALL LETTER B
(ALetter) ÷ [0.3]
[...]
÷ 2060 ÷ 0061 × 2060 ÷ 0024 × 2060 ÷ 002D × 2060 ÷ 0033 × 2060 × 0034 × 2060 ×
002C × 2060 × 0035 × 2060 × 0036 × 2060 × 0037 × 2060 × 002E × 2060 × 0031 ×
2060 × 0034 × 2060 ÷ 0025 × 2060 ÷ 0062 × 2060 × 2060 ÷ # ÷ [0.2] WORD
JOINER (Format_FE) ÷ [999.0] LATIN SMALL LETTER A (ALetter) × [4.0] WORD JOINER
(Format_FE) ÷ [999.0] DOLLAR SIGN (Other) × [4.0] WORD JOINER (Format_FE)
{noformat}
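To make the first test case easier to read, the following Python sketch decodes
the part of a test line before the {{#}} character into the segments it
describes. The function name {{parse_word_break_test}} and the variable
{{first_case}} are made up for this illustration; the only input is the first
test line quoted above, joined back into a single line.

```python
def parse_word_break_test(line):
    """Decode one WordBreakTest.txt line into the text segments it describes.

    Tokens before '#' alternate between a break marker ('÷' = break required,
    '×' = break disallowed) and a hexadecimal code point.
    """
    rule_part = line.split("#", 1)[0]
    segments = []
    current = ""
    for token in rule_part.split():
        if token == "÷":
            # A required break closes the segment collected so far.
            if current:
                segments.append(current)
                current = ""
        elif token == "×":
            # No break allowed: keep extending the current segment.
            continue
        else:
            current += chr(int(token, 16))
    if current:
        segments.append(current)
    return segments


# First test case from above (code point part only).
first_case = (
    "÷ 0061 ÷ 0024 ÷ 002D ÷ 0033 × 0034 × 002C × 0035 × 0036 × 0037 "
    "× 002E × 0031 × 0034 ÷ 0025 ÷ 0062 ÷"
)
print(parse_word_break_test(first_case))
# prints ['a', '$', '-', '34,567.14', '%', 'b']
```

The dollar sign ends up as a segment of its own, with a required break on
either side, which is what WB14 prescribes for characters in the "Other" class.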
> Search breaks with EU symbol € and wildcard *
> ---------------------------------------------
>
> Key: SOLR-9250
> URL: https://issues.apache.org/jira/browse/SOLR-9250
> Project: Solr
> Issue Type: Bug
> Security Level: Public(Default Security Level. Issues are Public)
> Components: Server
> Affects Versions: 5.3.1
> Reporter: Tim Nolan
> Attachments: contact-name-analyze.png, contact-name-field-type.png
>
>
> While testing UTF-8 character searches, which worked, we noticed a
> combination that fails. Testing with the data {{Tùûüÿ€àâæçéèêëïîôœm}}, we
> found the search worked, but after adding a wildcard (e.g.
> {{Tùûüÿ€àâæçéèêëïîôœm*}}), the search fails. Placing the wildcard before the
> {{€}} symbol worked (i.e. {{Tùûüÿ*}}).
> Showing the logs for these queries:
> {noformat:title=Full text without wildcard, hit=1}
> 2016-06-25 13:16:34.361 [qtp237852351-21] INFO
> org.apache.solr.core.SolrCore.Request – [core-name] webapp=/solr
> path=/select
> params={q=Tùûüÿ€àâæçéèêëïîôœm&indent=true&fq=type:CONTACT&rows=12&wt=json&_=1466860594348}
> hits=1 status=0 QTime=0
> {noformat}
> {noformat:title=Full text with wildcard, hit=0}
> 2016-06-25 13:16:41.172 [qtp237852351-16] INFO
> org.apache.solr.core.SolrCore.Request – [core-name] webapp=/solr
> path=/select
> params={q=Tùûüÿ€àâæçéèêëïîôœm*&indent=true&fq=type:CONTACT&rows=12&wt=json&_=1466860601160}
> hits=0 status=0 QTime=0
> {noformat}
> {noformat:title=Partial text before € with wildcard, hit=1}
> 2016-06-25 13:16:52.135 [qtp237852351-18] INFO
> org.apache.solr.core.SolrCore.Request – [core-name] webapp=/solr
> path=/select
> params={q=Tùûüÿ*&indent=true&fq=type:CONTACT&rows=12&wt=json&_=1466860612125}
> hits=1 status=0 QTime=2
> {noformat}