[
https://issues.apache.org/jira/browse/SOLR-13448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16834423#comment-16834423
]
Steve Rowe commented on SOLR-13448:
-----------------------------------
The documentation is wrong. The quoted sentence was inherited from the Classic
Tokenizer's description. The UAX29 URL Email Tokenizer is a specialization of
the Standard Tokenizer, whose 7.2 documentation says the following:
    "Note that words are split at hyphens."
The ref guide should be updated to use the above sentence.
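For illustration only, here is a rough Python sketch of the splitting behavior
on the reporter's two examples. It is not the Lucene implementation: the
function name split_at_hyphens is hypothetical, and the sketch assumes a token
is simply a maximal run of ASCII letters and digits. The real tokenizer follows
the UAX #29 word-break rules and also keeps URLs and e-mail addresses whole,
which this sketch does not attempt.

```python
import re

# Simplified stand-in for the tokenizer's handling of hyphens.
# Assumption: a token is a maximal run of ASCII letters and digits.
# Letters and digits stay together; a hyphen always ends the token.
_TOKEN = re.compile(r"[A-Za-z0-9]+")


def split_at_hyphens(text):
    """Return the alphanumeric runs in text (hypothetical helper)."""
    return _TOKEN.findall(text)


if __name__ == "__main__":
    for sample in ("ABC-123", "AB12-CD34"):
        print(sample, "->", split_at_hyphens(sample))
```

Under these assumptions, a digit in the word does not prevent the split at
the hyphen, which matches the behavior reported below rather than the ref
guide's current wording.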
> UAX29 URL Email Tokenizer: Ref guide description of hyphen handling is wrong
> ----------------------------------------------------------------------------
>
> Key: SOLR-13448
> URL: https://issues.apache.org/jira/browse/SOLR-13448
> Project: Solr
> Issue Type: Bug
> Security Level: Public(Default Security Level. Issues are Public)
> Components: documentation
> Affects Versions: 7.2
> Reporter: Steve Rowe
> Assignee: Steve Rowe
> Priority: Minor
>
> As reported on the Solr user mailing list by Tom Van Cuyck:
> The UAX29 URL Email Tokenizer is not working as expected.
> According to the documentation (
> https://lucene.apache.org/solr/guide/7_2/tokenizers.html): "Words are split
> at hyphens, unless there is a number in the word, in which case the token
> is not split and the numbers and hyphen(s) are preserved."
> So I expect "ABC-123" to remain "ABC-123"
> However the term is split in 2 separate tokens "ABC" and "123".
> Same for "AB12-CD34" --> "AB12" and "CD34" etc...
> Is this behavior to be expected? Or is there a way to get the behavior I
> expect?