[ https://issues.apache.org/jira/browse/SOLR-13448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16834423#comment-16834423 ]
Steve Rowe commented on SOLR-13448:
-----------------------------------

The documentation is wrong. The quoted sentence was inherited from Classic Tokenizer's description. UAX 29 URL Email Tokenizer is a specialization of Standard Tokenizer, the 7.2 documentation for which says the following:

    Note that words are split at hyphens.

The ref guide should be updated to use the above sentence.

> UAX29 URL Email Tokenizer: Ref guide description of hyphen handling is wrong
> ----------------------------------------------------------------------------
>
>                 Key: SOLR-13448
>                 URL: https://issues.apache.org/jira/browse/SOLR-13448
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public (Default Security Level. Issues are Public)
>          Components: documentation
>    Affects Versions: 7.2
>            Reporter: Steve Rowe
>            Assignee: Steve Rowe
>            Priority: Minor
>
> As reported on the Solr user mailing list by Tom Van Cuyck:
>
> The UAX29 URL Email Tokenizer is not working as expected.
> According to the documentation (https://lucene.apache.org/solr/guide/7_2/tokenizers.html): "Words are split at hyphens, unless there is a number in the word, in which case the token is not split and the numbers and hyphen(s) are preserved."
> So I expect "ABC-123" to remain "ABC-123"
> However the term is split in 2 separate tokens "ABC" and "123".
> Same for "AB12-CD34" --> "AB12" and "CD34" etc...
> Is this behavior to be expected? Or is there a way to get the behavior I expect?
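
For readers comparing the two descriptions quoted in this issue, the difference can be illustrated with a toy model. This is only a sketch of the two sentences as worded, not Lucene's implementation: the real tokenizers follow grammar-based rules that handle far more than hyphens. Both function names are hypothetical, and the sketch assumes whitespace-separated input.

```python
import re


def split_per_quoted_ref_guide(text):
    """Toy model of the sentence quoted from the 7.2 ref guide (inherited
    from Classic Tokenizer's description): split at hyphens unless the
    word contains a number. Hypothetical helper, not a Lucene API."""
    tokens = []
    for word in text.split():
        if "-" in word and any(ch.isdigit() for ch in word):
            # A digit is present: keep the hyphenated word as one token.
            tokens.append(word)
        else:
            tokens.extend(part for part in word.split("-") if part)
    return tokens


def split_at_hyphens(text):
    """Toy model of the Standard Tokenizer sentence: words are split at
    hyphens, with no exception for numbers. Hypothetical helper."""
    return [part for part in re.split(r"[\s-]+", text) if part]


for sample in ("ABC-123", "AB12-CD34"):
    print(sample, split_per_quoted_ref_guide(sample), split_at_hyphens(sample))
```

Under the first rule "ABC-123" stays whole, which is what the reporter expected from the documentation; under the second it becomes "ABC" and "123", which matches the behavior reported above.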