[ 
https://issues.apache.org/jira/browse/SOLR-13448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16834423#comment-16834423
 ] 

Steve Rowe commented on SOLR-13448:
-----------------------------------

The documentation is wrong.  The quoted sentence was inherited from the
Classic Tokenizer's description.  The UAX29 URL Email Tokenizer is a
specialization of the Standard Tokenizer, whose 7.2 documentation says:

    Note that words are split at hyphens.

The ref guide should be updated to use the above sentence.
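
To make the corrected sentence concrete, here is a rough Python sketch of the
splitting behavior for the two inputs from the report.  It is not Lucene's
UAX#29 implementation: the helper name split_at_hyphens and the regex are
assumptions made for illustration, and it only treats runs of ASCII letters
and digits as tokens (it does not handle URLs, e-mail addresses, or the other
Unicode word-break rules).

```python
import re

# Illustrative stand-in only; NOT the Lucene/Solr tokenizer.
# Assumption: a token is a maximal run of ASCII letters and digits,
# so a hyphen always ends the current token, even next to digits.
_TOKEN = re.compile(r"[A-Za-z0-9]+")


def split_at_hyphens(text):
    """Return the alphanumeric runs in text, splitting at hyphens."""
    return _TOKEN.findall(text)


if __name__ == "__main__":
    for term in ("ABC-123", "AB12-CD34"):
        print(term, "->", split_at_hyphens(term))
```

For "ABC-123" the sketch yields "ABC" and "123", and for "AB12-CD34" it yields
"AB12" and "CD34", matching the tokens observed in the report.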


> UAX29 URL Email Tokenizer: Ref guide description of hyphen handling is wrong
> ----------------------------------------------------------------------------
>
>                 Key: SOLR-13448
>                 URL: https://issues.apache.org/jira/browse/SOLR-13448
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public) 
>          Components: documentation
>    Affects Versions: 7.2
>            Reporter: Steve Rowe
>            Assignee: Steve Rowe
>            Priority: Minor
>
> As reported on the Solr user mailing list by Tom Van Cuyck:
> The UAX29 URL Email Tokenizer is not working as expected.
> According to the documentation (
> https://lucene.apache.org/solr/guide/7_2/tokenizers.html): "Words are split
> at hyphens, unless there is a number in the word, in which case the token
> is not split and the numbers and hyphen(s) are preserved."
> So I expect "ABC-123" to remain "ABC-123"
> However, the term is split into 2 separate tokens, "ABC" and "123".
> Same for "AB12-CD34" --> "AB12" and "CD34" etc...
> Is this behavior to be expected? Or is there a way to get the behavior I
> expect?


