[ 
https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14985593#comment-14985593
 ] 

Uwe Schindler commented on LUCENE-6874:
---------------------------------------

bq. In short, the benefits to Solr users for NBSP being tokenized as white 
space seem to outweigh any minor use cases for treating it as non-white space. 
A compatibility mode can be provided if those minor use cases are considered 
truly worthwhile.

As said before: if we want to change this, we need a new Tokenizer with a new 
name and a new Factory. Please don't add new matchVersion constants for that, 
because this is a huge break. The Tokenizer does what it should and what is 
documented: this is not a bug.

And this still holds: users should prefer StandardTokenizer. The wide usage of 
WhitespaceTokenizer is caused by tons of example configs from earlier Solr days 
that use WhitespaceTokenizer together with the broken WordDestroyerFilter. That 
combination is indeed only useful for product numbers, not for full text.

> WhitespaceTokenizer should tokenize on NBSP
> -------------------------------------------
>
>                 Key: LUCENE-6874
>                 URL: https://issues.apache.org/jira/browse/LUCENE-6874
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/analysis
>            Reporter: David Smiley
>            Priority: Minor
>
> WhitespaceTokenizer uses [Character.isWhitespace 
> |http://docs.oracle.com/javase/8/docs/api/java/lang/Character.html#isWhitespace-int-]
>  to decide what is whitespace.  Here's a pertinent excerpt:
> bq. It is a Unicode space character (SPACE_SEPARATOR, LINE_SEPARATOR, or 
> PARAGRAPH_SEPARATOR) but is not also a non-breaking space ('\u00A0', 
> '\u2007', '\u202F')
> Perhaps Character.isWhitespace should have been called 
> isLineBreakableWhitespace?
> I think WhitespaceTokenizer should tokenize on this.  I am aware it's easy to 
> work around but why leave this trap in by default?
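
For reference, the behavior described in the issue can be reproduced with the 
JDK alone. The sketch below only illustrates how Character.isWhitespace and 
Character.isSpaceChar classify the three non-breaking spaces. The class name 
NbspWhitespaceDemo and the helper isWhitespaceOrNbsp are hypothetical names 
made up for this example; they are not part of Lucene, and this is not a 
proposed patch.

```java
public class NbspWhitespaceDemo {

    // Hypothetical helper: treats the three non-breaking spaces that
    // Character.isWhitespace excludes as whitespace too.
    static boolean isWhitespaceOrNbsp(int codePoint) {
        return Character.isWhitespace(codePoint)
            || codePoint == 0x00A0   // NO-BREAK SPACE
            || codePoint == 0x2007   // FIGURE SPACE
            || codePoint == 0x202F;  // NARROW NO-BREAK SPACE
    }

    public static void main(String[] args) {
        int[] samples = {0x0020, 0x0009, 0x00A0, 0x2007, 0x202F};
        for (int cp : samples) {
            // isSpaceChar follows the Unicode space categories only;
            // isWhitespace additionally excludes the non-breaking spaces.
            System.out.printf("U+%04X isWhitespace=%b isSpaceChar=%b helper=%b%n",
                cp,
                Character.isWhitespace(cp),
                Character.isSpaceChar(cp),
                isWhitespaceOrNbsp(cp));
        }
    }
}
```

A tokenizer that splits on Character.isWhitespace therefore keeps text joined 
by NBSP in a single token, which is the documented behavior under discussion.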


