[jira] [Commented] (LUCENE-6874) WhitespaceTokenizer should tokenize on NBSP

Uwe Schindler (JIRA) Mon, 02 Nov 2015 03:18:07 -0800

    [ https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14985076#comment-14985076 ]


Uwe Schindler commented on LUCENE-6874:
---------------------------------------

My personal opinion on this:
- The thing is called WhitespaceTokenizer, so it should do what the name says 
(split on isWhitespace).
- If we want something else, maybe provide a separate CharTokenizer 
implementation that also splits on NBSP
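
A minimal sketch of what the second option could look like, written in plain 
Java so it runs without Lucene on the classpath. The class and method names 
here are hypothetical. In Lucene the predicate would go into the 
isTokenChar(int) override of a CharTokenizer subclass; the hand-rolled split 
loop below only stands in for the tokenizer machinery.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of Lucene.
final class NbspAwareSplit {

    // True for characters that should end a token: everything that
    // Character.isWhitespace accepts, plus the three non-breaking
    // spaces that it deliberately excludes.
    static boolean isSeparator(int c) {
        return Character.isWhitespace(c)
            || c == 0x00A0   // NO-BREAK SPACE
            || c == 0x2007   // FIGURE SPACE
            || c == 0x202F;  // NARROW NO-BREAK SPACE
    }

    // The value a CharTokenizer subclass would return from isTokenChar(int).
    static boolean isTokenChar(int c) {
        return !isSeparator(c);
    }

    // Stand-in for the tokenizer: collect maximal runs of token characters.
    static List<String> split(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int c = text.codePointAt(i);
            i += Character.charCount(c);
            if (isTokenChar(c)) {
                current.appendCodePoint(c);
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // The first gap is a NO-BREAK SPACE, the second a plain space.
        System.out.println(split("red\u00A0wine white")); // prints [red, wine, white]
    }
}
```

Keeping this in a separate class leaves the existing WhitespaceTokenizer 
behaviour untouched for anyone who relies on it.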

In general, WhitespaceTokenizer is not used for "classical" full-text search. 
For that type of text, StandardTokenizer, the ICU tokenizers, or the 
language-specific ones for Chinese or Japanese are a better fit. People who use 
WhitespaceTokenizer tend to have very special types of fields, such as a list of 
whitespace-separated tokens used for faceting, or a list of product numbers. 
These kinds of tokens have always been easy to handle with WhitespaceTokenizer. 
If you wanted to keep your facet tokens together, you could use NBSP! So a 
change here would break those apps :-)
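
To illustrate the NBSP trick, here is a small plain-Java sketch that splits 
only on Character.isWhitespace, which is the rule WhitespaceTokenizer applies. 
The class name and the facet values are made up for the example, and the loop 
is a simplified stand-in for the real tokenizer.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class; the facet values are invented.
final class WhitespaceFacetDemo {

    // Split on Character.isWhitespace only. Non-breaking spaces
    // (U+00A0, U+2007, U+202F) are not whitespace by that definition,
    // so they stay inside the token.
    static List<String> splitOnWhitespace(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int c = text.codePointAt(i);
            i += Character.charCount(c);
            if (!Character.isWhitespace(c)) {
                current.appendCodePoint(c);
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // "New" and "York" are joined by a NO-BREAK SPACE; a plain space
        // separates that facet value from "Boston".
        List<String> facets = splitOnWhitespace("New\u00A0York Boston");
        System.out.println(facets.size()); // prints 2
    }
}
```

The two-word facet value survives as a single token, which is the behaviour 
existing applications may depend on.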

So I would just update the documentation to explain what this thing does 
(splitting on characters for which Character.isWhitespace returns true, which 
does not include non-breaking spaces such as NBSP).

> WhitespaceTokenizer should tokenize on NBSP
> -------------------------------------------
>
>                 Key: LUCENE-6874
>                 URL: https://issues.apache.org/jira/browse/LUCENE-6874
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/analysis
>            Reporter: David Smiley
>            Priority: Minor
>
> WhitespaceTokenizer uses 
> [Character.isWhitespace|http://docs.oracle.com/javase/8/docs/api/java/lang/Character.html#isWhitespace-int-]
> to decide what is whitespace.  Here's a pertinent excerpt:
> bq. It is a Unicode space character (SPACE_SEPARATOR, LINE_SEPARATOR, or 
> PARAGRAPH_SEPARATOR) but is not also a non-breaking space ('\u00A0', 
> '\u2007', '\u202F')
> Perhaps Character.isWhitespace should have been called 
> isLineBreakableWhitespace?
> I think WhitespaceTokenizer should tokenize on this.  I am aware it's easy to 
> work around but why leave this trap in by default?



