[jira] [Commented] (LUCENE-6874) WhitespaceTokenizer should tokenize on NBSP

Adrien Grand (JIRA) Mon, 09 Nov 2015 15:30:40 -0800

    [ https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14997645#comment-14997645 ]


Adrien Grand commented on LUCENE-6874:
--------------------------------------

I tend to like Uwe's idea. I have often wondered what the actual use cases of 
WhitespaceTokenizer are, but I did not suggest removing it because the cost 
of maintenance was very low given its simplicity. However, now that some 
controversy is arising, and given how simple it is to create character-based 
tokenizers in trunk ({{Tokenizer tok = 
CharTokenizer.fromSeparatorCharPredicate(Character::isWhitespace);}}), maybe we 
should just remove this tokenizer and let users define it themselves with the 
more flexible {{CharTokenizer.fromSeparatorCharPredicate}}?
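
To make the difference concrete, here is a minimal plain-Java sketch of the kind of separator predicate a user could define themselves. The class name {{NbspSeparatorSketch}}, the constant {{IS_SEPARATOR}} and the {{split}} helper are hypothetical names invented for this illustration; {{split}} is only a stand-in so the example runs without Lucene and is not how CharTokenizer is implemented. Combining {{Character.isWhitespace}} with {{Character.isSpaceChar}} is one possible way to cover the non-breaking spaces, not necessarily what any of the attached patches do.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntPredicate;

public class NbspSeparatorSketch {

    // Hypothetical predicate: anything Character.isWhitespace accepts, plus
    // the Unicode space characters it excludes (U+00A0, U+2007, U+202F).
    static final IntPredicate IS_SEPARATOR =
        c -> Character.isWhitespace(c) || Character.isSpaceChar(c);

    // Plain-Java stand-in for a separator-based tokenizer (not Lucene code):
    // emits the runs of code points between separators.
    static List<String> split(String text, IntPredicate isSeparator) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        text.codePoints().forEach(cp -> {
            if (isSeparator.test(cp)) {
                if (current.length() > 0) {
                    tokens.add(current.toString());
                    current.setLength(0);
                }
            } else {
                current.appendCodePoint(cp);
            }
        });
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "foo\u00A0bar baz"; // NBSP between "foo" and "bar"

        // Character::isWhitespace does not treat NBSP as a separator,
        // so "foo" and "bar" stay glued together: 2 tokens.
        System.out.println(split(text, Character::isWhitespace).size());

        // The combined predicate also splits on NBSP: 3 tokens.
        System.out.println(split(text, IS_SEPARATOR).size());
    }
}
```

In trunk, the same predicate could presumably be handed straight to {{CharTokenizer.fromSeparatorCharPredicate}} in place of {{Character::isWhitespace}}.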

> WhitespaceTokenizer should tokenize on NBSP
> -------------------------------------------
>
>                 Key: LUCENE-6874
>                 URL: https://issues.apache.org/jira/browse/LUCENE-6874
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/analysis
>            Reporter: David Smiley
>            Priority: Minor
>         Attachments: LUCENE-6874-jflex.patch, LUCENE-6874.patch, 
> LUCENE_6874_jflex.patch
>
>
> WhitespaceTokenizer uses 
> [Character.isWhitespace|http://docs.oracle.com/javase/8/docs/api/java/lang/Character.html#isWhitespace-int-] 
> to decide what is whitespace.  Here's a pertinent excerpt:
> bq. It is a Unicode space character (SPACE_SEPARATOR, LINE_SEPARATOR, or 
> PARAGRAPH_SEPARATOR) but is not also a non-breaking space ('\u00A0', 
> '\u2007', '\u202F')
> Perhaps Character.isWhitespace should have been called 
> isLineBreakableWhitespace?
> I think WhitespaceTokenizer should tokenize on this.  I am aware it's easy to 
> work around but why leave this trap in by default?


