[
https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14997645#comment-14997645
]
Adrien Grand commented on LUCENE-6874:
--------------------------------------
I tend to like Uwe's idea. I have often wondered what the actual use cases of
WhitespaceTokenizer are, but I never suggested removing it because its
simplicity kept the maintenance cost very low. However, now that some
controversy is arising, and given how simple it is to create character-based
tokenizers in trunk ({{Tokenizer tok =
CharTokenizer.fromSeparatorCharPredicate(Character::isWhitespace);}}), maybe we
should just remove this tokenizer and let users define it themselves with the
more flexible {{CharTokenizer.fromSeparatorCharPredicate}}?
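To illustrate what "define it themselves" could look like, here is a minimal JDK-only sketch of a separator predicate that also treats the three non-breaking spaces as separators. The class name {{SeparatorPredicateSketch}} and the {{split}} helper are hypothetical stand-ins, not Lucene code; the assumption is that the same {{IntPredicate}} could be handed to {{CharTokenizer.fromSeparatorCharPredicate}} instead.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntPredicate;

public class SeparatorPredicateSketch {

    // Hypothetical helper (not part of Lucene): splits text at every code
    // point for which the predicate returns true and drops empty tokens.
    static List<String> split(String text, IntPredicate isSeparator) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        text.codePoints().forEach(cp -> {
            if (isSeparator.test(cp)) {
                if (current.length() > 0) {
                    tokens.add(current.toString());
                    current.setLength(0);
                }
            } else {
                current.appendCodePoint(cp);
            }
        });
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // The predicate from the comment above: plain Character.isWhitespace.
        IntPredicate jdkWhitespace = Character::isWhitespace;
        // Widened predicate: also split on U+00A0, U+2007 and U+202F.
        IntPredicate withNbsp =
            jdkWhitespace.or(cp -> cp == 0x00A0 || cp == 0x2007 || cp == 0x202F);

        String text = "foo\u00A0bar baz";
        System.out.println(split(text, jdkWhitespace).size()); // NBSP keeps "foo" and "bar" together
        System.out.println(split(text, withNbsp).size());      // NBSP now separates them
    }
}
```

With plain {{Character::isWhitespace}} the input yields two tokens, because the NBSP between "foo" and "bar" is not a separator; with the widened predicate it yields three.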
> WhitespaceTokenizer should tokenize on NBSP
> -------------------------------------------
>
> Key: LUCENE-6874
> URL: https://issues.apache.org/jira/browse/LUCENE-6874
> Project: Lucene - Core
> Issue Type: Improvement
> Components: modules/analysis
> Reporter: David Smiley
> Priority: Minor
> Attachments: LUCENE-6874-jflex.patch, LUCENE-6874.patch,
> LUCENE_6874_jflex.patch
>
>
> WhitespaceTokenizer uses [Character.isWhitespace|http://docs.oracle.com/javase/8/docs/api/java/lang/Character.html#isWhitespace-int-]
> to decide what is whitespace. Here's a pertinent excerpt:
> bq. It is a Unicode space character (SPACE_SEPARATOR, LINE_SEPARATOR, or
> PARAGRAPH_SEPARATOR) but is not also a non-breaking space ('\u00A0',
> '\u2007', '\u202F')
> Perhaps Character.isWhitespace should have been called
> isLineBreakableWhitespace?
> I think WhitespaceTokenizer should tokenize on these non-breaking spaces as
> well. I am aware it's easy to work around, but why leave this trap in by
> default?
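The behaviour described in the issue can be checked with the JDK alone. The following is a small sketch, not Lucene code; the class name {{NbspWhitespaceDemo}} and the helper {{isWhitespaceOrNbsp}} are hypothetical and only illustrate one possible widened check.

```java
public class NbspWhitespaceDemo {

    // The three non-breaking spaces that Character.isWhitespace excludes,
    // as listed in the Javadoc excerpt quoted above.
    static boolean isNonBreakingSpace(int cp) {
        return cp == 0x00A0 || cp == 0x2007 || cp == 0x202F;
    }

    // Hypothetical widened check: JDK whitespace plus the non-breaking spaces.
    static boolean isWhitespaceOrNbsp(int cp) {
        return Character.isWhitespace(cp) || isNonBreakingSpace(cp);
    }

    public static void main(String[] args) {
        int[] samples = {' ', '\t', 0x00A0, 0x2007, 0x202F};
        for (int cp : samples) {
            System.out.printf(
                "U+%04X isWhitespace=%b isSpaceChar=%b widened=%b%n",
                cp,
                Character.isWhitespace(cp),
                Character.isSpaceChar(cp),
                isWhitespaceOrNbsp(cp));
        }
    }
}
```

For the three non-breaking spaces, {{Character.isWhitespace}} returns false while {{Character.isSpaceChar}} returns true, which is the trap the issue describes. Note that {{Character.isSpaceChar}} alone is not a drop-in replacement, since it returns false for control characters such as tab.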