[ https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14985540#comment-14985540 ]
Jack Krupansky commented on LUCENE-6874:
----------------------------------------

+1 for using the Unicode definition of white space rather than the (odd) Java definition. From a Solr user perspective, the fact that Java is used for implementation under the hood should be irrelevant. That said, the Javadoc for WhitespaceTokenizer#isTokenChar does explicitly refer to isWhitespace already.

The term "non-breaking white space" explicitly refers to line breaking and has no mention of tokens in either Unicode or traditional casual usage.

From a Solr user perspective, there is like zero value to having NBSP from HTML web pages being treated as if it were not traditional white space.

From a Solr user perspective, the primary use of whitespace tokenizer is to avoid the fact that standard tokenizer breaks on various special characters such as occur in product numbers.

In short, the benefits to Solr users for NBSP being tokenized as white space seem to outweigh any minor use cases for treating it as non-white space. A compatibility mode can be provided if those minor use cases are considered truly worthwhile.

> WhitespaceTokenizer should tokenize on NBSP
> -------------------------------------------
>
>                 Key: LUCENE-6874
>                 URL: https://issues.apache.org/jira/browse/LUCENE-6874
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/analysis
>            Reporter: David Smiley
>            Priority: Minor
>
> WhitespaceTokenizer uses [Character.isWhitespace|http://docs.oracle.com/javase/8/docs/api/java/lang/Character.html#isWhitespace-int-] to decide what is whitespace. Here's a pertinent excerpt:
> bq. It is a Unicode space character (SPACE_SEPARATOR, LINE_SEPARATOR, or PARAGRAPH_SEPARATOR) but is not also a non-breaking space ('\u00A0', '\u2007', '\u202F')
> Perhaps Character.isWhitespace should have been called isLineBreakableWhitespace?
> I think WhitespaceTokenizer should tokenize on this. I am aware it's easy to work around but why leave this trap in by default?
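
For reference, the behaviour described in the excerpt can be reproduced with the JDK alone. The sketch below is illustrative only: NbspWhitespaceDemo, isSeparator, and tokenize are hypothetical names, not Lucene APIs, and the combined check is a rough stand-in for "Unicode white space" rather than an exact implementation of the Unicode White_Space property.

```java
import java.util.ArrayList;
import java.util.List;

public class NbspWhitespaceDemo {

    // Hypothetical helper (an assumption, not part of Lucene): treats a code
    // point as a separator if Java calls it whitespace OR if it is a Unicode
    // space/line/paragraph separator, which includes the non-breaking spaces.
    static boolean isSeparator(int cp) {
        return Character.isWhitespace(cp) || Character.isSpaceChar(cp);
    }

    // Minimal whitespace-style tokenizer sketch that splits on isSeparator.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (isSeparator(cp)) {
                // End of a token: emit it if non-empty.
                if (current.length() > 0) {
                    tokens.add(current.toString());
                    current.setLength(0);
                }
            } else {
                current.appendCodePoint(cp);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // The three non-breaking spaces named in the Character.isWhitespace Javadoc.
        int[] nonBreakingSpaces = {0x00A0, 0x2007, 0x202F};
        for (int cp : nonBreakingSpaces) {
            System.out.printf("U+%04X isWhitespace=%b isSpaceChar=%b%n",
                    cp, Character.isWhitespace(cp), Character.isSpaceChar(cp));
        }
        // A product-number style input with an NBSP between the first two terms.
        System.out.println(tokenize("ABC-123\u00A0XYZ-9 in stock"));
    }
}
```

With Character.isWhitespace alone, "ABC-123" and "XYZ-9" in the sample input would stay glued together as one token; the combined check splits them while leaving the hyphens inside the product numbers intact.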