[jira] [Commented] (LUCENE-6874) WhitespaceTokenizer should tokenize on NBSP

Jack Krupansky (JIRA) Mon, 02 Nov 2015 09:04:06 -0800

    [ 
https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14985540#comment-14985540
 ]


Jack Krupansky commented on LUCENE-6874:
----------------------------------------

+1 for using the Unicode definition of white space rather than the (odd) Java 
definition. From a Solr user perspective, the fact that Java is used for 
implementation under the hood should be irrelevant. That said, the Javadoc for 
WhitespaceTokenizer#isTokenChar does explicitly refer to isWhitespace already.
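
For reference, here is a minimal sketch of the difference using only the JDK. The class and method names are made up for illustration, and the "broad" check is only an approximation of the Unicode White_Space property (for example, it does not cover U+0085), not Lucene's actual code:

```java
// Hypothetical demo class; only Character.isWhitespace and
// Character.isSpaceChar are real JDK methods.
final class NbspWhitespaceDemo {

    // Java's definition: excludes the non-breaking spaces U+00A0, U+2007, U+202F.
    static boolean javaWhitespace(int codePoint) {
        return Character.isWhitespace(codePoint);
    }

    // Assumption: a rough stand-in for "Unicode white space", built by also
    // accepting every SPACE_SEPARATOR, LINE_SEPARATOR and PARAGRAPH_SEPARATOR.
    static boolean broadWhitespace(int codePoint) {
        return Character.isWhitespace(codePoint) || Character.isSpaceChar(codePoint);
    }

    public static void main(String[] args) {
        int[] nonBreakingSpaces = {0x00A0, 0x2007, 0x202F};
        for (int cp : nonBreakingSpaces) {
            // For all three, javaWhitespace is false and broadWhitespace is true.
            System.out.println(String.format("U+%04X java=%b broad=%b",
                    cp, javaWhitespace(cp), broadWhitespace(cp)));
        }
    }
}
```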

The term "non-breaking white space" refers explicitly to line breaking; 
neither Unicode nor traditional casual usage says anything about tokens.

From a Solr user perspective, there is essentially zero value in having NBSP 
from HTML web pages treated as if it were not traditional white space.

From a Solr user perspective, the primary use of whitespace tokenizer is to 
avoid the fact that standard tokenizer breaks on various special characters 
such as occur in product numbers.

In short, the benefits to Solr users for NBSP being tokenized as white space 
seem to outweigh any minor use cases for treating it as non-white space. A 
compatibility mode can be provided if those minor use cases are considered 
truly worthwhile.
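
To make the compatibility-mode idea concrete, here is a rough sketch in plain Java. It is not the Lucene API: the class, the split method and the javaCompat flag are all hypothetical, and the non-compat branch only approximates Unicode white space by combining Character.isWhitespace with Character.isSpaceChar.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; not Lucene's WhitespaceTokenizer.
final class WhitespaceSplitSketch {

    // javaCompat=true keeps the current behavior (NBSP stays inside a token);
    // javaCompat=false also breaks on the non-breaking spaces.
    static boolean isTokenChar(int codePoint, boolean javaCompat) {
        if (javaCompat) {
            return !Character.isWhitespace(codePoint);
        }
        return !(Character.isWhitespace(codePoint) || Character.isSpaceChar(codePoint));
    }

    static List<String> split(String text, boolean javaCompat) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (isTokenChar(cp, javaCompat)) {
                current.appendCodePoint(cp);
            } else if (current.length() > 0) {
                // A white space character ends the current token.
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }
}
```

With the input "ABC-123\u00A0XYZ/9 foo", split(text, false) yields three tokens (ABC-123, XYZ/9, foo), while split(text, true) leaves the NBSP inside the first of two tokens. The special characters in the product numbers are preserved either way.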


> WhitespaceTokenizer should tokenize on NBSP
> -------------------------------------------
>
>                 Key: LUCENE-6874
>                 URL: https://issues.apache.org/jira/browse/LUCENE-6874
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/analysis
>            Reporter: David Smiley
>            Priority: Minor
>
> WhitespaceTokenizer uses [Character.isWhitespace 
> |http://docs.oracle.com/javase/8/docs/api/java/lang/Character.html#isWhitespace-int-]
>  to decide what is whitespace.  Here's a pertinent excerpt:
> bq. It is a Unicode space character (SPACE_SEPARATOR, LINE_SEPARATOR, or 
> PARAGRAPH_SEPARATOR) but is not also a non-breaking space ('\u00A0', 
> '\u2007', '\u202F')
> Perhaps Character.isWhitespace should have been called 
> isLineBreakableWhitespace?
> I think WhitespaceTokenizer should tokenize on this.  I am aware it's easy to 
> work around but why leave this trap in by default?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
