[jira] [Comment Edited] (LUCENE-6874) WhitespaceTokenizer should tokenize on NBSP

Jack Krupansky (JIRA) Mon, 02 Nov 2015 09:35:59 -0800

    [ https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14985540#comment-14985540 ]


Jack Krupansky edited comment on LUCENE-6874 at 11/2/15 5:34 PM:
-----------------------------------------------------------------

+1 for using the Unicode definition of white space rather than the (odd) Java 
definition. From a Solr user perspective, the fact that Java is used for 
implementation under the hood should be irrelevant. That said, the Javadoc for 
WhitespaceTokenizer#isTokenChar does explicitly refer to isWhitespace already.
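
To make the difference concrete, here is a minimal plain-Java sketch (not taken 
from the issue or from Lucene; the class and method names are made up for 
illustration) comparing Character.isWhitespace with Character.isSpaceChar on a 
regular space and on the three non-breaking spaces that the isWhitespace 
Javadoc excludes:

```java
// Sketch only: NbspCheck and its method names are hypothetical, not Lucene API.
final class NbspCheck {
    // Java's definition of white space, which excludes the non-breaking spaces.
    static boolean javaWhitespace(int cp) {
        return Character.isWhitespace(cp);
    }

    // Unicode space, line and paragraph separators, which include them.
    static boolean unicodeSpaceChar(int cp) {
        return Character.isSpaceChar(cp);
    }

    public static void main(String[] args) {
        // SPACE, NO-BREAK SPACE, FIGURE SPACE, NARROW NO-BREAK SPACE
        int[] codePoints = {0x0020, 0x00A0, 0x2007, 0x202F};
        for (int cp : codePoints) {
            System.out.printf("U+%04X isWhitespace=%b isSpaceChar=%b%n",
                    cp, javaWhitespace(cp), unicodeSpaceChar(cp));
        }
    }
}
```

On a standard JDK this reports isWhitespace as false and isSpaceChar as true 
for U+00A0, U+2007 and U+202F, while the plain space is true for both.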

The term "non-breaking white space" explicitly refers to line breaking and has 
no mention of tokens in either Unicode or traditional casual usage.

From a Solr user perspective, there is like zero value to having NBSP from
HTML web pages being treated as if it were not traditional white space.

From a Solr user perspective, the primary use of whitespace tokenizer is to
avoid the fact that standard tokenizer breaks on various special characters
such as occur in product numbers.

One of the ongoing problems in the Solr community is the sheer amount of time 
spent explaining nuances and gotchas, even if they do happen to be documented 
somewhere in the fine print - no sane user reads the fine print anyway. No Solr
user actually uses WhitespaceTokenizer directly - they reference
WhitespaceTokenizerFactory, and expecting them to drop down to the Lucene and
Java docs is way too much to ask of a typical Solr user. Our collective goal
should be to minimize nuances and gotchas (IMHO).

In short, the benefits to Solr users for NBSP being tokenized as white space 
seem to outweigh any minor use cases for treating it as non-white space. A 
compatibility mode can be provided if those minor use cases are considered 
truly worthwhile.
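
As a rough illustration of what such a compatibility mode could look like, 
here is a plain-Java sketch of a whitespace splitter with a flag that switches 
back to the current Java behavior. It is not Lucene code: the class, the 
method names and the legacyJavaWhitespace flag are all hypothetical, and the 
separator test only approximates the Unicode definition of white space (for 
example, it does not cover U+0085).

```java
import java.util.ArrayList;
import java.util.List;

// Sketch only: not Lucene API. All names here are made up for illustration.
final class WhitespaceSplitSketch {
    // With legacyJavaWhitespace=true only Character.isWhitespace counts,
    // so NBSP stays inside tokens; otherwise space separators split too.
    static boolean isSeparator(int cp, boolean legacyJavaWhitespace) {
        if (Character.isWhitespace(cp)) {
            return true;
        }
        return !legacyJavaWhitespace && Character.isSpaceChar(cp);
    }

    static List<String> tokenize(String text, boolean legacyJavaWhitespace) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (isSeparator(cp, legacyJavaWhitespace)) {
                // End the current token, skipping runs of separators.
                if (current.length() > 0) {
                    tokens.add(current.toString());
                    current.setLength(0);
                }
            } else {
                current.appendCodePoint(cp);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "AB-123\u00A0widget";  // product number, NBSP, word
        System.out.println(tokenize(text, false).size());
        System.out.println(tokenize(text, true).size());
    }
}
```

With the flag off, a product number followed by an NBSP and a word splits into 
two tokens; with the flag on, it stays one token, as it does today.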

Ugh... there are plenty of other places in doc for other tokenizers and filters 
that refer to "whitespace" and need to address this same issue, either to treat 
NBSP as white space or doc the nuance/gotcha much more thoroughly and 
effectively.

OTOH... an alternative view... having so many un/poorly-documented nuances and 
gotchas is money in the pockets of consultants and a great argument in favor of 
Solr users maximizing the employment of Solr consultants.


was (Author: jkrupan):
+1 for using the Unicode definition of white space rather than the (odd) Java 
definition. From a Solr user perspective, the fact that Java is used for 
implementation under the hood should be irrelevant. That said, the Javadoc for 
WhitespaceTokenizer#isTokenChar does explicitly refer to isWhitespace already.

The term "non-breaking white space" explicitly refers to line breaking and has 
no mention of tokens in either Unicode or traditional casual usage.

From a Solr user perspective, there is like zero value to having NBSP from
HTML web pages being treated as if it were not traditional white space.

From a Solr user perspective, the primary use of whitespace tokenizer is to
avoid the fact that standard tokenizer breaks on various special characters
such as occur in product numbers.

In short, the benefits to Solr users for NBSP being tokenized as white space 
seem to outweigh any minor use cases for treating it as non-white space. A 
compatibility mode can be provided if those minor use cases are considered 
truly worthwhile.


> WhitespaceTokenizer should tokenize on NBSP
> -------------------------------------------
>
>                 Key: LUCENE-6874
>                 URL: https://issues.apache.org/jira/browse/LUCENE-6874
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/analysis
>            Reporter: David Smiley
>            Priority: Minor
>
> WhitespaceTokenizer uses [Character.isWhitespace 
> |http://docs.oracle.com/javase/8/docs/api/java/lang/Character.html#isWhitespace-int-]
>  to decide what is whitespace.  Here's a pertinent excerpt:
> bq. It is a Unicode space character (SPACE_SEPARATOR, LINE_SEPARATOR, or 
> PARAGRAPH_SEPARATOR) but is not also a non-breaking space ('\u00A0', 
> '\u2007', '\u202F')
> Perhaps Character.isWhitespace should have been called 
> isLineBreakableWhitespace?
> I think WhitespaceTokenizer should tokenize on this.  I am aware it's easy to 
> work around but why leave this trap in by default?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
