[jira] [Commented] (LUCENE-6874) WhitespaceTokenizer should tokenize on NBSP

David Smiley (JIRA) Mon, 02 Nov 2015 09:43:56 -0800

    [ https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14985598#comment-14985598 ]


David Smiley commented on LUCENE-6874:
--------------------------------------

bq. So maybe we should solve this problem by adding some documentation?

If the vast majority (like 90%+) of users that currently use 
WhitespaceTokenizer would want it to tokenize on NBSP, then I don't think 
documentation is sufficient at all.  A note documenting something most people 
would want to change is very easy to overlook.  That's what I call a _trap_, 
even though there might be some uses for the current behavior.  Lucene should do what 
most users want it to do by default.  As Jack said, the users of the search 
platform don't care what Java's definition of Character.isWhitespace is.

I propose that WhitespaceTokenizerFactory have a flag for this, and that the 
flag default to treating NBSP as whitespace, based on Lucene's Version.

I get Uwe's point that there are other Tokenizers.  But I disagree that 
WhitespaceTokenizer shouldn't be used for "classical full text".  For example 
StandardTokenizer tokenizes on hyphen and thus foils some of the benefit of 
WordDelimiterFilter.  Maybe ICUTokenizer is an answer; I haven't checked its 
interaction with WDF.  But why can't we have a tokenizer that simply 
tokenizes on all whitespace?
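
To make the gap concrete, here is a minimal plain-Java sketch (not Lucene code). 
The class NbspWhitespace and its methods isTokenSeparator and split are 
hypothetical names made up for illustration; only Character.isWhitespace and 
the three no-break code points come from the issue description.

```java
import java.util.ArrayList;
import java.util.List;

// Illustration only: a hypothetical helper, not part of Lucene.
final class NbspWhitespace {

    // Assumed predicate: everything Character.isWhitespace accepts, plus the
    // three no-break spaces that its Javadoc says it excludes.
    static boolean isTokenSeparator(int codePoint) {
        return Character.isWhitespace(codePoint)
                || codePoint == 0x00A0   // NO-BREAK SPACE
                || codePoint == 0x2007   // FIGURE SPACE
                || codePoint == 0x202F;  // NARROW NO-BREAK SPACE
    }

    // Splits text into tokens wherever isTokenSeparator matches,
    // dropping empty tokens between consecutive separators.
    static List<String> split(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (isTokenSeparator(cp)) {
                if (current.length() > 0) {
                    tokens.add(current.toString());
                    current.setLength(0);
                }
            } else {
                current.appendCodePoint(cp);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(Character.isWhitespace(0x00A0)); // false
        System.out.println(isTokenSeparator(0x00A0));       // true
        System.out.println(split("foo\u00A0bar baz"));      // [foo, bar, baz]
    }
}
```

With Character.isWhitespace alone, "foo", the NBSP and "bar" would stay 
together as one token; the extra code points are what split them.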

I'll have to see the links Rob just posted; I haven't read them yet.

> WhitespaceTokenizer should tokenize on NBSP
> -------------------------------------------
>
>                 Key: LUCENE-6874
>                 URL: https://issues.apache.org/jira/browse/LUCENE-6874
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/analysis
>            Reporter: David Smiley
>            Priority: Minor
>
> WhitespaceTokenizer uses [Character.isWhitespace 
> |http://docs.oracle.com/javase/8/docs/api/java/lang/Character.html#isWhitespace-int-]
>  to decide what is whitespace.  Here's a pertinent excerpt:
> bq. It is a Unicode space character (SPACE_SEPARATOR, LINE_SEPARATOR, or 
> PARAGRAPH_SEPARATOR) but is not also a non-breaking space ('\u00A0', 
> '\u2007', '\u202F')
> Perhaps Character.isWhitespace should have been called 
> isLineBreakableWhitespace?
> I think WhitespaceTokenizer should tokenize on NBSP.  I am aware it's easy to 
> work around, but why leave this trap in by default?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
