[jira] [Commented] (LUCENE-6874) WhitespaceTokenizer should tokenize on NBSP

Dawid Weiss (JIRA) Mon, 02 Nov 2015 00:23:55 -0800

    [ 
https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14984873#comment-14984873
 ]


Dawid Weiss commented on LUCENE-6874:
-------------------------------------

It depends on what you consider a trap.

A non-breaking space can be a legitimate way to keep two tokens from being 
separated when they need to be tokenized together. Other examples that come to 
mind are the special "zero-width" space and the hyphenation marker (soft 
hyphen), which poses a problem even on its own [1].

Ultimately this is probably a question of whether we want to tokenize on 
"whitespace as in formatted text" or "whitespace as in logical codepoint 
units", and it applies not only to WhitespaceTokenizer but to any tokenizer in 
general.

bq. I think WhitespaceTokenizer should tokenize on this.

It seems the majority of people would want it to be tokenized, I agree. But if 
you change this, there is no way to go back to the previous behavior. Currently 
it's relatively easy to wrap your input in a reader that replaces those 
problematic codepoints on the fly before they're fed to the tokenizer.
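
A minimal sketch of such a wrapping reader, using only java.io. The class names 
NbspToSpaceReader and NbspDemo are hypothetical (they are not part of Lucene), 
and the set of replaced codepoints is an assumption taken from the three 
non-breaking spaces listed in the Character.isWhitespace javadoc quoted in the 
issue.

```java
import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

/** Hypothetical helper: replaces non-breaking spaces with a plain space (U+0020). */
class NbspToSpaceReader extends FilterReader {

    NbspToSpaceReader(Reader in) {
        super(in);
    }

    /** True for the three codepoints that Character.isWhitespace excludes. */
    private static boolean isNonBreakingSpace(int c) {
        return c == '\u00A0' || c == '\u2007' || c == '\u202F';
    }

    @Override
    public int read() throws IOException {
        int c = super.read();
        return isNonBreakingSpace(c) ? ' ' : c;
    }

    @Override
    public int read(char[] buf, int off, int len) throws IOException {
        int n = super.read(buf, off, len);
        // At end of stream n is -1, so the loop body never runs.
        for (int i = off; i < off + n; i++) {
            if (isNonBreakingSpace(buf[i])) {
                buf[i] = ' ';
            }
        }
        return n;
    }
}

/** Hypothetical demo: drains a string through the wrapping reader. */
class NbspDemo {
    static String normalize(String text) {
        StringBuilder out = new StringBuilder();
        char[] buf = new char[256];
        try (Reader r = new NbspToSpaceReader(new StringReader(text))) {
            int n;
            while ((n = r.read(buf, 0, buf.length)) != -1) {
                out.append(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }
}
```

Because each replacement is one char for one char, character offsets in the 
input are unchanged. In real use the wrapped reader would be handed to the 
tokenizer in place of the original input.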

[1] https://www.cs.tut.fi/~jkorpela/shy.html

> WhitespaceTokenizer should tokenize on NBSP
> -------------------------------------------
>
>                 Key: LUCENE-6874
>                 URL: https://issues.apache.org/jira/browse/LUCENE-6874
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/analysis
>            Reporter: David Smiley
>            Priority: Minor
>
> WhitespaceTokenizer uses [Character.isWhitespace 
> |http://docs.oracle.com/javase/8/docs/api/java/lang/Character.html#isWhitespace-int-]
>  to decide what is whitespace.  Here's a pertinent excerpt:
> bq. It is a Unicode space character (SPACE_SEPARATOR, LINE_SEPARATOR, or 
> PARAGRAPH_SEPARATOR) but is not also a non-breaking space ('\u00A0', 
> '\u2007', '\u202F')
> Perhaps Character.isWhitespace should have been called 
> isLineBreakableWhitespace?
> I think WhitespaceTokenizer should tokenize on this.  I am aware it's easy to 
> work around but why leave this trap in by default?



