[ 
https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14985749#comment-14985749
 ] 

Uwe Schindler commented on LUCENE-6874:
---------------------------------------

bq. Then we could consider deprecating WhitespaceTokenizer since, after all, 
why would one use it when ICUWhitespaceTokenizer exists?

Because the non-breaking space is useful for cases (as explained above) where 
you want to keep tokens together. The Unicode standard talks about line 
wrapping, but in any case, like soft hyphen vs. hyphen, it's just a matter of 
what you want to do: the NBSP just tells the tokenizer (or line-breaker, or 
whatever you call it) to keep tokens together. The problem is people who misuse 
{{&nbsp;}} in their HTML shit (e.g. tables). But for stuff I have implemented 
very often, I used WhitespaceTokenizer to split tokens and placed a 
non-breaking space to keep tokens together.
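To illustrate that behavior, here is a minimal plain-JDK sketch (not Lucene's actual implementation) of splitting on exactly what {{Character.isWhitespace()}} accepts, so an NBSP glues the adjacent tokens together:

```java
import java.util.ArrayList;
import java.util.List;

public class NbspSketch {
    // Split like WhitespaceTokenizer does, i.e. on Character.isWhitespace().
    // NBSP (\u00A0) is NOT whitespace by that definition, so it never splits.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (Character.isWhitespace(cp)) {   // token boundary
                if (current.length() > 0) {
                    tokens.add(current.toString());
                    current.setLength(0);
                }
            } else {
                current.appendCodePoint(cp);    // NBSP ends up in here
            }
            i += Character.charCount(cp);
        }
        if (current.length() > 0) tokens.add(current.toString());
        return tokens;
    }

    public static void main(String[] args) {
        // "New\u00A0York" stays one token; the plain spaces split.
        System.out.println(tokenize("New\u00A0York is big"));
    }
}
```

"New\u00A0York" comes back as a single token, which is exactly the "keep tokens together" use case.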

So there is no need to deprecate WhitespaceTokenizer. It does what it should 
do. ICUWhitespaceTokenizer uses the same naming and does the same, just with 
different rules.

bq. ... or consider a MappingCharFilter

This thing is slow as hell. If you want it faster, use e.g. 
PatternTokenizer.
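A rough sketch of the regex approach (plain {{java.util.regex}}, not the actual PatternTokenizer class): split on Java whitespace plus the three non-breaking spaces listed explicitly, since Java's {{\s}} is ASCII-only by default:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class PatternSplitSketch {
    // \s alone (without UNICODE_CHARACTER_CLASS) only matches ASCII whitespace,
    // so the three non-breaking spaces are added explicitly to the class.
    static final Pattern WS = Pattern.compile("[\\s\\u00A0\\u2007\\u202F]+");

    static List<String> tokenize(String text) {
        List<String> out = new ArrayList<>();
        for (String t : WS.split(text)) {
            if (!t.isEmpty()) out.add(t);   // drop empty leading split result
        }
        return out;
    }

    public static void main(String[] args) {
        // Here "New\u00A0York" DOES split, unlike with WhitespaceTokenizer.
        System.out.println(tokenize("New\u00A0York is big"));
    }
}
```

With this pattern the NBSP is a separator, which is the behavior the issue asks for.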

bq. It would be so nice to not need it, even if its internal implementation 
seems fast

The problem is that you need additional 4-way branching: you have to check in 
{{isTokenChar()}} that the char is not whitespace ({{!isWhitespace()}}) and 
also exclude all those 3 chars we listed in the description: {{'\u00A0', 
'\u2007', '\u202F'}}.
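That 4-way check would look roughly like this (a sketch, not the supplied patch):

```java
public class NaiveCheck {
    // Naive isTokenChar(): one isWhitespace() call plus three explicit
    // exclusions for the non-breaking spaces -- four branches per character.
    static boolean isTokenChar(int c) {
        return !Character.isWhitespace(c)
            && c != '\u00A0'   // NO-BREAK SPACE
            && c != '\u2007'   // FIGURE SPACE
            && c != '\u202F';  // NARROW NO-BREAK SPACE
    }
}
```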

I agree with Robert: we should not change the default WhitespaceTokenizer and 
also not deprecate it. We should add a new one, which I did in the supplied 
patch. If we want it in core, let's give it a different name and implement 
isTokenChar in a fast way, without the 3 additional branches.
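One possible way to avoid the three extra branches (my own sketch, not what the patch does, and whether it is actually faster would need benchmarking): {{Character.isSpaceChar()}} matches the SPACE_SEPARATOR/LINE_SEPARATOR/PARAGRAPH_SEPARATOR categories, which include the three non-breaking spaces, so combining it with {{isWhitespace()}} covers everything in two JDK calls:

```java
public class FastCheck {
    // isSpaceChar() matches the Zs/Zl/Zp categories (including the three
    // non-breaking spaces); isWhitespace() adds the control-character gaps
    // like '\t' and '\n' that are not in those categories.
    static boolean isTokenChar(int c) {
        return !(Character.isWhitespace(c) || Character.isSpaceChar(c));
    }
}
```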

bq. I beg to differ on WDF

This comes from the fact that Solr is often misused because users just give up 
thinking about tokenization. WDF only makes sense in product catalogues, but 
it is definitely broken for fulltext. The product catalogues are of course some 
of our customers, but before I suggest to them that they should use 
WhitespaceTokenizer with WordDestroyerFilter, I would analyze their root 
problem (why their tokenization is broken). This is why I am against the broken 
example configs in Solr that we had in the past. WST and WDF should really 
only be used as a last resort.

> WhitespaceTokenizer should tokenize on NBSP
> -------------------------------------------
>
>                 Key: LUCENE-6874
>                 URL: https://issues.apache.org/jira/browse/LUCENE-6874
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/analysis
>            Reporter: David Smiley
>            Priority: Minor
>         Attachments: LUCENE-6874.patch
>
>
> WhitespaceTokenizer uses [Character.isWhitespace 
> |http://docs.oracle.com/javase/8/docs/api/java/lang/Character.html#isWhitespace-int-]
>  to decide what is whitespace.  Here's a pertinent excerpt:
> bq. It is a Unicode space character (SPACE_SEPARATOR, LINE_SEPARATOR, or 
> PARAGRAPH_SEPARATOR) but is not also a non-breaking space ('\u00A0', 
> '\u2007', '\u202F')
> Perhaps Character.isWhitespace should have been called 
> isLineBreakableWhitespace?
> I think WhitespaceTokenizer should tokenize on this.  I am aware it's easy to 
> work around but why leave this trap in by default?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
