[ https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14985749#comment-14985749 ]
Uwe Schindler commented on LUCENE-6874:
---------------------------------------

bq. Then we could consider deprecating WhitespaceTokenizer since, after all, why would one use it when ICUWhitespaceTokenizer exists?

Because the non-breaking space is useful for cases (as explained above) where you want to keep tokens together. Although the Unicode standard speaks about line wrapping, in any case, like soft hyphen vs. hyphen, it is just a matter of what you want to do: the NBSP just tells the tokenizer (or line breaker, or whatever you call it) to keep tokens together. The problem is people who misuse the non-breaking space in their HTML (e.g. tables). But in stuff I have implemented very often, I used WhitespaceTokenizer to split tokens and placed a non-breaking space to keep tokens together. So there is no need to deprecate WhitespaceTokenizer. It does what it should do. ICUWhitespaceTokenizer uses the same naming and does the same, just with different rules.

bq. ... or consider a MappingCharFilter

This thing is slow as hell. If you want it faster, use e.g. PatternTokenizer.

bq. It would be so nice to not need it, even if it's internal implementation seems fast

The problem is that you need additional 4-way branching: you have to check in {{isTokenChar()}} that the character is not whitespace and also exclude the three characters we listed in the description: {{'\u00A0', '\u2007', '\u202F'}}. I agree with Robert: we should not change the default WhitespaceTokenizer, and we should not deprecate it either. We should add a new one, which I did in the supplied patch. If we want it in core, let's give it a different name and implement isTokenChar in a fast way without 3 additional branches.

bq. I beg to differ on WDF

This comes from the fact that Solr is often misused because users just give up thinking about tokenization. WDF only makes sense in product catalogues; it is definitely broken for fulltext.
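A minimal sketch of the extra branching described above, as plain Java. The class and method names are illustrative only, not the actual patch attached to this issue; a real Lucene tokenizer would put this logic in a CharTokenizer subclass:

```java
// Hypothetical sketch of the check a NBSP-splitting whitespace tokenizer
// would need in isTokenChar(); names are illustrative, not from the patch.
public class NbspTokenCheck {

    // A character is part of a token if it is neither ordinary whitespace
    // nor one of the three non-breaking spaces that Character.isWhitespace()
    // deliberately excludes.
    static boolean isTokenChar(int c) {
        return !Character.isWhitespace(c)
            && c != '\u00A0'   // NO-BREAK SPACE
            && c != '\u2007'   // FIGURE SPACE
            && c != '\u202F';  // NARROW NO-BREAK SPACE
    }

    public static void main(String[] args) {
        // NBSP is not whitespace for Character.isWhitespace() ...
        System.out.println(Character.isWhitespace('\u00A0')); // prints false
        // ... but the extra branches still reject it as a token character.
        System.out.println(isTokenChar('\u00A0'));            // prints false
        System.out.println(isTokenChar('x'));                 // prints true
    }
}
```

These are exactly the 3 additional branches the comment wants to avoid in a core implementation.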
The product catalogues are of course some of our customers, but before I suggest that they use WhitespaceTokenizer with WordDestroyerFilter, I would analyze their root problem (why is their tokenization broken). This is why I am against the broken example configs we had in Solr in the past. WST and WDF should really only be used as a last resort.

> WhitespaceTokenizer should tokenize on NBSP
> -------------------------------------------
>
>                 Key: LUCENE-6874
>                 URL: https://issues.apache.org/jira/browse/LUCENE-6874
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/analysis
>            Reporter: David Smiley
>            Priority: Minor
>         Attachments: LUCENE-6874.patch
>
> WhitespaceTokenizer uses [Character.isWhitespace|http://docs.oracle.com/javase/8/docs/api/java/lang/Character.html#isWhitespace-int-] to decide what is whitespace. Here's a pertinent excerpt:
> bq. It is a Unicode space character (SPACE_SEPARATOR, LINE_SEPARATOR, or PARAGRAPH_SEPARATOR) but is not also a non-breaking space ('\u00A0', '\u2007', '\u202F')
> Perhaps Character.isWhitespace should have been called isLineBreakableWhitespace?
> I think WhitespaceTokenizer should tokenize on this. I am aware it's easy to work around but why leave this trap in by default?

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
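As a quick illustration of the trap the issue description mentions (the input string here is assumed for demonstration, not taken from the issue): Java's `\s` regex class mirrors `Character.isWhitespace` in excluding the non-breaking spaces, so NBSP stays glued inside a token unless the three characters are listed explicitly.

```java
import java.util.Arrays;

public class NbspSplitDemo {
    public static void main(String[] args) {
        String text = "foo\u00A0bar baz"; // NBSP between foo and bar

        // Plain \s does not match the non-breaking spaces, so this yields
        // only two tokens: "foo\u00A0bar" and "baz".
        System.out.println(Arrays.toString(text.split("\\s+")));

        // Adding the three non-breaking spaces to the character class
        // yields three tokens: "foo", "bar", "baz".
        System.out.println(Arrays.toString(
            text.split("[\\s\\u00A0\\u2007\\u202F]+")));
    }
}
```

This is the same behavior (and workaround) that WhitespaceTokenizer inherits from `Character.isWhitespace`.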