[ https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14985598#comment-14985598 ]
David Smiley commented on LUCENE-6874:
--------------------------------------

bq. So maybe we should solve this problem by adding some documentation?

If the vast majority (like 90%+) of users that currently use WhitespaceTokenizer would want it to tokenize on NBSP, then I don't think documentation is sufficient at all. Documentation for something most people would want to change is very easy to overlook. That's what I call a _trap_; that's not to say there aren't some uses for the current behavior. Lucene should do what most users want it to do by default. As Jack said, the users of the search platform don't care what Java's definition of Character.isWhitespace is. I propose WhitespaceTokenizerFactory have a flag for this, and that the flag default to treating NBSP as a space based on Lucene's Version.

I get Uwe's point that there are other Tokenizers. But I disagree that WhitespaceTokenizer shouldn't be used for "classical full text". For example, StandardTokenizer tokenizes on hyphen and thus foils some of the benefit of WordDelimiterFilter. Maybe ICUTokenizer is an answer; I haven't checked its interaction with WDF. But why can't we just have a tokenizer that simply tokenizes on all whitespace?

I'll have to see the links Rob just posted; I haven't read them yet.

> WhitespaceTokenizer should tokenize on NBSP
> -------------------------------------------
>
>                 Key: LUCENE-6874
>                 URL: https://issues.apache.org/jira/browse/LUCENE-6874
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/analysis
>            Reporter: David Smiley
>            Priority: Minor
>
> WhitespaceTokenizer uses
> [Character.isWhitespace|http://docs.oracle.com/javase/8/docs/api/java/lang/Character.html#isWhitespace-int-]
> to decide what is whitespace. Here's a pertinent excerpt:
> bq.
> It is a Unicode space character (SPACE_SEPARATOR, LINE_SEPARATOR, or
> PARAGRAPH_SEPARATOR) but is not also a non-breaking space ('\u00A0',
> '\u2007', '\u202F')
> Perhaps Character.isWhitespace should have been called
> isLineBreakableWhitespace?
> I think WhitespaceTokenizer should tokenize on this. I am aware it's easy to
> work around but why leave this trap in by default?
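
For anyone who wants to see the behavior described in the issue, here is a minimal sketch using only JDK calls. It is not Lucene code and not a proposed patch: the class NbspWhitespaceDemo and the helpers isTokenizableWhitespace and split are hypothetical names made up for illustration. The idea is simply Character.isWhitespace plus the three non-breaking spaces that the Javadoc excerpt excludes; presumably the negation of such a predicate is what a CharTokenizer subclass would return from isTokenChar, but that part is an assumption and is not shown here.

```java
import java.util.ArrayList;
import java.util.List;

// Illustration only: hypothetical class, not part of Lucene.
final class NbspWhitespaceDemo {

    // Hypothetical helper: Character.isWhitespace plus the three
    // non-breaking spaces that the JDK method explicitly excludes.
    static boolean isTokenizableWhitespace(int codePoint) {
        return Character.isWhitespace(codePoint)
            || codePoint == 0x00A0   // NO-BREAK SPACE
            || codePoint == 0x2007   // FIGURE SPACE
            || codePoint == 0x202F;  // NARROW NO-BREAK SPACE
    }

    // Splits text on every code point the helper treats as whitespace,
    // dropping empty tokens.
    static List<String> split(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (isTokenizableWhitespace(cp)) {
                if (current.length() > 0) {
                    tokens.add(current.toString());
                    current.setLength(0);
                }
            } else {
                current.appendCodePoint(cp);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(Character.isWhitespace(0x00A0)); // false
        System.out.println(Character.isSpaceChar(0x00A0));  // true
        System.out.println(split("foo\u00A0bar baz"));      // [foo, bar, baz]
    }
}
```

Note that Character.isSpaceChar does return true for NBSP, but it is not a drop-in replacement either, since it returns false for tab and newline; that is why the sketch extends isWhitespace instead of swapping it out.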