[ 
https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Smiley updated LUCENE-6874:
---------------------------------
    Attachment: LUCENE_6874_jflex.patch

Here's an updated patch working from Steve's:
* The existing WhitespaceTokenizerFactory now accepts a "rule" parameter that 
selects either the "unicode" rule or the "java" rule (the default).  It also 
accepts the maxTokenLength parameter.  With the "java" rule, maxTokenLength, 
if specified, must be 255.  The UnicodeWhitespaceTokenizerFactory was removed 
since its functionality is now folded into this factory.
** Added a simple testFactory test for this factory.
* Tweaked the javadocs as I mentioned.
* Removed some of the test methods Steve added that were performance 
measurements rather than tests (and that also wrote to stderr).
* Resolved various pre-commit issues (ASL header, svn props)
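
To make the parameter rules in the first bullet concrete, here is a minimal plain-Java sketch of the validation they describe.  It is not the code from the patch: the class name WhitespaceRuleArgs, its fields, and the choice of 255 as the default maxTokenLength for both rules are assumptions made only for illustration.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for the factory's argument handling; not Lucene code. */
public class WhitespaceRuleArgs {
    static final String RULE_JAVA = "java";
    static final String RULE_UNICODE = "unicode";
    // Assumption: 255 is used as the default for both rules in this sketch.
    static final int DEFAULT_MAX_TOKEN_LENGTH = 255;

    final String rule;
    final int maxTokenLength;

    WhitespaceRuleArgs(Map<String, String> args) {
        // "java" is the default rule when none is given.
        String r = args.getOrDefault("rule", RULE_JAVA);
        if (!r.equals(RULE_JAVA) && !r.equals(RULE_UNICODE)) {
            throw new IllegalArgumentException("rule must be 'java' or 'unicode', got: " + r);
        }
        int max = Integer.parseInt(
            args.getOrDefault("maxTokenLength", Integer.toString(DEFAULT_MAX_TOKEN_LENGTH)));
        // With the "java" rule, the only permitted maxTokenLength is 255.
        if (r.equals(RULE_JAVA) && max != DEFAULT_MAX_TOKEN_LENGTH) {
            throw new IllegalArgumentException(
                "maxTokenLength must be 255 with the 'java' rule, got: " + max);
        }
        this.rule = r;
        this.maxTokenLength = max;
    }

    public static void main(String[] unused) {
        Map<String, String> args = new HashMap<>();
        args.put("rule", "unicode");
        args.put("maxTokenLength", "100");
        WhitespaceRuleArgs parsed = new WhitespaceRuleArgs(args);
        System.out.println(parsed.rule + " " + parsed.maxTokenLength);
    }
}
```

In this sketch, an empty argument map yields the "java" rule, while combining rule=java with any maxTokenLength other than 255 is rejected with an IllegalArgumentException.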

If I hear no further feedback, I plan to commit Tuesday night (EST).

> WhitespaceTokenizer should tokenize on NBSP
> -------------------------------------------
>
>                 Key: LUCENE-6874
>                 URL: https://issues.apache.org/jira/browse/LUCENE-6874
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/analysis
>            Reporter: David Smiley
>            Priority: Minor
>         Attachments: LUCENE-6874-jflex.patch, LUCENE-6874.patch, 
> LUCENE_6874_jflex.patch
>
>
> WhitespaceTokenizer uses [Character.isWhitespace 
> |http://docs.oracle.com/javase/8/docs/api/java/lang/Character.html#isWhitespace-int-]
>  to decide what is whitespace.  Here's a pertinent excerpt:
> bq. It is a Unicode space character (SPACE_SEPARATOR, LINE_SEPARATOR, or 
> PARAGRAPH_SEPARATOR) but is not also a non-breaking space ('\u00A0', 
> '\u2007', '\u202F')
> Perhaps Character.isWhitespace should have been called 
> isLineBreakableWhitespace?
> I think WhitespaceTokenizer should tokenize on these non-breaking spaces as 
> well.  I am aware it's easy to work around, but why leave this trap in by 
> default?
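
The behavior described in the excerpt above can be checked with the JDK alone.  The sketch below is only an illustration: the class NbspWhitespaceDemo, its split helper, and the idea of widening the check with Character.isSpaceChar are assumptions, not the rule the attached patches implement.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntPredicate;

/** Hypothetical demo of how Character.isWhitespace treats non-breaking spaces. */
public class NbspWhitespaceDemo {
    /** The three code points the isWhitespace javadoc excludes. */
    static final int[] NON_BREAKING_SPACES = {0x00A0, 0x2007, 0x202F};

    /** One possible wider check: Java whitespace plus any Unicode space character. */
    static boolean isWhitespaceOrNbsp(int codePoint) {
        return Character.isWhitespace(codePoint) || Character.isSpaceChar(codePoint);
    }

    /** Splits text into tokens wherever the predicate marks a code point as a separator. */
    static List<String> split(String text, IntPredicate isSeparator) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (isSeparator.test(cp)) {
                // A separator ends the current token, if one is in progress.
                if (current.length() > 0) {
                    tokens.add(current.toString());
                    current.setLength(0);
                }
            } else {
                current.appendCodePoint(cp);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        for (int cp : NON_BREAKING_SPACES) {
            System.out.printf("U+%04X isWhitespace=%b isSpaceChar=%b%n",
                cp, Character.isWhitespace(cp), Character.isSpaceChar(cp));
        }
        String text = "foo\u00A0bar baz";
        System.out.println(split(text, Character::isWhitespace).size());
        System.out.println(split(text, NbspWhitespaceDemo::isWhitespaceOrNbsp).size());
    }
}
```

With Character::isWhitespace as the separator test, "foo" and "bar" joined by U+00A0 stay together as a single token; with the wider check they are split apart.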



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

