[jira] [Commented] (LUCENE-6874) WhitespaceTokenizer should tokenize on NBSP

Steve Rowe (JIRA) Wed, 11 Nov 2015 12:12:50 -0800

    [ https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15001000#comment-15001000 ]


Steve Rowe commented on LUCENE-6874:
------------------------------------

bq. My idea was to use a Unicode data file and extract all Whitespace 
characters in a build tool. Shipping with the Unicode data file would be a 
large overhead.

The JFlex project has a similar requirement, but for many more properties than 
just Whitespace.  JFlex includes a Maven plugin used by the build that parses 
Unicode data files via (you guessed it) JFlex scanners - here's the JFlex spec 
for the parser for binary property data files, including {{PropList.txt}}, 
which holds the Whitespace property definition: 
https://github.com/jflex-de/jflex/blob/master/jflex-unicode-maven-plugin/src/main/jflex/BinaryPropertiesFileScanner.flex
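
To make the build-time extraction idea concrete, here is a minimal sketch in plain Java. It is not the JFlex scanner linked above and not taken from any of the attached patches. It assumes the usual {{PropList.txt}} line layout ({{start[..end] ; Property # comment}}); the class name {{WhiteSpaceRanges}} and its methods are hypothetical, and the sample lines in {{main}} only mimic that layout.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: collects the code point ranges tagged White_Space
// from lines laid out like PropList.txt ("start[..end] ; Property # comment").
final class WhiteSpaceRanges {

    /** Returns inclusive {start, end} code point pairs for White_Space lines. */
    static List<int[]> parse(Iterable<String> lines) {
        List<int[]> ranges = new ArrayList<>();
        for (String line : lines) {
            // Drop the trailing comment, then skip blank and comment-only lines.
            int hash = line.indexOf('#');
            String data = (hash >= 0 ? line.substring(0, hash) : line).trim();
            if (data.isEmpty()) {
                continue;
            }
            String[] fields = data.split(";");
            if (fields.length < 2 || !fields[1].trim().equals("White_Space")) {
                continue;
            }
            // The first field is either a single code point or "start..end", in hex.
            String[] bounds = fields[0].trim().split("\\.\\.");
            int start = Integer.parseInt(bounds[0], 16);
            int end = bounds.length > 1 ? Integer.parseInt(bounds[1], 16) : start;
            ranges.add(new int[] {start, end});
        }
        return ranges;
    }

    /** True if the code point falls inside any of the parsed ranges. */
    static boolean contains(List<int[]> ranges, int codePoint) {
        for (int[] range : ranges) {
            if (codePoint >= range[0] && codePoint <= range[1]) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Sample input in the PropList.txt layout; a real build step would read the file.
        List<String> sample = Arrays.asList(
            "# sample lines only",
            "0009..000D    ; White_Space # Cc   [5] <control-0009>..<control-000D>",
            "0020          ; White_Space # Zs       SPACE",
            "00A0          ; White_Space # Zs       NO-BREAK SPACE",
            "061C          ; Bidi_Control # Cf       ARABIC LETTER MARK");
        List<int[]> ranges = parse(sample);
        System.out.println("U+00A0 in White_Space ranges: " + contains(ranges, 0x00A0));
        System.out.println("Character.isWhitespace(U+00A0): " + Character.isWhitespace(0x00A0));
    }
}
```

With the sample input, U+00A0 is found in the parsed ranges while {{Character.isWhitespace}} rejects it, which is the discrepancy this issue is about. A build tool could emit the parsed ranges as a generated table, so the data file itself would not need to ship.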
 

Note: Unicode property names can have aliases, and "loose" matching is the 
recommended way to refer to them (see 
http://unicode.org/reports/tr18/#Categories ): match case-insensitively, and 
ignore whitespace, dashes, and underscores.  {{PropList.txt}} gives the 
Whitespace property name as {{White_Space}}, and {{PropertyAliases.txt}} lists 
{{WSpace}} and {{space}} as aliases.
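
As a rough illustration of the loose matching described above, the following Java sketch normalizes a property name by lowercasing it and dropping whitespace, dashes, and underscores, then compares it against the three names mentioned in this comment. The class {{LoosePropertyName}} and its methods are hypothetical, and the alias set is limited to the names listed here rather than read from {{PropertyAliases.txt}}.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper illustrating loose matching of Unicode property names.
final class LoosePropertyName {

    // White_Space plus the aliases WSpace and space, stored in normalized form.
    private static final Set<String> WHITESPACE_NAMES =
        new HashSet<>(Arrays.asList("whitespace", "wspace", "space"));

    /** Lowercases the name and removes whitespace, dashes, and underscores. */
    static String normalize(String name) {
        StringBuilder sb = new StringBuilder(name.length());
        for (int i = 0; i < name.length(); i++) {
            char c = name.charAt(i);
            if (Character.isWhitespace(c) || c == '-' || c == '_') {
                continue;
            }
            sb.append(Character.toLowerCase(c));
        }
        return sb.toString();
    }

    /** True if the name loosely matches White_Space or one of its listed aliases. */
    static boolean isWhitespaceProperty(String name) {
        return WHITESPACE_NAMES.contains(normalize(name));
    }

    public static void main(String[] args) {
        for (String name : new String[] {"White_Space", "white-space", "WSPACE", "Space", "Dash"}) {
            System.out.println(name + " -> " + isWhitespaceProperty(name));
        }
    }
}
```

Normalizing once and comparing against pre-normalized names keeps the lookup a plain set membership test, so spellings such as {{white-space}} or {{WSPACE}} resolve to the same property.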

> WhitespaceTokenizer should tokenize on NBSP
> -------------------------------------------
>
>                 Key: LUCENE-6874
>                 URL: https://issues.apache.org/jira/browse/LUCENE-6874
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/analysis
>            Reporter: David Smiley
>            Priority: Minor
>         Attachments: LUCENE-6874-jflex.patch, LUCENE-6874.patch, 
> LUCENE_6874_jflex.patch
>
>
> WhitespaceTokenizer uses [Character.isWhitespace|http://docs.oracle.com/javase/8/docs/api/java/lang/Character.html#isWhitespace-int-] 
> to decide what is whitespace.  Here's a pertinent excerpt:
> bq. It is a Unicode space character (SPACE_SEPARATOR, LINE_SEPARATOR, or 
> PARAGRAPH_SEPARATOR) but is not also a non-breaking space ('\u00A0', 
> '\u2007', '\u202F')
> Perhaps Character.isWhitespace should have been called 
> isLineBreakableWhitespace?
> I think WhitespaceTokenizer should tokenize on this.  I am aware it's easy to 
> work around but why leave this trap in by default?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
