[
https://issues.apache.org/jira/browse/LUCENE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15001000#comment-15001000
]
Steve Rowe commented on LUCENE-6874:
------------------------------------
bq. My idea was to use a Unicode data file and extract all Whitespace
characters in a build tool. Shipping with the Unicode data file would be a
large overhead.
The JFlex project has a similar requirement, but for many more properties than
just Whitespace. JFlex includes a Maven plugin used by the build that parses
Unicode data files via (you guessed it) JFlex scanners - here's the JFlex spec
for the parser for binary property data files, including {{PropList.txt}},
which holds the Whitespace property definition:
https://github.com/jflex-de/jflex/blob/master/jflex-unicode-maven-plugin/src/main/jflex/BinaryPropertiesFileScanner.flex
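For illustration only, here is a minimal plain-Java sketch of the same idea: pulling the code point ranges for one binary property out of lines in the {{PropList.txt}} layout ({{XXXX[..YYYY] ; Property_Name # comment}}). It is not the JFlex scanner linked above; the class name {{PropListSketch}} and the method {{rangesFor}} are hypothetical, and the property name is compared exactly rather than loosely.

```java
import java.util.ArrayList;
import java.util.List;

public class PropListSketch {
    /**
     * Returns inclusive [start, end] code point ranges for the given property.
     * Assumes each data line looks like "XXXX[..YYYY] ; Property_Name # comment".
     */
    static List<int[]> rangesFor(String property, List<String> lines) {
        List<int[]> ranges = new ArrayList<>();
        for (String line : lines) {
            // Drop the trailing comment, then skip blank and comment-only lines.
            int hash = line.indexOf('#');
            String data = (hash >= 0 ? line.substring(0, hash) : line).trim();
            if (data.isEmpty()) {
                continue;
            }
            String[] fields = data.split(";");
            if (fields.length < 2 || !fields[1].trim().equals(property)) {
                continue;
            }
            // The first field is either a single code point or "start..end", in hex.
            String[] bounds = fields[0].trim().split("\\.\\.");
            int start = Integer.parseInt(bounds[0], 16);
            int end = bounds.length > 1 ? Integer.parseInt(bounds[1], 16) : start;
            ranges.add(new int[] {start, end});
        }
        return ranges;
    }
}
```

A build step along these lines could emit the ranges as a generated constant table, so the data file itself would not need to ship.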
Note: Unicode property names can have aliases, and "loose" matching is the
recommended way to refer to them (see
http://unicode.org/reports/tr18/#Categories ): match case-insensitively, and
ignore whitespace, dashes, and underscores. {{PropList.txt}} gives the
Whitespace property name as {{White_Space}}, and {{PropertyAliases.txt}} lists
{{WSpace}} and {{space}} as aliases.
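A rough sketch of loose matching as described above, assuming the alias table is filled by hand with just the three names mentioned here (a real tool would presumably load it from {{PropertyAliases.txt}}). The class {{LooseMatchSketch}} and its methods are hypothetical names, not part of JFlex or Lucene.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

public class LooseMatchSketch {
    // Maps the loosely normalized form of each known name to the canonical name.
    private static final Map<String, String> ALIASES = new HashMap<>();

    static {
        // Only the names mentioned in this comment; not the full alias list.
        for (String alias : new String[] {"White_Space", "WSpace", "space"}) {
            ALIASES.put(normalize(alias), "White_Space");
        }
    }

    /** Lowercases the name and strips whitespace, dashes, and underscores. */
    static String normalize(String name) {
        return name.replaceAll("[\\s_-]", "").toLowerCase(Locale.ROOT);
    }

    /** Returns the canonical property name, or null if the name is unknown. */
    static String canonicalName(String name) {
        return ALIASES.get(normalize(name));
    }
}
```

With this, {{canonicalName("white-space")}}, {{canonicalName("WSPACE")}}, and {{canonicalName(" Space ")}} all resolve to {{White_Space}}.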
> WhitespaceTokenizer should tokenize on NBSP
> -------------------------------------------
>
> Key: LUCENE-6874
> URL: https://issues.apache.org/jira/browse/LUCENE-6874
> Project: Lucene - Core
> Issue Type: Improvement
> Components: modules/analysis
> Reporter: David Smiley
> Priority: Minor
> Attachments: LUCENE-6874-jflex.patch, LUCENE-6874.patch,
> LUCENE_6874_jflex.patch
>
>
> WhitespaceTokenizer uses [Character.isWhitespace
> |http://docs.oracle.com/javase/8/docs/api/java/lang/Character.html#isWhitespace-int-]
> to decide what is whitespace. Here's a pertinent excerpt:
> bq. It is a Unicode space character (SPACE_SEPARATOR, LINE_SEPARATOR, or
> PARAGRAPH_SEPARATOR) but is not also a non-breaking space ('\u00A0',
> '\u2007', '\u202F')
> Perhaps Character.isWhitespace should have been called
> isLineBreakableWhitespace?
> I think WhitespaceTokenizer should tokenize on this. I am aware it's easy to
> work around but why leave this trap in by default?
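To make the quoted behavior concrete, here is a small sketch of a predicate that also accepts the three non-breaking spaces named in the Javadoc excerpt. It only illustrates the kind of workaround mentioned in the description, not necessarily what the attached patches do; {{WhitespaceCheckSketch}} and {{isWhitespaceIncludingNoBreak}} are hypothetical names.

```java
public class WhitespaceCheckSketch {
    /**
     * Character.isWhitespace plus the three non-breaking spaces it excludes:
     * U+00A0 (NO-BREAK SPACE), U+2007 (FIGURE SPACE), U+202F (NARROW NO-BREAK SPACE).
     */
    static boolean isWhitespaceIncludingNoBreak(int codePoint) {
        return Character.isWhitespace(codePoint)
            || codePoint == 0x00A0
            || codePoint == 0x2007
            || codePoint == 0x202F;
    }
}
```

{{Character.isWhitespace(0x00A0)}} returns false, while the predicate above returns true for the same code point.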