[
https://issues.apache.org/jira/browse/LUCENE-2847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12978141#action_12978141
]
Steven Rowe commented on LUCENE-2847:
-------------------------------------
bq. We could also consolidate tools, because in general I would rather all the
analyzers be consolidated; they are only split up due to dependencies/large
files etc. But tools are different, it's just to assist the build.
How far would you go with this tools consolidation? All tools across the whole
of Scenolunr? Or just the ones under {{modules/analysis/}}?
> Support all of unicode in StandardTokenizer
> -------------------------------------------
>
> Key: LUCENE-2847
> URL: https://issues.apache.org/jira/browse/LUCENE-2847
> Project: Lucene - Java
> Issue Type: Bug
> Components: Analysis
> Reporter: Robert Muir
> Fix For: 3.1, 4.0
>
> Attachments: LUCENE-2847.patch, LUCENE-2847.patch, LUCENE-2847.patch
>
>
> StandardTokenizer currently only supports the BMP.
> If it encounters characters outside of the BMP, it just discards them...
> it should instead fully implement UAX#29 across all of Unicode.
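
For readers unfamiliar with the BMP distinction in the issue description, here is a minimal Java sketch. It is not taken from StandardTokenizer or from the attached patches. The class BmpDemo and its two helpers are hypothetical names made up for illustration, and dropSupplementary only imitates the reported symptom (characters outside the BMP silently disappearing). It shows that a supplementary character occupies two Java chars (a surrogate pair), so code that walks a string char by char cannot treat it as a single character.

```java
class BmpDemo {
    // Hypothetical helper: keeps only BMP code points, imitating the
    // symptom described in the issue. This is not Lucene code.
    static String dropSupplementary(String text) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (!Character.isSupplementaryCodePoint(cp)) {
                sb.appendCodePoint(cp);
            }
            // Advance by 1 char for BMP code points, 2 for supplementary ones.
            i += Character.charCount(cp);
        }
        return sb.toString();
    }

    // Hypothetical helper: counts code points above U+FFFF.
    static int countSupplementary(String text) {
        int count = 0;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (Character.isSupplementaryCodePoint(cp)) {
                count++;
            }
            i += Character.charCount(cp);
        }
        return count;
    }

    public static void main(String[] args) {
        // U+10330 (GOTHIC LETTER AHSA) lies outside the BMP.
        String text = "a" + new String(Character.toChars(0x10330)) + "b";
        System.out.println(text.length());                         // 4 chars
        System.out.println(text.codePointCount(0, text.length())); // 3 code points
        System.out.println(countSupplementary(text));              // 1
        System.out.println(dropSupplementary(text));               // ab
    }
}
```

Iterating by code point, as both helpers do, is the usual way to handle the full Unicode range in Java strings; whether the patches take this approach is not stated in this message.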