[jira] Commented: (LUCENE-2847) Support all of unicode in StandardTokenizer

Steven Rowe (JIRA) Wed, 05 Jan 2011 21:22:15 -0800

    [ 
https://issues.apache.org/jira/browse/LUCENE-2847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12978141#action_12978141
 ]


Steven Rowe commented on LUCENE-2847:
-------------------------------------

bq. We could also consolidate tools, because in general I would rather all the 
analyzers be consolidated; they are only split up due to dependencies, large 
files, etc. But tools are different: they are just there to assist the build.

How far would you go with this tools consolidation?  All tools across the whole 
of Scenolunr?  Or just the ones under {{modules/analysis/}}?

> Support all of unicode in StandardTokenizer
> -------------------------------------------
>
>                 Key: LUCENE-2847
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2847
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Analysis
>            Reporter: Robert Muir
>             Fix For: 3.1, 4.0
>
>         Attachments: LUCENE-2847.patch, LUCENE-2847.patch, LUCENE-2847.patch
>
>
> StandardTokenizer currently only supports the BMP.
> If it encounters characters outside of the BMP, it just discards them... 
> it should instead fully implement UAX#29 across all of Unicode.
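
For readers unfamiliar with the BMP limitation described above, here is a minimal sketch in plain Java of why characters outside the BMP need special handling. It is not taken from the attached patches and does not use any Lucene API; the class name SupplementaryDemo and its helper methods are hypothetical, chosen only for illustration. It shows that a supplementary character occupies two UTF-16 chars (a surrogate pair), so code that walks a string char by char never sees it as a single character, while walking by code point does.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration class; not part of Lucene or of LUCENE-2847.patch.
public class SupplementaryDemo {

    /** Returns true if the code point lies outside the Basic Multilingual Plane. */
    static boolean isOutsideBmp(int codePoint) {
        return Character.isSupplementaryCodePoint(codePoint);
    }

    /** Walks a UTF-16 string by code point rather than by char. */
    static List<Integer> codePoints(String s) {
        List<Integer> out = new ArrayList<>();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);      // joins a valid surrogate pair into one code point
            out.add(cp);
            i += Character.charCount(cp);   // advances by 1 (BMP) or 2 (supplementary) chars
        }
        return out;
    }

    public static void main(String[] args) {
        // 'a' followed by U+10400, written as the surrogate pair D801 DC00.
        String text = "a\uD801\uDC00";

        System.out.println(text.length());            // 3 UTF-16 chars
        System.out.println(codePoints(text).size());  // 2 code points

        for (int cp : codePoints(text)) {
            System.out.println(Integer.toHexString(cp) + " outside BMP: " + isOutsideBmp(cp));
        }
    }
}
```

A tokenizer that classifies input one char at a time would see the two surrogate halves separately, and neither half is a letter on its own, which is one plausible way such characters end up discarded.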

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


