[jira] Commented: (LUCENE-2847) Support all of unicode in StandardTokenizer

Robert Muir (JIRA) Wed, 05 Jan 2011 15:29:10 -0800

    [ https://issues.apache.org/jira/browse/LUCENE-2847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12978037#action_12978037 ]


Robert Muir commented on LUCENE-2847:
-------------------------------------

{quote}
If we add a target in modules/analysis/icu/build.xml to run 
GenerateJFlexSupplementaryMacros#main(), maybe named gen-stdtok-supp-macros, 
the jflex target in modules/analysis/common/build.xml could use a <subant> to 
call it and auto-generate SUPPLEMENTARY.jflex-macro, no?
{quote}

Yeah, I think we could do something like this. We could also consolidate the 
tools, because in general I would rather have all the analyzers consolidated; 
they are only split up because of dependencies, large files, etc. 
But tools are different: they just assist the build.

> Support all of unicode in StandardTokenizer
> -------------------------------------------
>
>                 Key: LUCENE-2847
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2847
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Analysis
>            Reporter: Robert Muir
>             Fix For: 3.1, 4.0
>
>         Attachments: LUCENE-2847.patch
>
>
> StandardTokenizer currently only supports the BMP.
> If it encounters characters outside of the BMP, it just discards them... 
> It should instead fully implement UAX#29 across all of Unicode.
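
For readers unfamiliar with the BMP distinction mentioned in the issue summary, here is a minimal sketch of how a character outside the BMP looks to Java code. It is not taken from the patch or from StandardTokenizer; the class name SupplementaryDemo and its helper methods are hypothetical, and only standard java.lang calls are used. It shows that one supplementary code point occupies two UTF-16 code units (a surrogate pair), which is why code that works char by char can miss such characters.

```java
public class SupplementaryDemo {
    // U+10400 (DESERET CAPITAL LETTER LONG I) lies outside the BMP,
    // so Java stores it as a surrogate pair of two char values.
    static final String DESERET = new String(Character.toChars(0x10400));

    // Number of UTF-16 code units (what String.length() reports).
    static int codeUnits(String s) {
        return s.length();
    }

    // Number of Unicode code points actually present in the string.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    // Returns true if any code point in s lies outside the BMP.
    static boolean hasSupplementary(String s) {
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (Character.isSupplementaryCodePoint(cp)) {
                return true;
            }
            // Advance by 1 for BMP code points, by 2 for surrogate pairs.
            i += Character.charCount(cp);
        }
        return false;
    }

    public static void main(String[] args) {
        String text = "ab" + DESERET;
        System.out.println(codeUnits(text));        // 4
        System.out.println(codePoints(text));       // 3
        System.out.println(hasSupplementary(text)); // true
    }
}
```

Iterating by code point, as hasSupplementary does, is one common way to avoid splitting a surrogate pair; whether the tokenizer itself takes this route is not stated in this message.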

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


