[ 
https://issues.apache.org/jira/browse/LUCENE-2847?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steven Rowe updated LUCENE-2847:
--------------------------------

    Attachment: LUCENE-2847.patch

New patch, with the following changes:

# Added a new target {{gen-uax29-supp-macros}} to 
{{modules/analysis/icu/build.xml}}, and a {{<subant>}} call to it from the 
{{jflex}} task in {{modules/analysis/common/build.xml}}.
# Included {{SUPPLEMENTARY.jflex-macro}} in {{UAX29URLEmailTokenizer.jflex}} in 
the same way as it is included in {{StandardTokenizer.jflex}}.
# Copied the simple supplementary characters test from 
{{TestStandardAnalyzer.java}} to {{TestUAX29URLEmailTokenizer.java}}.
# Modified the CHANGES.txt entry for the UAX#29 issues to include a reference 
to this issue.

All tests pass.
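
For readers unfamiliar with why supplementary characters need separate handling, here is a minimal, illustrative sketch in plain Java. It is not the test from the patch and does not use any Lucene classes; the class name {{SupplementaryDemo}} and its methods are hypothetical. It only shows that a character outside the BMP occupies two UTF-16 code units (a surrogate pair), which is presumably why a grammar that matches on single code units would not see it as one character.

```java
public class SupplementaryDemo {
    // Counts Unicode code points rather than UTF-16 code units.
    public static int codePointLength(String s) {
        return s.codePointCount(0, s.length());
    }

    // Returns true if the string contains at least one code point
    // outside the Basic Multilingual Plane (above U+FFFF).
    public static boolean hasSupplementary(String s) {
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (Character.isSupplementaryCodePoint(cp)) {
                return true;
            }
            // Advance by 1 for a BMP character, 2 for a surrogate pair.
            i += Character.charCount(cp);
        }
        return false;
    }

    public static void main(String[] args) {
        // U+10400 (DESERET CAPITAL LETTER LONG I) written as a surrogate pair,
        // surrounded by two BMP letters.
        String s = "a\uD801\uDC00b";
        System.out.println(s.length());          // 4 UTF-16 code units
        System.out.println(codePointLength(s));  // 3 code points
        System.out.println(hasSupplementary(s)); // true
    }
}
```

Iterating with {{codePointAt}} and {{Character.charCount}} is the usual way to walk a Java string by code point instead of by {{char}}.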

> Support all of unicode in StandardTokenizer
> -------------------------------------------
>
>                 Key: LUCENE-2847
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2847
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Analysis
>            Reporter: Robert Muir
>             Fix For: 3.1, 4.0
>
>         Attachments: LUCENE-2847.patch, LUCENE-2847.patch
>
>
> StandardTokenizer currently only supports the BMP.
> If it encounters characters outside of the BMP, it just discards them... 
> it should instead fully implement UAX#29 across all of Unicode.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
