[
https://issues.apache.org/jira/browse/LUCENE-5357?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13840624#comment-13840624
]
Steve Rowe commented on LUCENE-5357:
------------------------------------
Tasks include:
* Update the TLDs acceptable in URLs and Emails (for
{{UAX29URLEmailTokenizer}}) from the latest IANA Root Zone Database, using
{{ant gen-tlds}}. Test data files referring to obsolete TLDs will need to be
updated to use current TLDs:
{{((email.addresses,urls).from.)random.text.with.(email.addresses,urls).txt}}.
* Update the icu module's {{GenerateJFlexSupplementaryMacros.java}} to include
supplementary character additions to JFlex grammars for new character classes
{{\[:WordBreak=Single_Quote:]}}, {{\[:WordBreak=Double_Quote:]}} and
{{\[:WordBreak=Hebrew_Letter:]}}.
* Update the JFlex grammars to Unicode 6.3
** Change the version in the {{%unicode}} directive in the grammar: {{%unicode
6.1}} -> {{%unicode 6.3}}
** Upgrade the UAX#29-based grammars to the Unicode 6.3 word break rules, in
{{StandardTokenizerImpl.jflex}} and {{UAX29URLEmailTokenizer.jflex}}.
* Regenerate the JFlex scanners in {{lucene/analysis/common/}} via {{ant
jflex}}.
* Test the new scanners against the Unicode 6.3 word break test data
** Update {{generateJavaUnicodeWordBreakTest.pl}} to handle above-BMP
characters in the Unicode character database's
{{ucd/auxiliary/WordBreakTest.txt}} (previous Unicode versions included only
BMP characters in that file).
** Using {{generateJavaUnicodeWordBreakTest.pl}}, generate
{{WordBreakTestUnicode_6_3_0.java}} under
{{lucene/analysis/common/src/test/org/apache/lucene/analysis/core/}}.
** Update {{TestStandardAnalyzer.java}} and {{TestUAX29URLEmailTokenizer.java}}
to invoke {{WordBreakTestUnicode_6_3_0}} rather than
{{WordBreakTestUnicode_6_1_0}}.
** Remove {{WordBreakTestUnicode_6_1_0.java}}.
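To illustrate the above-BMP handling needed when converting {{WordBreakTest.txt}}, here is a minimal Java sketch. It is not the Perl script and not Lucene code; the class and method names are hypothetical. It assumes the usual UCD line layout (hex code points separated by the break mark U+00F7 and the no-break mark U+00D7, with a trailing {{#}} comment) and shows that a code point above U+FFFF must be emitted as a surrogate pair in generated Java source.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical helper for illustration; not part of Lucene or of the Perl script. */
final class WordBreakLineSketch {

  /** Splits one WordBreakTest.txt rule line into the segments between break marks. */
  static List<String> segments(String line) {
    int hash = line.indexOf('#');                 // everything after '#' is a comment
    String rule = (hash >= 0 ? line.substring(0, hash) : line).trim();
    List<String> result = new ArrayList<>();
    if (rule.isEmpty()) {
      return result;                              // blank or comment-only line
    }
    StringBuilder current = new StringBuilder();
    for (String field : rule.split("\\s+")) {
      if (field.equals("\u00F7")) {               // DIVISION SIGN: break allowed here
        if (current.length() > 0) {
          result.add(current.toString());
          current.setLength(0);
        }
      } else if (!field.equals("\u00D7")) {       // MULTIPLICATION SIGN: no break, skip
        // Hex code point; appendCodePoint emits a surrogate pair above the BMP.
        current.appendCodePoint(Integer.parseInt(field, 16));
      }
    }
    if (current.length() > 0) {
      result.add(current.toString());
    }
    return result;
  }

  /** Renders a string as Java escapes, one escape per UTF-16 code unit. */
  static String javaEscape(String s) {
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < s.length(); i++) {
      sb.append(String.format(Locale.ROOT, "\\u%04X", (int) s.charAt(i)));
    }
    return sb.toString();
  }
}
```

For a line such as {{÷ 0061 × 0308 ÷ 1F1E6 ÷}}, {{segments}} yields two segments, and {{javaEscape}} renders the second one (U+1F1E6) as the two escapes for D83C and DDE6 rather than as a single four-digit escape.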
Additional task for the 4.x backport:
* Version the JFlex grammars:
** Copy the current implementations to *Impl40 (where 40 => 4.0 is the version in
which the Unicode 6.1 versions of these scanners were introduced).
** Have the versioning tokenizer wrappers instantiate this version when the
{{Version}} constructor parameter is in the range 4.0 to 4.6.
** Change the specified Unicode version in the non-versioned JFlex grammars
from 6.1 to 6.3.
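The wrapper dispatch described above could look roughly like the following sketch. All names here are hypothetical stand-ins: the real tokenizers take a Lucene {{Version}} argument and wrap generated JFlex scanner classes, neither of which is modeled here. The assumption is only that versions 4.0 through 4.6 get the frozen Unicode 6.1 scanner and later versions get the regenerated Unicode 6.3 one; scanner snapshots for pre-4.0 versions are left out.

```java
/** Hypothetical sketch of version-based scanner selection; not Lucene's actual API. */
final class ScannerSelectionSketch {

  /** Stand-in for a generated JFlex scanner. */
  interface Scanner {
    String unicodeVersion();
  }

  /** Stand-in for the frozen copy (the *Impl40 classes), built from Unicode 6.1. */
  static final class ScannerImpl40 implements Scanner {
    @Override
    public String unicodeVersion() {
      return "6.1";
    }
  }

  /** Stand-in for the non-versioned scanner, regenerated from Unicode 6.3. */
  static final class ScannerImpl implements Scanner {
    @Override
    public String unicodeVersion() {
      return "6.3";
    }
  }

  /** Picks the scanner matching the behavior the caller asked for. */
  static Scanner forVersion(int major, int minor) {
    if (major < 4) {
      // Older snapshots exist in practice but are outside this sketch.
      throw new IllegalArgumentException("pre-4.0 versions are not modeled here");
    }
    boolean frozenRange = major == 4 && minor <= 6;   // 4.0 through 4.6 inclusive
    return frozenRange ? new ScannerImpl40() : new ScannerImpl();
  }
}
```

Keeping the old scanner as a separate class means indexes built with a 4.0 to 4.6 {{Version}} keep tokenizing the same way after the upgrade.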
> Upgrade StandardTokenizer & co to latest unicode rules
> ------------------------------------------------------
>
> Key: LUCENE-5357
> URL: https://issues.apache.org/jira/browse/LUCENE-5357
> Project: Lucene - Core
> Issue Type: New Feature
> Components: modules/analysis
> Reporter: Robert Muir
>
> besides any change in data, the rules have also changed (regional indicators,
> better handling for hebrew, etc)