[jira] [Commented] (LUCENE-5357) Upgrade StandardTokenizer & co to latest unicode rules

Steve Rowe (JIRA) Thu, 05 Dec 2013 14:06:13 -0800

    [ https://issues.apache.org/jira/browse/LUCENE-5357?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13840624#comment-13840624 ]


Steve Rowe commented on LUCENE-5357:
------------------------------------

Tasks include:

* Update the TLDs acceptable in URLs and Emails (for 
{{UAX29URLEmailTokenizer}}) from the latest IANA Root Zone Database, using 
{{ant gen-tlds}}.  Test data files referring to obsolete TLDs will need to be 
updated to use current TLDs: 
{{((email.addresses,urls).from.)random.text.with.(email.addresses,urls).txt}}.
* Update the icu module's {{GenerateJFlexSupplementaryMacros.java}} to include 
supplementary character additions to JFlex grammars for new character classes 
{{\[:WordBreak=Single_Quote:]}}, {{\[:WordBreak=Double_Quote:]}} and 
{{\[:WordBreak=Hebrew_Letter:]}}.
* Update the JFlex grammars to Unicode 6.3
** Change the version in the {{%unicode}} directive in the grammar: {{%unicode 
6.1}} -> {{%unicode 6.3}}
** Upgrade the UAX#29-based grammars to the Unicode 6.3 word break rules, in 
{{StandardTokenizerImpl.jflex}} and {{UAX29URLEmailTokenizerImpl.jflex}}.
* Regenerate the JFlex scanners in {{lucene/analysis/common/}} via {{ant 
jflex}}.
* Test the new scanners against the Unicode 6.3 word break test data
** Update {{generateJavaUnicodeWordBreakTest.pl}} to handle above-BMP 
characters in the Unicode character database's 
{{ucd/auxiliary/WordBreakTest.txt}} (previous Unicode versions included only 
BMP characters in that file).
** Using {{generateJavaUnicodeWordBreakTest.pl}}, generate 
{{WordBreakTestUnicode_6_3_0.java}} under 
{{lucene/analysis/common/src/test/org/apache/lucene/analysis/core/}}.
** Update {{TestStandardAnalyzer.java}} and {{TestUAX29URLEmailTokenizer.java}} 
to invoke {{WordBreakTestUnicode_6_3_0}} rather than 
{{WordBreakTestUnicode_6_1_0}}.
** Remove {{WordBreakTestUnicode_6_1_0.java}}.
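
To illustrate why above-BMP characters in {{WordBreakTest.txt}} need special handling, here is a minimal Java sketch. It is not the Perl generator and not Lucene code: the class {{WordBreakLineSketch}} and its method are hypothetical, and the sample line in {{main}} is made up for illustration. It assumes the usual layout of a test line: hex code points separated by a division sign (break) or a multiplication sign (no break), followed by a {{#}} comment. The point is that a code point above U+FFFF must be appended as a code point, because it occupies two Java {{char}} values (a surrogate pair).

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper, not part of Lucene: splits one WordBreakTest.txt-style line into segments. */
final class WordBreakLineSketch {

  /** Returns the text between break marks; a no-break mark joins code points into one segment. */
  static List<String> segments(String line) {
    // Drop the trailing "# ..." comment, if any.
    int hash = line.indexOf('#');
    String data = (hash >= 0 ? line.substring(0, hash) : line).trim();
    List<String> result = new ArrayList<>();
    if (data.isEmpty()) {
      return result;
    }
    StringBuilder current = new StringBuilder();
    for (String field : data.split("\\s+")) {
      if (field.equals("\u00F7")) {
        // Division sign: a break is allowed here, so close the current segment.
        if (current.length() > 0) {
          result.add(current.toString());
          current.setLength(0);
        }
      } else if (!field.equals("\u00D7")) {
        // Anything other than the multiplication sign is a hex code point.
        // appendCodePoint emits a surrogate pair for values above U+FFFF.
        current.appendCodePoint(Integer.parseInt(field, 16));
      }
    }
    if (current.length() > 0) {
      result.add(current.toString());
    }
    return result;
  }

  public static void main(String[] args) {
    // Illustrative line: two regional indicator symbols kept together, then the letter a.
    String line = "\u00F7 1F1E6 \u00D7 1F1E7 \u00F7 0061 \u00F7\t# illustrative only";
    for (String segment : segments(line)) {
      System.out.println(segment.codePointCount(0, segment.length())
          + " code point(s), " + segment.length() + " char(s)");
    }
  }
}
```

A generator that casts each parsed value to a single 16-bit {{char}} would silently corrupt such lines, which is presumably why the script needs updating once the test data contains supplementary characters.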

Additional task for the 4.x backport:

* Version the JFlex grammars: 
** Copy the current implementations to {{*Impl40}} (where 40 => 4.0 is the version in 
which the Unicode 6.1 versions of these scanners were introduced).
** Have the versioned tokenizer wrappers instantiate these copies when the 
{{Version}} constructor parameter is in the range 4.0 to 4.6.
** Change the specified Unicode version in the non-versioned JFlex grammars 
from 6.1 to 6.3.
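
The versioning scheme described above can be sketched in Java as follows. Everything here is a hypothetical stand-in rather than the real Lucene API: the {{Ver}} enum, the {{Scanner}} interface, and both scanner classes exist only for this example, and {{AFTER_4_6}} is a placeholder for whichever release first ships the Unicode 6.3 grammars. Versions before 4.0, which the real wrappers also have to handle, are left out.

```java
/** Hypothetical stand-ins for illustration; these are not the real Lucene classes. */
final class VersionedScannerSketch {

  /** Simplified version constants, declared oldest to newest so compareTo orders them. */
  enum Ver { V4_0, V4_6, AFTER_4_6 }

  /** Minimal scanner contract, just enough to show which grammar was selected. */
  interface Scanner {
    String unicodeVersion();
  }

  /** Frozen copy of the grammar generated against Unicode 6.1 (the "Impl40" copy). */
  static final class ScannerImpl40 implements Scanner {
    @Override
    public String unicodeVersion() {
      return "6.1";
    }
  }

  /** Current, non-versioned grammar generated against Unicode 6.3. */
  static final class ScannerImpl implements Scanner {
    @Override
    public String unicodeVersion() {
      return "6.3";
    }
  }

  /** Callers asking for 4.0 through 4.6 behavior get the frozen copy; later versions get the new grammar. */
  static Scanner scannerFor(Ver matchVersion) {
    return matchVersion.compareTo(Ver.V4_6) <= 0 ? new ScannerImpl40() : new ScannerImpl();
  }

  public static void main(String[] args) {
    for (Ver v : Ver.values()) {
      System.out.println(v + " -> Unicode " + scannerFor(v).unicodeVersion());
    }
  }
}
```

Keeping the old grammar as a separate frozen class means existing indexes built with a 4.0 to 4.6 match version continue to tokenize identically, while the non-versioned grammar is free to move to the new rules.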


> Upgrade StandardTokenizer & co to latest unicode rules
> ------------------------------------------------------
>
>                 Key: LUCENE-5357
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5357
>             Project: Lucene - Core
>          Issue Type: New Feature
>          Components: modules/analysis
>            Reporter: Robert Muir
>
> Besides any change in data, the rules have also changed (regional indicators, 
> better handling for Hebrew, etc.)



--
This message was sent by Atlassian JIRA
(v6.1#6144)

