[
https://issues.apache.org/jira/browse/LUCENE-1488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Robert Muir updated LUCENE-1488:
--------------------------------
Attachment: LUCENE-1488.patch
Updated patch; not ready yet, but you can see where I am going.
ICUTokenizer: Breaks text into words according to UAX #29: Unicode Text
Segmentation. Text is divided across script boundaries so that segmentation
can be tailored per writing system; for example, Thai text is segmented with a
different method. Both the default and the script-specific rules can be
tailored; in the resources folder I have some examples for Southeast Asian
scripts, etc. Since I need script boundaries for tailoring anyway, I stuff the
ISO 15924 script code constant into the token flags; this could be useful for
downstream consumers.
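For illustration, a minimal sketch of per-codepoint script detection with
ICU4J (UScript.getScript and UScript.getShortName are real ICU4J APIs; the
run-splitting loop below is illustrative, not the patch's tokenizer code):

    import com.ibm.icu.lang.UScript;

    public class ScriptRunDemo {
        public static void main(String[] args) {
            String text = "latin \u0E44\u0E17\u0E22"; // Latin followed by Thai
            int i = 0;
            while (i < text.length()) {
                int cp = text.codePointAt(i);
                int script = UScript.getScript(cp);
                // getShortName returns the ISO 15924 code, e.g. "Latn" or "Thai"
                System.out.println(UScript.getShortName(script) + " : "
                    + new String(Character.toChars(cp)));
                i += Character.charCount(cp);
            }
        }
    }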
ICUCaseFoldingFilter: Folds case according to Unicode Default Caseless
Matching (full case folding). This may change the length of the token; for
example, the German sharp s is folded to 'ss'. This filter interacts with the
downstream normalization filter in a special way, so you can provide a hint as
to what the desired normalization form will be. In the NFKC or NFKD case it
will apply the NFKC_Closure set, so you do not have to compute
Normalize(Fold(Normalize(Fold(x)))).
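A quick sketch of full case folding with ICU4J's UCharacter.foldCase (a real
ICU4J API), showing how the folded token can be longer than the input:

    import com.ibm.icu.lang.UCharacter;

    public class FoldDemo {
        public static void main(String[] args) {
            // Full case folding can change token length:
            // German sharp s (U+00DF) folds to "ss".
            System.out.println(UCharacter.foldCase("Stra\u00DFe", true)); // strasse
        }
    }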
ICUDigitFoldingFilter: Standardizes digits from different scripts to the Latin
values 0-9.
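For example, ICU4J's UCharacter.digit (a real API) maps a digit codepoint from
any script to its numeric value, which can then be re-emitted as a Latin digit:

    import com.ibm.icu.lang.UCharacter;

    public class DigitFoldDemo {
        public static void main(String[] args) {
            int cp = 0x0665;                      // ARABIC-INDIC DIGIT FIVE
            int value = UCharacter.digit(cp, 10); // 5, or -1 if not a digit
            System.out.println((char) ('0' + value)); // prints '5'
        }
    }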
ICUFormatFilter: Removes identifier-ignorable codepoints, specifically those
in the Format (Cf) category.
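A sketch of the category check using ICU4J (UCharacter.getType and
UCharacterCategory.FORMAT are real APIs; the stripping loop is illustrative,
not the filter itself):

    import com.ibm.icu.lang.UCharacter;
    import com.ibm.icu.lang.UCharacterCategory;

    public class FormatStripDemo {
        public static void main(String[] args) {
            String in = "a\u200Db"; // ZERO WIDTH JOINER (Cf) between 'a' and 'b'
            StringBuilder out = new StringBuilder();
            int i = 0;
            while (i < in.length()) {
                int cp = in.codePointAt(i);
                if (UCharacter.getType(cp) != UCharacterCategory.FORMAT) {
                    out.appendCodePoint(cp);
                }
                i += Character.charCount(cp);
            }
            System.out.println(out); // prints "ab"
        }
    }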
ICUNormalizationFilter: Applies Unicode normalization to text, accelerated
with a quick-check.
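The quick-check idea, sketched with ICU4J's Normalizer (quickCheck and
normalize are real ICU4J APIs; this is not the filter's actual code):

    import com.ibm.icu.text.Normalizer;

    public class QuickCheckDemo {
        public static void main(String[] args) {
            String s = "e\u0301"; // 'e' + COMBINING ACUTE ACCENT (decomposed)
            // Only normalize when quick-check cannot prove the text is
            // already in the target form.
            if (Normalizer.quickCheck(s, Normalizer.NFC) != Normalizer.YES) {
                s = Normalizer.normalize(s, Normalizer.NFC); // composes to U+00E9
            }
            System.out.println(s);
        }
    }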
ICUAnalyzer ties all of this together. All of these components should also
work correctly with surrogate-pair data.
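How the pieces might be wired together; the component names are from this
patch, but the constructor signatures below are assumptions, since the API is
not final:

    import java.io.Reader;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;

    public class MyICUAnalyzer extends Analyzer {
        public TokenStream tokenStream(String fieldName, Reader reader) {
            TokenStream ts = new ICUTokenizer(reader);   // ctor is an assumption
            ts = new ICUCaseFoldingFilter(ts);           // ctor is an assumption
            ts = new ICUDigitFoldingFilter(ts);          // ctor is an assumption
            ts = new ICUFormatFilter(ts);                // ctor is an assumption
            ts = new ICUNormalizationFilter(ts);         // ctor is an assumption
            return ts;
        }
    }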
Needs more documentation and tests; any comments are appreciated.
> issues with standardanalyzer on multilingual text
> -------------------------------------------------
>
> Key: LUCENE-1488
> URL: https://issues.apache.org/jira/browse/LUCENE-1488
> Project: Lucene - Java
> Issue Type: Wish
> Components: contrib/analyzers
> Reporter: Robert Muir
> Priority: Minor
> Attachments: ICUAnalyzer.patch, LUCENE-1488.patch
>
>
> The standard analyzer in Lucene is not exactly Unicode-friendly with regard
> to breaking text into words, especially for non-alphabetic scripts. This is
> because it is unaware of the Unicode word-break properties.
> I actually couldn't figure out how the Thai analyzer could possibly be
> working until I looked at the JFlex rules and saw that the codepoint range
> for most of the Thai block had been added to the alphanum specification.
> Defining exact codepoint ranges like this for every language could help with
> the problem, but you'd basically be reimplementing the word-break properties
> already stated in the Unicode standard.
> In general this kind of behavior is bad in Lucene even for Latin; for
> instance, the analyzer will break words around accent marks in decomposed
> form. While most Latin letter + accent combinations have composed forms in
> Unicode, some do not. (This is also an issue for ASCIIFoldingFilter, I
> suppose.)
> I've got a partially tested StandardAnalyzer that uses ICU's rule-based
> BreakIterator instead of JFlex. With this method you can define word
> boundaries according to the Unicode word-break properties (see the sketch
> after this message). After getting it into good shape I'd be happy to
> contribute it to contrib, but I wonder if there's a better solution so that
> out-of-the-box Lucene will be more friendly to non-ASCII text. Unfortunately
> it seems JFlex does not support these properties, such as
> [\p{Word_Break = Extend}], so this is probably the major barrier.
> Thanks,
> Robert
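For reference, a minimal sketch of the ICU word BreakIterator approach
described in the quoted report (getWordInstance, setText, first, next, and
DONE are real ICU4J APIs; the sample text is illustrative):

    import com.ibm.icu.text.BreakIterator;
    import com.ibm.icu.util.ULocale;

    public class WordBreakDemo {
        public static void main(String[] args) {
            // Thai ("phasa thai") followed by a Latin word.
            String text = "\u0E20\u0E32\u0E29\u0E32\u0E44\u0E17\u0E22 test";
            BreakIterator bi = BreakIterator.getWordInstance(ULocale.ROOT);
            bi.setText(text);
            int start = bi.first();
            for (int end = bi.next(); end != BreakIterator.DONE; end = bi.next()) {
                System.out.println("[" + text.substring(start, end) + "]");
                start = end;
            }
        }
    }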