[ https://issues.apache.org/jira/browse/LUCENE-1488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Robert Muir updated LUCENE-1488:
--------------------------------

    Attachment: LUCENE-1488.patch

Updated patch; not ready yet, but you can see where I am going.

ICUTokenizer: Breaks text into words according to UAX #29: Unicode Text Segmentation. Text is divided across script boundaries so that segmentation can be tailored for different writing systems; for example, Thai text is segmented with a different method. The default and the script-specific rules can both be tailored; in the resources folder I have some examples for Southeast Asian scripts, etc. Since I need script boundaries for tailoring anyway, I stuff the ISO 15924 script code constant into the token flags; this could be useful for downstream consumers.

ICUCaseFoldingFilter: Folds case according to Unicode Default Caseless Matching, i.e. full case folding. This may change the length of the token; for example, German sharp s is folded to 'ss'. This filter interacts with the downstream normalization filter in a special way: you can provide a hint as to what the desired normalization form will be. In the NFKC or NFKD case it will apply the NFKC_Closure set, so you do not have to Normalize(Fold(Normalize(Fold(x)))).

ICUDigitFoldingFilter: Standardizes digits from different scripts to the Latin values 0-9.

ICUFormatFilter: Removes identifier-ignorable code points, specifically those from the Format category.

ICUNormalizationFilter: Applies Unicode normalization to text. This is accelerated with a quick check.

ICUAnalyzer ties all of this together. All of these components should also work correctly with surrogate-pair data.

Needs more docs and tests. Any comments appreciated. (Rough sketches of the ICU4J calls these pieces rely on are appended below, after the quoted issue.)

> issues with standardanalyzer on multilingual text
> -------------------------------------------------
>
>                 Key: LUCENE-1488
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1488
>             Project: Lucene - Java
>          Issue Type: Wish
>          Components: contrib/analyzers
>            Reporter: Robert Muir
>            Priority: Minor
>         Attachments: ICUAnalyzer.patch, LUCENE-1488.patch
>
>
> The standard analyzer in Lucene is not exactly Unicode-friendly with regard to breaking text into words, especially with respect to non-alphabetic scripts. This is because it is unaware of the Unicode word-boundary properties.
>
> I actually couldn't figure out how the Thai analyzer could possibly be working until I looked at the JFlex rules and saw that the codepoint range for most of the Thai block was added to the alphanum specification. Defining exact codepoint ranges like this for every language could help with the problem, but you'd basically be reimplementing the boundary properties already stated in the Unicode standard.
>
> In general this kind of behavior is bad in Lucene even for Latin text; for instance, the analyzer will break words around accent marks in decomposed form. While most Latin letter + accent combinations have composed forms in Unicode, some do not. (This is also an issue for ASCIIFoldingFilter, I suppose.)
>
> I've got a partially tested StandardAnalyzer that uses the ICU rule-based BreakIterator instead of JFlex. Using this method you can define word boundaries according to the Unicode boundary properties. After getting it into good shape I'd be happy to contribute it for contrib, but I wonder if there's a better solution so that out of the box Lucene will be more friendly to non-ASCII text. Unfortunately it seems JFlex does not support use of these properties, such as [\p{Word_Break = Extend}], so this is probably the major barrier.
>
> Thanks,
> Robert
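Not from the patch itself, just to illustrate the segmentation idea: a minimal sketch, assuming ICU4J's word BreakIterator (which implements UAX #29) plus UScript for per-codepoint script detection. The class name and the Thai sample text are made up for the example; a real tokenizer would detect whole script runs rather than look at one code point per token.

    import com.ibm.icu.lang.UScript;
    import com.ibm.icu.text.BreakIterator;
    import com.ibm.icu.util.ULocale;

    public class Uax29SegmentationDemo {
      public static void main(String[] args) {
        String text = "quick brown fox \u0e44\u0e21\u0e48\u0e40\u0e1b\u0e47\u0e19\u0e44\u0e23";
        // UAX #29 word segmentation: ICU's default word BreakIterator
        // follows the Unicode Text Segmentation rules.
        BreakIterator words = BreakIterator.getWordInstance(ULocale.ROOT);
        words.setText(text);
        int start = words.first();
        for (int end = words.next(); end != BreakIterator.DONE; start = end, end = words.next()) {
          String token = text.substring(start, end);
          if (token.trim().isEmpty()) continue; // skip whitespace "tokens"
          // ISO 15924 script of the first code point; the tokenizer described
          // above stashes this kind of script constant in the token flags.
          int script = UScript.getScript(token.codePointAt(0));
          System.out.println(token + " -> " + UScript.getShortName(script));
        }
      }
    }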
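Similarly, a rough sketch (not the patch code) of the case-folding and normalization calls: full case folding with UCharacter.foldCase, which can lengthen the token, and normalization guarded by the quick check so already-normalized text is left alone.

    import com.ibm.icu.lang.UCharacter;
    import com.ibm.icu.text.Normalizer;

    public class FoldAndNormalizeDemo {
      public static void main(String[] args) {
        // Full case folding can change the token length:
        // German sharp s folds to "ss".
        String folded = UCharacter.foldCase("Fu\u00dfball", true);
        System.out.println(folded); // fussball

        // Normalization accelerated by the quick check: only normalize
        // when the text is not already known to be in the requested form.
        String s = "re\u0301sume\u0301"; // decomposed combining accents
        if (Normalizer.quickCheck(s, Normalizer.NFC) != Normalizer.YES) {
          s = Normalizer.normalize(s, Normalizer.NFC);
        }
        System.out.println(s); // composed form
      }
    }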
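And a sketch of what the digit-folding and format-stripping steps boil down to per code point (the fold() helper and class name are mine, not the patch's): map any script's decimal digits to Latin 0-9 and drop Format-category (Cf) code points.

    import com.ibm.icu.lang.UCharacter;
    import com.ibm.icu.lang.UCharacterCategory;

    public class DigitAndFormatFoldingDemo {
      // Fold decimal digits from any script to '0'-'9' and remove
      // Format-category code points such as ZERO WIDTH JOINER.
      static String fold(String input) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < input.length(); ) {
          int cp = input.codePointAt(i);
          i += Character.charCount(cp);
          if (UCharacter.getType(cp) == UCharacterCategory.FORMAT) {
            continue; // strip identifier-ignorable format characters
          }
          int digit = UCharacter.digit(cp, 10);
          if (digit >= 0) {
            out.append((char) ('0' + digit)); // e.g. Arabic-Indic digits -> 0-9
          } else {
            out.appendCodePoint(cp);
          }
        }
        return out.toString();
      }

      public static void main(String[] args) {
        // Arabic-Indic digits U+0661..U+0663 fold to "123".
        System.out.println(fold("\u0661\u0662\u0663"));
      }
    }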