[ https://issues.apache.org/jira/browse/LUCENE-1488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12703598#action_12703598 ]
uday kumar maddigatla commented on LUCENE-1488:
-----------------------------------------------

Hi,

I am facing the same problem too. My document contains English as well as Danish elements. I tried to use this analyzer, but when I do, I get this error:

Exception in thread "main" java.lang.ExceptionInInitializerError
	at org.apache.lucene.analysis.icu.ICUAnalyzer.tokenStream(ICUAnalyzer.java:74)
	at org.apache.lucene.analysis.Analyzer.reusableTokenStream(Analyzer.java:48)
	at org.apache.lucene.index.DocInverterPerField.processFields(DocInverterPerField.java:117)
	at org.apache.lucene.index.DocFieldConsumersPerField.processFields(DocFieldConsumersPerField.java:36)
	at org.apache.lucene.index.DocFieldProcessorPerThread.processDocument(DocFieldProcessorPerThread.java:234)
	at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:765)
	at org.apache.lucene.index.DocumentsWriter.addDocument(DocumentsWriter.java:743)
	at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1918)
	at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1895)
	at com.IndexFiles.indexDocs(IndexFiles.java:87)
	at com.IndexFiles.indexDocs(IndexFiles.java:80)
	at com.IndexFiles.main(IndexFiles.java:57)
Caused by: java.lang.IllegalArgumentException: Error 66063 at line 2 column 17
	at com.ibm.icu.text.RBBIRuleScanner.error(RBBIRuleScanner.java:505)
	at com.ibm.icu.text.RBBIRuleScanner.scanSet(RBBIRuleScanner.java:1047)
	at com.ibm.icu.text.RBBIRuleScanner.doParseActions(RBBIRuleScanner.java:484)
	at com.ibm.icu.text.RBBIRuleScanner.parse(RBBIRuleScanner.java:912)
	at com.ibm.icu.text.RBBIRuleBuilder.compileRules(RBBIRuleBuilder.java:298)
	at com.ibm.icu.text.RuleBasedBreakIterator.compileRules(RuleBasedBreakIterator.java:316)
	at com.ibm.icu.text.RuleBasedBreakIterator.<init>(RuleBasedBreakIterator.java:71)
	at org.apache.lucene.analysis.icu.ICUBreakIterator.<init>(ICUBreakIterator.java:53)
	at org.apache.lucene.analysis.icu.ICUBreakIterator.<init>(ICUBreakIterator.java:45)
	at org.apache.lucene.analysis.icu.ICUTokenizer.<clinit>(ICUTokenizer.java:58)
	... 12 more

Please help me with this.

> issues with standardanalyzer on multilingual text
> -------------------------------------------------
>
> Key: LUCENE-1488
> URL: https://issues.apache.org/jira/browse/LUCENE-1488
> Project: Lucene - Java
> Issue Type: Wish
> Components: contrib/analyzers
> Reporter: Robert Muir
> Priority: Minor
> Attachments: ICUAnalyzer.patch
>
>
> The standard analyzer in Lucene is not exactly Unicode-friendly with regard
> to breaking text into words, especially with respect to non-alphabetic
> scripts. This is because it is unaware of Unicode bounds properties.
> I actually couldn't figure out how the Thai analyzer could possibly be
> working until I looked at the JFlex rules and saw that the codepoint range
> for most of the Thai block was added to the alphanum specification. Defining
> the exact codepoint ranges like this for every language could help with the
> problem, but you'd basically be reimplementing the bounds properties already
> stated in the Unicode standard.
> In general it looks like this kind of behavior is bad in Lucene even for
> Latin; for instance, the analyzer will break words around accent marks in
> decomposed form. While most Latin letter + accent combinations have composed
> forms in Unicode, some do not. (This is also an issue for ASCIIFoldingFilter,
> I suppose.)
> I've got a partially tested StandardAnalyzer that uses the ICU rule-based
> BreakIterator instead of JFlex. Using this method you can define word
> boundaries according to the Unicode bounds properties. After getting it into
> some good shape I'd be happy to contribute it for contrib, but I wonder if
> there's a better solution so that out of the box Lucene will be more friendly
> to non-ASCII text.
> Unfortunately it seems JFlex does not support the use of properties such as
> [\p{Word_Break = Extend}], so this is probably the major barrier.
> Thanks,
> Robert

--
This message is automatically generated by JIRA.
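
To make the decomposed-accent problem from the issue description concrete, here is a minimal, self-contained Java sketch. It does not use Lucene, JFlex, or ICU, and it is not the StandardAnalyzer grammar. The class name DecomposedAccentDemo and its methods are hypothetical names made up for this illustration, and the "mark-aware" rule is only a rough approximation of Word_Break = Extend built from java.lang.Character general categories.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class; not part of Lucene or ICU.
final class DecomposedAccentDemo {

    // Naive rule: only letters and digits count as word characters,
    // so a combining accent (category Mn) ends the current token.
    static boolean isNaiveWordChar(int cp) {
        return Character.isLetterOrDigit(cp);
    }

    // Assumption: treat combining marks (Mn, Me, Mc) as word characters,
    // a rough stand-in for the Word_Break = Extend property.
    static boolean isMarkAwareWordChar(int cp) {
        int type = Character.getType(cp);
        return Character.isLetterOrDigit(cp)
                || type == Character.NON_SPACING_MARK
                || type == Character.ENCLOSING_MARK
                || type == Character.COMBINING_SPACING_MARK;
    }

    // Split text into runs of word characters, by code point.
    static List<String> tokenize(String text, boolean markAware) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            boolean wordChar = markAware ? isMarkAwareWordChar(cp) : isNaiveWordChar(cp);
            if (wordChar) {
                current.appendCodePoint(cp);
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
            i += Character.charCount(cp);
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // NFD turns each U+00E9 into 'e' followed by U+0301 (combining acute).
        String decomposed = Normalizer.normalize("r\u00E9sum\u00E9 text", Normalizer.Form.NFD);
        System.out.println("naive tokens:      " + tokenize(decomposed, false).size());
        System.out.println("mark-aware tokens: " + tokenize(decomposed, true).size());
    }
}
```

With the naive rule, the decomposed input is split into three tokens (re, sume, text) because each U+0301 acts as a separator. With the mark-aware rule it yields two tokens, and the accents stay attached to their base letters.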