[ https://issues.apache.org/jira/browse/LUCENE-2653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Robert Muir updated LUCENE-2653:
--------------------------------
    Fix Version/s: 3.0.3
                   2.9.4

> ThaiAnalyzer assumes things about your jre
> ------------------------------------------
>
>                 Key: LUCENE-2653
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2653
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: contrib/analyzers
>    Affects Versions: 3.1, 4.0
>            Reporter: Robert Muir
>            Assignee: Robert Muir
>             Fix For: 2.9.4, 3.0.3, 3.1, 4.0
>
>         Attachments: LUCENE-2653.patch
>
>
> The ThaiAnalyzer/ThaiWordFilter depends on the fact that
> BreakIterator.getWordInstance(new Locale("th")) returns a dictionary-based
> break iterator that can segment Thai phrases into words (Thai does not use
> whitespace between words).
> But it is non-standard for the JRE to specialize this locale in this
> way; it's nice, but you can't depend on it.
> For example, if you are running on the IBM JRE, this analyzer/wordfilter is
> completely "broken" in the sense that it won't do what it claims to do.
> At a minimum, we need to document this and suggest that users look at
> ICUTokenizer for Thai, which always has this break iterator and is not
> JRE-dependent.
> Better would be to check statically that the thing actually works.
> When creating a new ThaiWordFilter we could clone() the BreakIterator, which
> is often cheaper than making a new one anyway.
> We could throw an exception if it's not supported, and add a boolean so the
> user knows whether it works.
> And we could refer to this boolean with Assert.assume in its tests.
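
The static check proposed in the description could look roughly like the sketch below. It is only an illustration of the idea, not necessarily what the attached patch does. The class name ThaiBreakCheck, the method newThaiBreaker() and the flag name DBBI_AVAILABLE are hypothetical, and the probe itself is an assumption: it feeds the iterator the two-word phrase "ภาษาไทย" ("Thai language", written without a space) and treats a word boundary reported at offset 4 as evidence of dictionary-based segmentation.

```java
import java.text.BreakIterator;
import java.util.Locale;

// Hypothetical helper; names are illustrative and not taken from the patch.
final class ThaiBreakCheck {

    // Shared prototype. Callers receive clones, which is often cheaper than
    // asking the factory for a fresh instance each time.
    // Locale.forLanguageTag("th") is used here in place of the issue's
    // new Locale("th"); both name the Thai locale.
    private static final BreakIterator PROTO =
            BreakIterator.getWordInstance(Locale.forLanguageTag("th"));

    // True if this JRE appears to segment Thai text into words.
    static final boolean DBBI_AVAILABLE;

    static {
        // Probe phrase: two Thai words (4 + 3 chars) with no space between
        // them, written with Unicode escapes to keep the source ASCII-only.
        String probeText = "\u0E20\u0E32\u0E29\u0E32\u0E44\u0E17\u0E22";
        BreakIterator probe = (BreakIterator) PROTO.clone();
        probe.setText(probeText);
        // Assumption: a dictionary-based iterator reports a boundary between
        // the two words, i.e. at offset 4; a whitespace-style one does not.
        DBBI_AVAILABLE = probe.isBoundary(4);
    }

    // Returns a private clone of the prototype, or fails loudly if the JRE
    // cannot segment Thai.
    static BreakIterator newThaiBreaker() {
        if (!DBBI_AVAILABLE) {
            throw new UnsupportedOperationException(
                    "This JRE does not appear to support Thai word segmentation");
        }
        return (BreakIterator) PROTO.clone();
    }

    public static void main(String[] args) {
        System.out.println("Thai segmentation available: " + DBBI_AVAILABLE);
    }
}
```

A test could then consult the flag and skip itself when Thai segmentation is unavailable, instead of failing on JREs that lack it.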