[ https://issues.apache.org/jira/browse/LUCENE-3366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13081374#comment-13081374 ]
Robert Muir commented on LUCENE-3366:
-------------------------------------

The purpose of the filter is "Normalizes tokens extracted with StandardTokenizer". Currently this is a no-op, but we can always improve it, going with the spirit of the whole standard this thing implements. The TODO currently refers to this statement: "For Thai, Lao, Khmer, Myanmar, and other scripts that do not typically use spaces between words, a good implementation should not depend on the default word boundary specification. It should use a more sophisticated mechanism ... Ideographic scripts such as Japanese and Chinese are even more complex."

There is no problem having a TODO in this filter; we don't need to do a rush job for any reason. Some of the preparation for this (e.g. improving the default behavior for CJK) was already done in LUCENE-2911. We now tag all these special types, so in the meantime, if someone wants to do their own downstream processing, they can do it themselves.

> StandardFilter only works with ClassicTokenizer and only when version < 3.1
> ---------------------------------------------------------------------------
>
>                 Key: LUCENE-3366
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3366
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: modules/analysis
>    Affects Versions: 3.3
>            Reporter: David Smiley
>
> The StandardFilter used to remove periods from acronyms and apostrophe-S's
> where they occurred, and it used to work in conjunction with the
> StandardTokenizer. Presently, it only does this with ClassicTokenizer and
> when the Lucene match version is before 3.1.
> Here is an excerpt from the code:
> {code:lang=java}
> public final boolean incrementToken() throws IOException {
>   if (matchVersion.onOrAfter(Version.LUCENE_31))
>     return input.incrementToken(); // TODO: add some niceties for the new grammar
>   else
>     return incrementTokenClassic();
> }
> {code}
> It seems to me that in the great refactoring of the standard tokenizer,
> LUCENE-2167, something was forgotten here. I think that if someone uses the
> ClassicTokenizer then, no matter what the version is, this filter should do
> what it used to do. And the TODO suggests someone forgot to make this filter
> do something useful for the StandardTokenizer. Or perhaps that idea should
> be discarded and this class should be renamed ClassicTokenFilter.
> In any event, the javadocs for this class appear out of date, as there is no
> mention of ClassicTokenizer, and the wiki is out of date too.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
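Given the version gate shown in the quoted excerpt, one workaround in the meantime is to pass a pre-3.1 `Version` constant to StandardFilter explicitly, which deliberately selects the `incrementTokenClassic()` branch. A minimal sketch, assuming a Lucene 3.x dependency; the `ClassicChain` class name is illustrative, not part of Lucene:

```java
import java.io.Reader;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.ClassicTokenizer;
import org.apache.lucene.analysis.standard.StandardFilter;
import org.apache.lucene.util.Version;

public final class ClassicChain {
  /** Builds a ClassicTokenizer chain that still gets the classic filtering. */
  public static TokenStream build(Reader reader) {
    TokenStream ts = new ClassicTokenizer(Version.LUCENE_33, reader);
    // Passing a pre-3.1 constant routes through the incrementTokenClassic()
    // branch shown in the excerpt, restoring the old acronym-period and
    // possessive-'s handling.
    ts = new StandardFilter(Version.LUCENE_30, ts);
    return ts;
  }
}
```

This is only a stopgap, of course; it relies on the very version check the issue argues is wrong.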
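Muir's point about tagged token types can be sketched as a downstream TokenFilter. A hypothetical example, assuming the Lucene 3.x analysis APIs (`TokenFilter`, `TypeAttribute`); the class name and the per-type handling are placeholders, not existing Lucene code:

```java
import java.io.IOException;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.TypeAttribute;

/** Hypothetical downstream filter keyed off token types tagged by StandardTokenizer. */
public final class TypeAwareFilter extends TokenFilter {
  private final TypeAttribute typeAtt = addAttribute(TypeAttribute.class);

  public TypeAwareFilter(TokenStream input) {
    super(input);
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (!input.incrementToken()) {
      return false;
    }
    // Since LUCENE-2911, StandardTokenizer tags scripts such as
    // "<IDEOGRAPHIC>" and "<HIRAGANA>"; a custom filter can branch on them.
    if ("<IDEOGRAPHIC>".equals(typeAtt.type())) {
      // custom handling for ideographic tokens would go here
    }
    return true;
  }
}
```

This is the "do their own downstream processing" path the comment describes: nothing in StandardFilter itself needs to change for consumers to act on the tagged types.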