[ https://issues.apache.org/jira/browse/LUCENE-3366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13081374#comment-13081374 ]

Robert Muir commented on LUCENE-3366:
-------------------------------------

The stated purpose of the filter is "Normalizes tokens extracted with 
StandardTokenizer".

Currently this is a no-op, but we can always improve it in the spirit of the 
standard (Unicode UAX #29 word segmentation) that this tokenizer implements.

The TODO currently refers to this statement:
"For Thai, Lao, Khmer, Myanmar, and other scripts that do not typically use 
spaces between words, a good implementation should not depend on the default 
word boundary specification. It should use a more sophisticated mechanism ... 
Ideographic scripts such as Japanese and Chinese are even more complex"

There is no problem with having a TODO in this filter; we don't need to rush 
this for any reason...

Some of the preparation for this (e.g. improving the default behavior for CJK) 
was already done in LUCENE-2911. We now tag all these special token types,
so in the meantime anyone who wants to do their own downstream processing can 
do so (a sketch follows below).
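
For illustration only, here is a rough sketch of what such downstream 
processing could look like: a TokenFilter that inspects the TypeAttribute and 
special-cases the token types StandardTokenizer now emits. The class name is 
made up, and exactly which type constants are available depends on the release.

{code:lang=java}
import java.io.IOException;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.TypeAttribute;

// Sketch only: a downstream filter that looks at the token types tagged by
// StandardTokenizer and special-cases scripts without space-delimited words.
public final class MyScriptAwareFilter extends TokenFilter {
  private final TypeAttribute typeAtt = addAttribute(TypeAttribute.class);

  public MyScriptAwareFilter(TokenStream input) {
    super(input);
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (!input.incrementToken()) {
      return false;
    }
    final String type = typeAtt.type();
    if (StandardTokenizer.TOKEN_TYPES[StandardTokenizer.IDEOGRAPHIC].equals(type)
        || StandardTokenizer.TOKEN_TYPES[StandardTokenizer.SOUTHEAST_ASIAN].equals(type)) {
      // This token came from a script the default word-break rules don't
      // segment well; dictionary- or statistics-based handling would go here.
    }
    return true;
  }
}
{code}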


> StandardFilter only works with ClassicTokenizer and only when version < 3.1
> ---------------------------------------------------------------------------
>
>                 Key: LUCENE-3366
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3366
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: modules/analysis
>    Affects Versions: 3.3
>            Reporter: David Smiley
>
> The StandardFilter used to remove periods from acronyms and possessive 's 
> where they occurred, and it used to work in conjunction with the 
> StandardTokenizer. Presently, it only does this with ClassicTokenizer, and 
> only when the Lucene match version is before 3.1 (see the sketch after this 
> quoted description). Here is an excerpt from the code:
> {code:lang=java}
>   public final boolean incrementToken() throws IOException {
>     if (matchVersion.onOrAfter(Version.LUCENE_31))
>       return input.incrementToken(); // TODO: add some niceties for the new grammar
>     else
>       return incrementTokenClassic();
>   }
> {code}
> It seems to me that in the great refactor of the standard tokenizer, 
> LUCENE-2167, something was forgotten here. I think that if someone uses the 
> ClassicTokenizer then no matter what the version is, this filter should do 
> what it used to do. And the TODO suggests someone forgot to make this filter 
> do something useful for the StandardTokenizer.  Or perhaps that idea should 
> be discarded and this class should be named ClassicTokenFilter.
> In any event, the javadocs for this class appear out of date as there is no 
> mention of ClassicTokenizer, and the wiki is out of date too.
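
For reference, here is a hedged sketch (class name and sample text are 
illustrative only) of the fallback the excerpt above implies: passing a 
pre-3.1 match version makes StandardFilter take the incrementTokenClassic() 
path even on a current jar.

{code:lang=java}
import java.io.Reader;
import java.io.StringReader;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.ClassicTokenizer;
import org.apache.lucene.analysis.standard.StandardFilter;
import org.apache.lucene.util.Version;

public class ClassicBehaviorSketch {

  // Build the chain with a pre-3.1 matchVersion so StandardFilter falls back
  // to incrementTokenClassic() and strips acronym periods and possessive 's,
  // as described in the issue above.
  public static TokenStream classicChain(Reader reader) {
    TokenStream ts = new ClassicTokenizer(Version.LUCENE_30, reader);
    ts = new StandardFilter(Version.LUCENE_30, ts);
    return ts;
  }

  public static void main(String[] args) {
    TokenStream ts = classicChain(new StringReader("I.B.M.'s filings"));
    // consume the stream as usual: ts.incrementToken(), inspect attributes, ...
  }
}
{code}

Whether that fallback should key off the tokenizer (ClassicTokenizer) rather 
than the match version is exactly the question this issue raises.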

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
