[jira] Commented: (LUCENE-1696) Added New Token API impl for ASCIIFoldingFilter

Simon Willnauer (JIRA) Tue, 16 Jun 2009 08:45:31 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-1696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12720197#action_12720197
 ]


Simon Willnauer commented on LUCENE-1696:
-----------------------------------------


bq. simon, actually i think it's documented that you can use the ENGLISH 
collator and it will behave like ASCIIFoldingFilter (simply removing all 
diacritics). you could then apply tailorings like the example and get the 
behavior you want, versus maintaining a custom ASCIIFoldingFilter... 

Will try, thanks!
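
For illustration, the collator idea quoted above can be sketched with the JDK's own java.text.Collator. This is a minimal sketch, not the code from the attached test: the class and method names are hypothetical, and it only shows that a PRIMARY-strength English collator treats terms that differ solely by diacritics as equal. The tailoring step mentioned in the quote is not shown.

```java
import java.text.Collator;
import java.util.Locale;

// Hypothetical helper class, not part of Lucene or of the attached patch.
public class CollatorFoldingSketch {

    // Returns true when the two terms have the same base letters, i.e. they
    // differ at most by accents or case. PRIMARY strength ignores both.
    public static boolean sameBaseLetters(String a, String b) {
        Collator collator = Collator.getInstance(Locale.ENGLISH);
        collator.setStrength(Collator.PRIMARY);
        return collator.compare(a, b) == 0;
    }

    public static void main(String[] args) {
        // "s\u00FCd" is the German word for "south", written with u-umlaut.
        System.out.println(sameBaseLetters("s\u00FCd", "sud"));
        System.out.println(sameBaseLetters("sud", "sad"));
    }
}
```

A collator compares or produces sort keys rather than rewriting the term text, so in an analysis chain the collation key would be indexed in place of the folded term.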

> Added New Token API impl for ASCIIFoldingFilter
> -----------------------------------------------
>
>                 Key: LUCENE-1696
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1696
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>    Affects Versions: 2.9
>            Reporter: Simon Willnauer
>            Assignee: Mark Miller
>             Fix For: 2.9
>
>         Attachments: ASCIIFoldingFilter._newTokenAPI.patch, 
> TestGermanCollation.java
>
>
> I added an implementation of incrementToken to ASCIIFoldingFilter.java and 
> extended the existing test case for it.
> I will attach the patch shortly.
> Besides this improvement, I would like to start a small discussion about 
> this filter. ASCIIFoldingFilter is meant to be a replacement for 
> ISOLatin1AccentFilter, which is quite nice as it covers a superset of the 
> latter. I have used this filter quite often, but never on an as-is basis. In 
> most cases this filter does the correct thing (replaces a special char 
> with its ASCII counterpart), but in some cases, such as German umlauts, it 
> does not return the expected result. A German umlaut like 'ä' does not 
> translate to 'a' but rather to 'ae'. I would like to change this, but I'm not 
> 100% sure that is expected by all users of the filter. Another way of 
> doing it would be to make it configurable with a flag. This would not affect 
> performance, as we only check whether such an umlaut char is found. 
> Further, it would be really helpful if the filter could "inject" the 
> original/unmodified token with the same position increment into the token 
> stream on demand. I think it's a valid use case to index both the modified 
> and the unmodified token. For instance, the German word "süd" would be 
> folded to "sud". In a query q:(süd) the filter would also fold to sud and 
> therefore find sud, which has a totally different meaning. Folding works 
> quite well, but for special cases we could add those options to make users' 
> lives easier. The latter could be done in a subclass, while the umlaut 
> problem should be fixed in the base class.
> simon 
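
The two proposals in the description above (a flag for German umlaut expansion, and injecting the original token next to the folded one) can be sketched in plain Java. This is only an illustration under stated assumptions: it does not use the Lucene token API, the class, the Tok type and all method names are hypothetical, and java.text.Normalizer stands in for the filter's own folding table. A position increment of 0 is assumed to mean "same position as the previous token".

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch, not the ASCIIFoldingFilter implementation.
public class FoldingProposalSketch {

    // Stand-in for a token: term text plus position increment.
    public static final class Tok {
        public final String term;
        public final int posInc;
        public Tok(String term, int posInc) { this.term = term; this.posInc = posInc; }
        @Override public String toString() { return term + "(+" + posInc + ")"; }
    }

    // Returns the two-letter expansion for a German umlaut, or null otherwise.
    static String expandUmlaut(char c) {
        switch (c) {
            case '\u00E4': return "ae"; // a-umlaut
            case '\u00F6': return "oe"; // o-umlaut
            case '\u00FC': return "ue"; // u-umlaut
            case '\u00C4': return "Ae"; // A-umlaut
            case '\u00D6': return "Oe"; // O-umlaut
            case '\u00DC': return "Ue"; // U-umlaut
            default: return null;
        }
    }

    // Folds a term; the flag switches umlaut expansion on or off.
    public static String fold(String term, boolean expandGermanUmlauts) {
        // Compose first so that "u" + combining diaeresis is seen as one char.
        String composed = Normalizer.normalize(term, Normalizer.Form.NFC);
        StringBuilder sb = new StringBuilder(composed.length() + 4);
        for (int i = 0; i < composed.length(); i++) {
            char c = composed.charAt(i);
            String expansion = expandGermanUmlauts ? expandUmlaut(c) : null;
            if (expansion != null) sb.append(expansion); else sb.append(c);
        }
        // Decompose what is left and drop the combining marks.
        String decomposed = Normalizer.normalize(sb, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "");
    }

    // Emits each folded term and, when folding changed it, the original term
    // at the same position (position increment 0).
    public static List<Tok> foldKeepingOriginals(List<String> terms, boolean expandGermanUmlauts) {
        List<Tok> out = new ArrayList<>();
        for (String term : terms) {
            String folded = fold(term, expandGermanUmlauts);
            out.add(new Tok(folded, 1));
            if (!folded.equals(term)) out.add(new Tok(term, 0));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(fold("s\u00FCd", false));
        System.out.println(fold("s\u00FCd", true));
        System.out.println(foldKeepingOriginals(Arrays.asList("im", "s\u00FCd"), false));
    }
}
```

With both forms indexed at the same position, a query analyzed without folding can still match the original spelling, while a folded query matches the folded form.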

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org
