[ https://issues.apache.org/jira/browse/LUCENE-5013?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13664302#comment-13664302 ]
Karl Wettin commented on LUCENE-5013:
-------------------------------------

I do believe that this does something different, at least as far as I can see.

Example: People in Norway would spell the Swedish village of Särdal as Særdal, but when lacking those characters on their keyboard they would write Saerdal. In Sweden people would write Sardal. ASCIIFoldingFilter and friends fold æ to ae and ä to a.

The mismatch arises primarily when a query contains the folded text, such as Saerdal. Folding every ä to ae would cause problems for people who just write a rather than ä. The same sort of mismatch occurs for å->aa, å->a, å->ao, ø->oe, ö->o. People tend to use different permutations of these alternatives, and this filter normalizes them.

So this is a filter that solves mismatches on ASCII folds for people in Norway and Denmark searching in a Swedish index, and vice versa.

See what I mean?

> ScandinavianInterintelligableASCIIFoldingFilter
> -----------------------------------------------
>
>                 Key: LUCENE-5013
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5013
>             Project: Lucene - Core
>          Issue Type: New Feature
>          Components: modules/analysis
>   Affects Versions: 4.3
>            Reporter: Karl Wettin
>            Priority: Trivial
>         Attachments: LUCENE-5013.txt
>
>
> This filter is an augmentation of the output from ASCIIFoldingFilter:
> it discriminates against the double vowels aa, ae, ao, oe and oo, leaving just
> the first one.
> blåbærsyltetøj == blåbärsyltetöj == blaabaarsyltetoej == blabarsyltetoj
> räksmörgås == ræksmørgås == ræksmörgaos == raeksmoergaas == raksmorgas
> Caveats:
> Since this filter runs on top of ASCIIFoldingFilter, äöåøæ have already been
> folded down to aoaoae by the time it handles them, which causes effects such
> as:
> bøen -> boen -> bon
> åene -> aene -> ane
> I find this to be a trivial problem compared to not finding anything at all.
>
> Background:
> Swedish åäö are in fact the same letters as Norwegian and Danish åæø and thus
> interchangeable when used between these languages. They are however folded
> differently when people type them on a keyboard lacking these characters, and
> ASCIIFoldingFilter handles ä and æ differently.
> When a Swedish person lacks umlauted characters on the keyboard, they
> consistently type a, a, o instead of å, ä, ö. Foreigners also tend to use a,
> a, o.
> In Norway people tend to type aa, ae and oe instead of å, æ and ø. Some use
> a, a, o. I've also seen oo, ao, etc., and permutations. Not sure about Denmark,
> but the pattern is probably the same.
> This filter solves that problem, but might also cause new ones.
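
A minimal sketch of the two steps the description names: ASCII folding, then dropping the second vowel of aa, ae, ao, oe and oo. It is plain Java, not the attached LUCENE-5013.txt patch and not a Lucene TokenFilter. The class and method names are hypothetical, the fold table covers only the five lowercase letters discussed in the issue, and the input is assumed to be lowercased already.

```java
// Hypothetical stand-alone sketch; not the code attached to the issue.
final class ScandinavianFoldSketch {

    // Step 1: stand-in for ASCIIFoldingFilter, restricted to the five
    // lowercase Scandinavian letters mentioned in the issue.
    static String asciiFold(String input) {
        StringBuilder out = new StringBuilder(input.length() + 4);
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            switch (c) {
                case '\u00e5': // a with ring above
                case '\u00e4': // a with diaeresis
                    out.append('a');
                    break;
                case '\u00e6': // ae ligature
                    out.append("ae");
                    break;
                case '\u00f6': // o with diaeresis
                case '\u00f8': // o with stroke
                    out.append('o');
                    break;
                default:
                    out.append(c);
            }
        }
        return out.toString();
    }

    // Step 2: scan left to right; for each pair aa, ae, ao, oe, oo keep
    // the first vowel and skip the second.
    static String dropSecondVowel(String ascii) {
        StringBuilder out = new StringBuilder(ascii.length());
        int i = 0;
        while (i < ascii.length()) {
            char c = ascii.charAt(i);
            out.append(c);
            char next = (i + 1 < ascii.length()) ? ascii.charAt(i + 1) : '\0';
            boolean pair =
                (c == 'a' && (next == 'a' || next == 'e' || next == 'o'))
                || (c == 'o' && (next == 'e' || next == 'o'));
            i += pair ? 2 : 1;
        }
        return out.toString();
    }

    // Both steps combined, in the order the issue describes.
    static String fold(String input) {
        return dropSecondVowel(asciiFold(input));
    }
}
```

Under these assumptions, fold("raeksmoergaas") and fold("räksmörgås") both yield raksmorgas, and the caveat cases bøen and åene come out as bon and ane.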