[ 
https://issues.apache.org/jira/browse/LUCENE-5013?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13664626#comment-13664626
 ] 

Karl Wettin commented on LUCENE-5013:
-------------------------------------

bq. I have one question though, whether it is too aggressive

You do indeed have a point I never thought of before. It makes a lot of sense 
to also go from ø,ö,oe->ø for those that are using a Scandinavian keyboard. 
This is a feature I too want now.

But the problem isn't just that we use ä and you use æ, it's native and 
non-native speakers sitting in front of the wrong sort of keyboard. In that 
situation Swedish people will most definitely write raksmorgas when searching 
for räksmörgås and most probably blabarsyltetoj when searching for 
blåbærsyltetøj, while my guess is that an American would write raksmorgas and 
blabaersyltetoj.
 

I ran a test to see how bad the Norwegian mismatches are using the "Norsk 
scrabbleforbund" dictionary:

593526 Norwegian words in dictionary.
  4698 Norwegian mismatches using ScandinavianNormalizerFilter.
  3943 Norwegian mismatches using ASCIIFoldingFilter.

That's roughly 0.79% and 0.66% respectively. I find that totally acceptable, but I also 
suppose it depends on how you implement your index. If you're indexing nothing 
but the folded text then it might be a problem, but if it's something secondary 
on a disjunction with a lower boost, then it's hopefully just a matter of a few 
extra CPU-cycles and FS-seeks.
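
A count like the one above could be reproduced roughly as follows. This is only 
a sketch: it assumes that a "mismatch" means a word whose folded form collides 
with some other, different word in the dictionary, which may not be exactly how 
the figures above were produced. The names count_mismatches, mismatch_rate and 
TOY_FOLD, and the four-word toy list, are made up for illustration and are not 
part of Lucene.

```python
from collections import defaultdict


def count_mismatches(words, fold):
    """Count words whose folded form is shared with another distinct word."""
    groups = defaultdict(set)
    for word in words:
        groups[fold(word)].add(word)
    # Every word in a group of two or more counts as a mismatch (assumption).
    return sum(len(group) for group in groups.values() if len(group) > 1)


def mismatch_rate(mismatches, total):
    """Mismatches as a percentage of the dictionary size."""
    return 100.0 * mismatches / total


# Toy stand-in for a folding filter (hypothetical, far simpler than the real ones).
TOY_FOLD = str.maketrans("øô", "oo")
toy_words = ["for", "før", "fôr", "hus"]
toy_count = count_mismatches(toy_words, lambda w: w.translate(TOY_FOLD))

# Rates for the counts reported above, out of 593526 dictionary words.
scandinavian_rate = mismatch_rate(4698, 593526)
ascii_rate = mismatch_rate(3943, 593526)
```

In the toy list, "for", "før" and "fôr" all fold to "for", so three of the four 
words count as mismatches under this definition.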

                
> ScandinavianInterintelligableASCIIFoldingFilter
> -----------------------------------------------
>
>                 Key: LUCENE-5013
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5013
>             Project: Lucene - Core
>          Issue Type: New Feature
>          Components: modules/analysis
>    Affects Versions: 4.3
>            Reporter: Karl Wettin
>            Priority: Trivial
>         Attachments: LUCENE-5013.txt
>
>
> This filter is an augmentation of the output from ASCIIFoldingFilter:
> it discriminates against the double vowels aa, ae, ao, oe and oo, leaving just 
> the first one.
> blåbærsyltetøj == blåbärsyltetöj == blaabaarsyltetoej == blabarsyltetoj
> räksmörgås == ræksmørgås == ræksmörgaos == raeksmoergaas == raksmorgas
> Caveats:
> Since this filter runs on top of ASCIIFoldingFilter, äöåøæ have already been 
> folded down to a, o, a, o, ae by the time this filter handles them, which 
> causes effects such as:
> bøen -> boen -> bon
> åene -> aene -> ane
> I find this to be a trivial problem compared to not finding anything at all.
> Background:
> Swedish åäö are in fact the same letters as Norwegian and Danish åæø and thus 
> interchangeable when used between these languages. They are however folded 
> differently when people type them on a keyboard lacking these characters, and 
> ASCIIFoldingFilter handles ä and æ differently.
> When a Swedish person is lacking umlauted characters on the keyboard they 
> consistently type a, a, o instead of å, ä, ö. Foreigners also tend to use a, 
> a, o.
> In Norway people tend to type aa, ae and oe instead of å, æ and ø. Some use 
> a, a, o. I've also seen oo, ao, etc. And permutations. Not sure about Denmark 
> but the pattern is probably the same.
> This filter solves that problem, but might also cause new ones.
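
The folding described in the issue can be sketched in a few lines of Python. 
This is not the attached patch and not Lucene code: it assumes lowercase input, 
models only the five characters å, ä, æ, ö, ø (the real ASCIIFoldingFilter 
folds far more), and the names ASCII_FOLD, DOUBLE_VOWELS and fold are made up 
for illustration.

```python
import re

# Assumed first step, modelled on the ASCII folding described in the issue:
# å, ä -> a; ö, ø -> o; æ -> ae. Lowercase only.
ASCII_FOLD = str.maketrans({"å": "a", "ä": "a", "æ": "ae", "ö": "o", "ø": "o"})

# The double vowels named in the issue; only the first letter is kept.
DOUBLE_VOWELS = re.compile(r"aa|ae|ao|oe|oo")


def fold(word):
    """ASCII-fold a lowercase word, then collapse the listed double vowels."""
    ascii_only = word.translate(ASCII_FOLD)
    return DOUBLE_VOWELS.sub(lambda m: m.group(0)[0], ascii_only)


# Spellings the issue lists as equivalent.
blueberry = {fold(w) for w in
             ["blåbærsyltetøj", "blåbärsyltetöj", "blaabaarsyltetoej"]}
shrimp = {fold(w) for w in
          ["räksmörgås", "ræksmørgås", "ræksmörgaos", "raeksmoergaas"]}
```

Each set ends up with a single element, "blabarsyltetoj" and "raksmorgas" 
respectively. The caveats show up as well: fold("bøen") gives "bon" and 
fold("åene") gives "ane".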
