[ 
https://issues.apache.org/jira/browse/LUCENE-5013?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13665224#comment-13665224
 ] 

Karl Wettin commented on LUCENE-5013:
-------------------------------------

A nice comment appeared on java-users; I'm pasting it in here to gather 
everything in one place.



On 22 May 2013, at 20:29, Petite Abeille wrote:


On May 22, 2013, at 7:08 PM, Karl Wettin <karl.wet...@kodapan.se> wrote:

* Use a filter after ASCIIFoldingFilter that discriminates against all uses of 
ae, oe, oo, and other combinations of double vowels, keeping just the first one.

I ended up with that solution.

https://issues.apache.org/jira/browse/LUCENE-5013

Interesting problem… perhaps you could generalize your solution a bit. In 
German, for example, one could substitute 'ue' for 'ü', and so on, so it looks 
like what you are after is folding double vowels, irrespective of how they 
got there.

So, assuming something along the lines of Sean M. Burke's Unidecode [1] for the 
ASCII transliteration, all that's left is to fold double vowels, e.g.:

-- Transliterate to ASCII, lower-case, then keep only the first vowel of a pair.
-- The outer parentheses drop gsub's second return value (the match count).
local function fold( s )
  return ( Unidecode( s ):lower():gsub( '([aeiou]?)([aeiou]?)', '%1' ) )
end

local words = {
  'blåbærsyltetøj', 'blåbärsyltetöj', 'blaabaarsyltetoej', 'blabarsyltetoj',
  'Räksmörgås', 'Göteborg', 'Gøteborg', 'Über', 'ueber', 'uber', 'uuber',
}

for i, word in ipairs( words ) do
  print( i, fold( word ) )
end

1       blabarsyltetoj
2       blabarsyltetoj
3       blabarsyltetoj
4       blabarsyltetoj
5       raksmorgas
6       goteborg
7       goteborg        
8       uber    
9       uber    
10      uber    
11      uber    
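
For anyone who wants to try the same idea without a Unidecode binding, here is 
a minimal sketch in Python. The TRANSLIT table is a hypothetical stand-in for 
Unidecode that covers only the letters used in the examples above, and 
fold_double_vowels is an illustrative name, not part of any library.

```python
import re

# Hypothetical stand-in for Unidecode: only the letters used in this thread.
TRANSLIT = str.maketrans({
    "å": "a", "ä": "a", "æ": "ae",
    "ö": "o", "ø": "o", "ü": "u",
})

# A vowel followed by a second vowel; only the first one is kept.
DOUBLE_VOWEL = re.compile(r"([aeiou])[aeiou]")


def fold_double_vowels(word):
    """Lower-case, transliterate to ASCII, then keep the first vowel of each pair."""
    ascii_word = word.lower().translate(TRANSLIT)
    return DOUBLE_VOWEL.sub(r"\1", ascii_word)


if __name__ == "__main__":
    for w in ["blåbærsyltetøj", "blaabaarsyltetoej", "Räksmörgås", "Über", "ueber"]:
        print(w, fold_double_vowels(w))
```

Note that this folds every vowel pair, not just the Scandinavian ones, so it is 
more aggressive than a filter restricted to a fixed list of pairs.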



[1] http://search.cpan.org/~sburke/Text-Unidecode-0.04/lib/Text/Unidecode.pm

                
> ScandinavianInterintelligableASCIIFoldingFilter
> -----------------------------------------------
>
>                 Key: LUCENE-5013
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5013
>             Project: Lucene - Core
>          Issue Type: New Feature
>          Components: modules/analysis
>    Affects Versions: 4.3
>            Reporter: Karl Wettin
>            Priority: Trivial
>         Attachments: LUCENE-5013.txt
>
>
> This filter augments the output of ASCIIFoldingFilter: it discriminates 
> against the double vowels aa, ae, ao, oe and oo, leaving just the first one.
> blåbærsyltetøj == blåbärsyltetöj == blaabaarsyltetoej == blabarsyltetoj
> räksmörgås == ræksmørgås == ræksmörgaos == raeksmoergaas == raksmorgas
> Caveats:
> Since this filter runs on top of ASCIIFoldingFilter, äöåøæ have already been 
> folded down to aoaoae by the time it sees them, which causes effects such 
> as:
> bøen -> boen -> bon
> åene -> aene -> ane
> I find this to be a trivial problem compared to not finding anything at all.
> Background:
> Swedish åäö are in fact the same letters as Norwegian and Danish åæø and thus 
> interchangeable between these languages. They are, however, folded 
> differently when people type them on a keyboard lacking these characters, and 
> ASCIIFoldingFilter handles ä and æ differently.
> When a Swedish person is lacking umlauted characters on the keyboard they 
> consistently type a, a, o instead of å, ä, ö. Foreigners also tend to use a, 
> a, o.
> In Norway people tend to type aa, ae and oe instead of å, æ and ø. Some use 
> a, a, o. I've also seen oo, ao, etc. And permutations. Not sure about Denmark 
> but the pattern is probably the same.
> This filter solves that problem, but might also cause new ones.
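
The rule in the description above, including its caveats, can be sketched in a 
few lines of Python. This is not the attached patch: ASCII_FOLD is a 
hypothetical stand-in for ASCIIFoldingFilter covering only äöåøæ, the 
lower-casing is added for convenience, and scandinavian_fold is an 
illustrative name.

```python
import re

# Hypothetical stand-in for ASCIIFoldingFilter, limited to the letters
# named in the issue description.
ASCII_FOLD = str.maketrans({
    "å": "a", "ä": "a", "æ": "ae",
    "ö": "o", "ø": "o",
})

# Only the pairs listed in the description: aa, ae, ao, oe and oo.
PAIRS = re.compile(r"aa|ae|ao|oe|oo")


def scandinavian_fold(term):
    """ASCII-fold the term, then keep only the first vowel of each listed pair."""
    folded = term.lower().translate(ASCII_FOLD)  # lower-casing is an assumption
    return PAIRS.sub(lambda m: m.group(0)[0], folded)


if __name__ == "__main__":
    for t in ["räksmörgås", "ræksmørgås", "ræksmörgaos", "raeksmoergaas"]:
        print(t, scandinavian_fold(t))   # all four collapse to the same term
    for t in ["bøen", "åene"]:
        print(t, scandinavian_fold(t))   # the caveat: bon, ane
```

Because the pair list is fixed, terms such as "ueber" are left untouched here, 
unlike with a fold that applies to every vowel pair.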

