Re: char_filter for German

Andrej Rosenheinrich Wed, 04 Feb 2015 06:45:55 -0800

Hello Jörg,

could you maybe share the configuration for the german_normalize analyzer 
without stemming? I actually only need the umlaut expansion. And what do 
you mean by "at the right places in words" for snowball?


Thanks!
Andrej

On Sunday, 30 November 2014 17:20:16 UTC+1, Jörg Prante wrote:
>
> Do not use a regex; it will give wrong results.
>
> Elasticsearch comes with full support for German umlaut handling.
>
> If you install the ICU plugin, you can use something like this analysis setting:
>
> {
>     "index" : {
>         "analysis" : {
>             "filter" : {
>                 "german_normalize_stem" : {
>                   "type" : "snowball",
>                   "name" : "German2"
>                 }
>             },
>             "analyzer" : {
>                 "stemmed" : {
>                     "type" : "custom",
>                     "tokenizer" : "standard",
>                     "filter" : [
>                         "lowercase",
>                         "icu_normalizer",
>                         "icu_folding",
>                         "german_normalize_stem"
>                     ]
>                 },
>                 "unstemmed" : {
>                     "type" : "custom",
>                     "tokenizer" : "standard",
>                     "filter" : [
>                         "lowercase",
>                         "icu_normalizer",
>                         "icu_folding",
>                         "german_normalize"
>                     ]
>                 }
>             }
>         }
>     }
> }
>
> ICU handles German umlauts, and also case folding such as "ß" to "ss".
>
> Snowball handles umlaut expansions (ae, oe, ue) at the right places in 
> words.
>
> You can choose between stemmed and unstemmed analysis. Snowball tends to 
> overstem words. The "german_normalize" token filter is copied from Snowball 
> but works without stemming.
>
> The effect of the combination is that all German spellings such as Jörg, 
> Joerg, and Jorg are reduced to jorg in the index.
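
To make "at the right places in words" concrete, here is a minimal Python sketch of the kind of folding described above. It is not the Snowball or Elasticsearch source: the function name `fold_german` is made up for illustration, and the rule it assumes ("ae" and "oe" always fold, "ue" folds only when the "u" does not follow a vowel or "q") is a simplified reading of the German2 heuristics.

```python
import unicodedata


def fold_german(token: str) -> str:
    """Fold umlauts, their ae/oe/ue spellings, and ß to plain ASCII letters.

    Hypothetical helper for illustration only; assumes a single word as input.
    """
    out = []
    # N: ordinary, U: a following 'e' is dropped, V: a following 'e' is kept
    state = "N"
    for ch in unicodedata.normalize("NFC", token).lower():
        if ch in "ao":
            out.append(ch)
            state = "U"
        elif ch == "u":
            out.append(ch)
            # a 'u' right after a vowel or 'q' must not swallow the next 'e'
            state = "U" if state == "N" else "V"
        elif ch == "e":
            if state != "U":
                out.append(ch)
            state = "V"
        elif ch in "iqy":
            out.append(ch)
            state = "V"
        elif ch in "äöü":
            out.append({"ä": "a", "ö": "o", "ü": "u"}[ch])
            state = "V"
        elif ch == "ß":
            out.append("ss")
            state = "N"
        else:
            out.append(ch)
            state = "N"
    return "".join(out)


if __name__ == "__main__":
    for word in ["Jörg", "Joerg", "Jorg", "ueber", "Quelle", "Mauer"]:
        print(word, "->", fold_german(word))
```

With this rule "ueber" folds to "uber", while "Quelle" and "Mauer" keep their "ue". A word such as "Poesie" is still folded wrongly (to "posie"), which is the price of working without a dictionary.
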
>
> Best,
>
> Jörg
>
>
> On Sun, Nov 30, 2014 at 11:37 AM, Krešimir Slugan <[email protected]> wrote:
>
>> Hi Jürgen,
>>
>> Currently we don't have large volumes of data to index, so we would like to 
>> return more results in the hope that the relevant ones still show up at the 
>> top. In the future, when we have more data, we'll have to sacrifice some use 
>> cases in order to provide more precise results for the rest of the users.
>>
>> I think I will try the regexp token approach and replace umlauts with their 
>> "e" forms to solve this double expansion problem.
>>
>> Best,
>>
>> Krešimir
>>
>> On Saturday, November 29, 2014 11:23:47 PM UTC+1, Jürgen Wagner (DVT) 
>> wrote:
>>>
>>>  Hi Krešimir,
>>>   the correct term is "über" (over, above) or "hören" (hear) or "ändern" 
>>> (change). When you cannot write umlauts, the correct alternative spelling 
>>> in print is "ueber", "hoeren", "aendern". Everybody can write this in 
>>> ASCII. However, those who are possibly non-speakers of German who still 
>>> want to search for German terms are usually not aware of this and believe 
>>> it's like with accents in French, where "à" is lexically treated like "a". 
>>> Those users are wrong in spelling "uber", "horen", "andern" because "u" and 
>>> "ü" are in fact different letters. It's like "ll" in Spanish. "ll" is ONE 
>>> letter :-)
>>>
>>> However, in order to provide a convenience to those users as well,  you 
>>> could decide that - to yield at least some meaningful results - you will 
>>> also consider the versions without the umlaut dots equivalent. In that 
>>> case, you want to map any token containing an umlaut (ä, ö, ü) to three 
>>> alternatives: umlaut, without umlaut marker, alternative spelling with 'e'. 
>>> This won't let you distinguish between the "Bar" (bar, the place to get a 
>>> drink) and "Bär" (bear, the one giving you a great, dangerous hug). 
>>> "Forderung" (demand) and "Förderung" (encouragement, facilitation, 
>>> promotion, extraction [geol.]) are also quite different, just to give a few 
>>> examples.
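
The three-way mapping described above (umlaut, no dots, "e" spelling) can be sketched in a few lines of Python. The helper name `umlaut_variants` is hypothetical, and how the variants are fed into the index or the query is left open here.

```python
from typing import Set

# umlaut -> (form without dots, alternative spelling with 'e')
UMLAUTS = {"ä": ("a", "ae"), "ö": ("o", "oe"), "ü": ("u", "ue")}


def umlaut_variants(token: str) -> Set[str]:
    """Return the lowercased token plus its dot-less and 'e'-spelled forms."""
    token = token.lower()
    plain = token
    expanded = token
    for umlaut, (bare, digraph) in UMLAUTS.items():
        plain = plain.replace(umlaut, bare)
        expanded = expanded.replace(umlaut, digraph)
    # a token without umlauts collapses to a single-element set
    return {token, plain, expanded}


if __name__ == "__main__":
    for word in ["Bär", "Förderung", "Bar"]:
        print(word, "->", sorted(umlaut_variants(word)))
```

Note that "Bär" yields "bar" as one of its forms, which is exactly the Bar/Bär collision mentioned above.
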
>>>
>>> For the proper recognition of those terms, you would normally use a 
>>> dictionary of German, including some frequent proper names as well. So, if 
>>> you look for "clown boll", you would not only get "Der Clown im Advent - 
>>> Evangelische Akademie Bad Boll", but also "Heinrich Böll, Ansichten eines 
>>> Clowns", because the query would be transformed into "clown AND (boll OR 
>>> boell OR böll)" as "boll" matches an umlaut candidate in your dictionary. 
>>> If you dare to normalize your indexed texts, so "Boell" would already have 
>>> been turned into "Böll", you could even do with a disjunction of only the 
>>> one correct form and the misspelling. Again, however, you would make use of 
>>> a dictionary to perform such normalization. Ideally, you would even have a 
>>> POS tagger in place, so you would only make such replacements where the 
>>> name Böll is referred to, not the city of Bad Boll.
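
The dictionary-driven query expansion from the "clown boll" example might look roughly like this in Python. The two-word dictionary and the names `spellings`, `build_candidates` and `expand_query` are assumptions made for the sketch; a real setup would need a full German word list and, as noted above, ideally a POS tagger.

```python
from typing import Dict, Iterable, Set

_SPELLINGS = {"ä": ("a", "ae"), "ö": ("o", "oe"), "ü": ("u", "ue")}


def spellings(word: str) -> Set[str]:
    """Lowercased word plus its dot-less and 'e'-spelled forms."""
    word = word.lower()
    plain, expanded = word, word
    for umlaut, (bare, digraph) in _SPELLINGS.items():
        plain = plain.replace(umlaut, bare)
        expanded = expanded.replace(umlaut, digraph)
    return {word, plain, expanded}


def build_candidates(dictionary: Iterable[str]) -> Dict[str, Set[str]]:
    """Map every spelling of an umlaut word to all spellings of that word."""
    candidates: Dict[str, Set[str]] = {}
    for word in dictionary:
        forms = spellings(word)
        if len(forms) == 1:
            continue  # no umlaut in this word, nothing to expand
        for form in forms:
            candidates.setdefault(form, set()).update(forms)
    return candidates


def expand_query(query: str, candidates: Dict[str, Set[str]]) -> str:
    """AND the query terms; OR in alternatives for umlaut candidates."""
    clauses = []
    for term in query.lower().split():
        others = sorted(candidates.get(term, set()) - {term})
        if others:
            clauses.append("(" + " OR ".join([term] + others) + ")")
        else:
            clauses.append(term)
    return " AND ".join(clauses)


if __name__ == "__main__":
    cands = build_candidates(["Böll", "Bär"])  # toy dictionary
    print(expand_query("clown boll", cands))
```

Terms that match no dictionary entry, like "clown", pass through unchanged, so the expansion only widens the query where an umlaut reading is plausible.
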
>>>
>>> It's a question of how much effort makes sense for your application. If 
>>> you just want to index some German text, maybe you just want to turn all 
>>> umlauts into the plain vowels for the purpose of indexing, but still keep 
>>> the reference to the original for result display. Maybe that's sufficient. 
>>> For larger volumes of documents, a more precise approach is recommended to 
>>> avoid false positives.
>>>
>>> Cheers,
>>> --Jürgen
>>>
>>>
>>> On 29.11.2014 20:35, Krešimir Slugan wrote:
>>>  
>>> Because, as far as I understand, in German it's semantically the same to 
>>> write über or ueber (although ueber is used less often). I guess the only 
>>> exception is personal names. 
>>> Orthographically, "uber" is wrong, but users sometimes search for it as well.
>>>  
>>>  
>>> -- 
>>>
>>> Mit freundlichen Grüßen/Kind regards/Cordialement vôtre/Atentamente/С 
>>> уважением
>>> *i.A. Jürgen Wagner*
>>> Head of Competence Center "Intelligence"
>>> & Senior Cloud Consultant 
>>>
>>> Devoteam GmbH, Industriestr. 3, 70565 Stuttgart, Germany
>>> Phone: +49 6151 868-8725, Fax: +49 711 13353-53, Mobile: +49 171 864 
>>> 1543
>>> E-Mail: [email protected], URL: www.devoteam.de
>>> ------------------------------
>>> Managing Board: Jürgen Hatzipantelis (CEO)
>>> Address of Record: 64331 Weiterstadt, Germany; Commercial Register: 
>>> Amtsgericht Darmstadt HRB 6450; Tax Number: DE 172 993 071 
>>>
>>>
>>>    -- 
>> You received this message because you are subscribed to the Google Groups 
>> "elasticsearch" group.
>> To unsubscribe from this group and stop receiving emails from it, send an 
>> email to [email protected].
>> To view this discussion on the web visit 
>> https://groups.google.com/d/msgid/elasticsearch/7592379b-0c07-4973-a705-f70a388c285d%40googlegroups.com.
>>
>> For more options, visit https://groups.google.com/d/optout.
>>
>
>

