Where is this "german_normalize" filter coming from? It solves my problem 
completely and magically, but it's not documented anywhere (and it doesn't 
seem to be part of the ICU plugin either).

What is also odd is that the filter cannot be used in a global context, 
e.g. this does not work:

curl -XGET 'localhost:9200/_analyze?tokenizer=whitespace&filters=lowercase,german_normalize' -d 'this is a test'

but it is possible to use it in an index context:

curl -XGET 'localhost:9200/test_index/_analyze?tokenizer=whitespace&filters=lowercase,german_normalize' -d 'this is a test'


In the first case I get "ElasticsearchIllegalArgumentException[failed to 
find global token filter under [german_normalize]]".


On Sunday, November 30, 2014 at 5:20:16 PM UTC+1, Jörg Prante wrote:
>
> Do not use regex; it will give wrong results.
>
> Elasticsearch comes with full support for German umlaut handling.
>
> If you install the ICU plugin, you can use an analysis setting like this:
>
> {
>     "index" : {
>         "analysis" : {
>             "filter" : {
>                 "german_normalize_stem" : {
>                   "type" : "snowball",
>                   "name" : "German2"
>                 }
>             },
>             "analyzer" : {
>                 "stemmed" : {
>                     "type" : "custom",
>                     "tokenizer" : "standard",
>                     "filter" : [
>                         "lowercase",
>                         "icu_normalizer",
>                         "icu_folding",
>                         "german_normalize_stem"
>                     ]
>                 },
>                 "unstemmed" : {
>                     "type" : "custom",
>                     "tokenizer" : "standard",
>                     "filter" : [
>                         "lowercase",
>                         "icu_normalizer",
>                         "icu_folding",
>                         "german_normalize"
>                     ]
>                 }
>             }
>         }
>     }
> }
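>
> For example, you can create a small test index with just the unstemmed 
> analyzer and inspect the tokens (a sketch only; the index name umlaut_test 
> is arbitrary, and the ICU plugin must be installed):
>
> curl -XPUT 'localhost:9200/umlaut_test' -d '{
>     "settings" : {
>         "index" : {
>             "analysis" : {
>                 "analyzer" : {
>                     "unstemmed" : {
>                         "type" : "custom",
>                         "tokenizer" : "standard",
>                         "filter" : [
>                             "lowercase",
>                             "icu_normalizer",
>                             "icu_folding",
>                             "german_normalize"
>                         ]
>                     }
>                 }
>             }
>         }
>     }
> }'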
>
> ICU handles German umlauts, and also folds "ß" to "ss".
>
> Snowball handles umlaut expansions (ae, oe, ue) at the right places in 
> words.
>
> You can choose between stemmed and unstemmed analysis. Snowball tends to 
> overstem words. The "german_normalize" token filter is copied from Snowball 
> but works without stemming.
>
> The effect of the combination is that German word forms like Jörg, Joerg, 
> and Jorg are all reduced to jorg in the index.
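>
> You can check this with the analyze API against such an index; each of 
> the three spellings should come out as the single token jorg:
>
> curl -XGET 'localhost:9200/umlaut_test/_analyze?analyzer=unstemmed' -d 'Jörg'
> curl -XGET 'localhost:9200/umlaut_test/_analyze?analyzer=unstemmed' -d 'Joerg'
> curl -XGET 'localhost:9200/umlaut_test/_analyze?analyzer=unstemmed' -d 'Jorg'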
>
> Best,
>
> Jörg
>
>
> On Sun, Nov 30, 2014 at 11:37 AM, Krešimir Slugan <[email protected]> wrote:
>
>> Hi Jürgen,
>>
>> Currently we don't have big volumes of data to index, so we would like to 
>> yield more results in the hope that the proper ones will still be shown 
>> near the top. In the future, when we have more data, we'll have to 
>> sacrifice some use cases in order to provide more precise results for the 
>> rest of the users. 
>>
>> I think I will try the regexp token approach, replacing umlauts with 
>> their "e" forms, to solve this double expansion problem. 
>>
>> Best,
>>
>> Krešimir
>>
>> On Saturday, November 29, 2014 11:23:47 PM UTC+1, Jürgen Wagner (DVT) 
>> wrote:
>>>
>>> Hi Krešimir,
>>>
>>> The correct term is "über" (over, above) or "hören" (hear) or "ändern" 
>>> (change). When you cannot write umlauts, the correct alternative spelling 
>>> in print is "ueber", "hoeren", "aendern". Everybody can write this in 
>>> ASCII. However, those who are possibly non-speakers of German but still 
>>> want to search for German terms are usually not aware of this and believe 
>>> it's like with accents in French, where "á" is lexically treated like "a". 
>>> Those users are wrong in spelling "uber", "horen", "andern" because "u" and 
>>> "ü" are in fact different letters. It's like "ll" in Spanish: "ll" is ONE 
>>> letter :-)
>>>
>>> However, in order to provide a convenience to those users as well, you 
>>> could decide that - to yield at least some meaningful results - you will 
>>> also consider the versions without the umlaut dots equivalent. In that 
>>> case, you want to map any token containing an umlaut (ä, ö, ü) to three 
>>> alternatives: the umlaut form, the form without the umlaut dots, and the 
>>> alternative spelling with 'e'. This won't let you distinguish between 
>>> "Bar" (bar, the place to get a drink) and "Bär" (bear, the one giving 
>>> you a great, dangerous hug). "Forderung" (demand) and "Förderung" 
>>> (encouragement, facilitation, promotion, extraction [geol.]) are also 
>>> quite different, just to give a few examples.
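>>>
>>> As a sketch, such a three-way expansion could be wired up with a synonym 
>>> token filter whose entries you generate from a word list (the two 
>>> entries here are just illustrative):
>>>
>>> "filter" : {
>>>     "umlaut_variants" : {
>>>         "type" : "synonym",
>>>         "synonyms" : [
>>>             "bär, bar, baer",
>>>             "böll, boll, boell"
>>>         ]
>>>     }
>>> }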
>>>
>>> For the proper recognition of those terms, you would normally use a 
>>> dictionary of German, including some frequent proper names as well. So, if 
>>> you look for "clown boll", you would not only get "Der Clown im Advent - 
>>> Evangelische Akademie Bad Boll", but also "Heinrich Böll, Ansichten eines 
>>> Clowns", because the query would be transformed into "clown AND (boll OR 
>>> boell OR böll)" as "boll" matches an umlaut candidate in your dictionary. 
>>> If you dare to normalize your indexed texts, so that "Boell" would already 
>>> have been turned into "Böll" at index time, you could even make do with a 
>>> disjunction of only the one correct form and the misspelling. Again, 
>>> however, you would make use of a dictionary to perform such normalization. 
>>> Ideally, you would even have a POS tagger in place, so you would only make 
>>> such replacements where the name Böll is referred to, not the city of Bad 
>>> Boll.
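>>>
>>> In Elasticsearch's query DSL, that transformed query for "clown boll" 
>>> might look like this (assuming the documents have a field called text):
>>>
>>> {
>>>     "query" : {
>>>         "bool" : {
>>>             "must" : [
>>>                 { "match" : { "text" : "clown" } },
>>>                 { "bool" : {
>>>                     "should" : [
>>>                         { "match" : { "text" : "boll" } },
>>>                         { "match" : { "text" : "boell" } },
>>>                         { "match" : { "text" : "böll" } }
>>>                     ]
>>>                 } }
>>>             ]
>>>         }
>>>     }
>>> }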
>>>
>>> It's a question of how much effort makes sense for your application. If 
>>> you just want to index some German text, maybe you just want to turn all 
>>> umlauts into the plain vowels for the purpose of indexing, but still keep 
>>> the reference to the original for result display. Maybe that's sufficient. 
>>> For larger volumes of documents, a more precise approach is recommended to 
>>> avoid false positives.
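>>>
>>> In mapping terms, a multi-field keeps both variants around: you search 
>>> the folded sub-field but display the original from _source. A sketch 
>>> (my_folding_analyzer stands for whatever umlaut-stripping analyzer you 
>>> define):
>>>
>>> "properties" : {
>>>     "text" : {
>>>         "type" : "string",
>>>         "fields" : {
>>>             "folded" : {
>>>                 "type" : "string",
>>>                 "analyzer" : "my_folding_analyzer"
>>>             }
>>>         }
>>>     }
>>> }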
>>>
>>> Cheers,
>>> --Jürgen
>>>
>>>
>>> On 29.11.2014 20:35, Krešimir Slugan wrote:
>>>  
>>> Because, as far as I understand, in German it's semantically the same to 
>>> write über or ueber (although ueber is used less often). I guess the only 
>>> exception is personal names. 
>>> Orthographically, "uber" is wrong, but users sometimes search for this also.
>>>  
>>>  
