Re: char_filter for German

Krešimir Slugan Wed, 11 Mar 2015 13:56:14 -0700

Thanks!

I assume that "german_normalize" is also part of Decompounder Analysis 
Plugin ( https://github.com/jprante/elasticsearch-analysis-decompound ) 
since that is the only analysis plugin we have installed?


By the way, "german_normalization" doesn't seem to be available in our ES 
version (1.2). Would you recommend upgrading instead of using 
"german_normalize"?

Best,

Kresimir

On Wednesday, March 11, 2015 at 5:31:40 PM UTC+1, Jörg Prante wrote:
>
> Use "german_normalization"
>
> "german_normalize" is the same filter I implemented in my plugin 
> https://github.com/jprante/elasticsearch-analysis-german/blob/master/src/main/java/org/xbib/elasticsearch/index/analysis/german/GermanAnalysisBinderProcessor.java
> when it was not available in ES core.
>
> Jörg
>
> On Wed, Mar 11, 2015 at 3:11 PM, Krešimir Slugan <[email protected]> wrote:
>
>>
>> Where is this "german_normalize" filter coming from? It solves my problem 
>> completely and magically but it's not documented anywhere (and seems like 
>> it's not part of ICU plugin either). 
>>
>>  
>>
>> What is also weird is that the filter cannot be used in a global context, 
>> e.g. it's not possible to try something like this: 
>>
>> curl -XGET 'localhost:9200/_analyze?tokenizer=whitespace&filters=lowercase,german_normalize' -d 'this is a test'
>>
>> but it is possible to use it in an index context:
>>
>> curl -XGET 'localhost:9200/test_index/_analyze?tokenizer=whitespace&filters=lowercase,german_normalize' -d 'this is a test'
>>
>>
>> In the first case I get "*ElasticsearchIllegalArgumentException[failed to 
>> find global token filter under [german_normalize]]*"
>>
>>
>> On Sunday, November 30, 2014 at 5:20:16 PM UTC+1, Jörg Prante wrote:
>>
>>> Do not use regex; it will give wrong results.
>>>
>>> Elasticsearch comes with full support for German umlaut handling.
>>>
>>> If you install the ICU plugin, you can use something like this analysis 
>>> setting:
>>>
>>> {
>>>     "index" : {
>>>         "analysis" : {
>>>             "filter" : {
>>>                 "german_normalize_stem" : {
>>>                   "type" : "snowball",
>>>                   "name" : "German2"
>>>                 }
>>>             },
>>>             "analyzer" : {
>>>                 "stemmed" : {
>>>                     "type" : "custom",
>>>                     "tokenizer" : "standard",
>>>                     "filter" : [
>>>                         "lowercase",
>>>                         "icu_normalizer",
>>>                         "icu_folding",
>>>                         "german_normalize_stem"
>>>                     ]
>>>                 },
>>>                 "unstemmed" : {
>>>                     "type" : "custom",
>>>                     "tokenizer" : "standard",
>>>                     "filter" : [
>>>                         "lowercase",
>>>                         "icu_normalizer",
>>>                         "icu_folding",
>>>                         "german_normalize"
>>>                     ]
>>>                 }
>>>             }
>>>         }
>>>     }
>>> }
>>>
>>> ICU handles German umlauts, and also case folding like "ss" and "ß".
>>>
>>> Snowball handles umlaut expansions (ae, oe, ue) at the right places in 
>>> words.
>>>
>>> You can choose between stemmed and unstemmed analysis. Snowball tends to 
>>> overstem words. The "german_normalize" token filter is copied from Snowball 
>>> but works without stemming.
>>>
>>> The effect of the combination is that spellings such as Jörg, Joerg and 
>>> Jorg are all reduced to jorg in the index.
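
As a rough illustration of the folding effect described above (Jörg, Joerg and Jorg all ending up as jorg), here is a minimal Python sketch. It is not the Lucene, ICU or plugin implementation. `fold_german` is a hypothetical helper, and the rule that "ue" is left alone after a vowel or "q" is an assumption about what "at the right places" means in practice.

```python
import re

# Hypothetical helper for illustration only; not part of Elasticsearch,
# Lucene, ICU, or the analysis plugins mentioned in this thread.
_UMLAUTS = str.maketrans({"ä": "a", "ö": "o", "ü": "u", "ß": "ss"})


def fold_german(token: str) -> str:
    """Lowercase a token and fold umlaut spellings to the base vowel."""
    t = token.lower()
    # Fold the ASCII transliterations "ae" and "oe".
    t = t.replace("ae", "a").replace("oe", "o")
    # Assumption: fold "ue" only when it does not follow a vowel or "q",
    # so words like "quelle" or "treue" are left untouched.
    t = re.sub(r"(?<![aeiouyq])ue", "u", t)
    # Finally fold the real umlaut characters and sharp s.
    return t.translate(_UMLAUTS)


if __name__ == "__main__":
    for word in ["Jörg", "Joerg", "Jorg", "Quelle", "Straße"]:
        print(word, "->", fold_german(word))
```

A plain regex replacement without the "ue" guard would also rewrite words such as "Quelle", which is one way a naive regex approach can give wrong results.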
>>>
>>> Best,
>>>
>>> Jörg
>>>
>>>
>>> On Sun, Nov 30, 2014 at 11:37 AM, Krešimir Slugan <[email protected]> 
>>> wrote:
>>>
>>>> Hi Jürgen,
>>>>
>>>> Currently we don't have big volumes of data to index, so we would like 
>>>> to yield more results in the hope that the proper ones would still be 
>>>> shown at the top. In the future, when we have more data, we'll have to 
>>>> sacrifice some use cases in order to provide more precise results for 
>>>> the rest of the users. 
>>>>
>>>> I think I will try the regexp token approach to replace umlauts with 
>>>> "e" forms to solve this double expansion problem. 
>>>>
>>>> Best,
>>>>
>>>> Krešimir
>>>>
>>>> On Saturday, November 29, 2014 11:23:47 PM UTC+1, Jürgen Wagner (DVT) 
>>>> wrote:
>>>>>
>>>>>  Hi Krešimir,
>>>>>   the correct term is "über" (over, above) or "hören" (hear) or 
>>>>> "ändern" (change). When you cannot write umlauts, the correct alternative 
>>>>> spelling in print is "ueber", "hoeren", "aendern". Everybody can write 
>>>>> this in ASCII. However, those who are possibly non-speakers of German who 
>>>>> still want to search for German terms are usually not aware of this and 
>>>>> believe it's like with accents in French, where "á" is lexically treated 
>>>>> like "a". Those users are wrong in spelling "uber", "horen", "andern" 
>>>>> because "u" and "ü" are in fact different letters. It's like "ll" in 
>>>>> Spanish. "ll" is ONE letter :-)
>>>>>
>>>>> However, in order to provide a convenience to those users as well, 
>>>>> you could decide that - to yield at least some meaningful results - you 
>>>>> will also consider the versions without the umlaut dots equivalent. In 
>>>>> that case, you want to map any token containing an umlaut (ä, ö, ü) to 
>>>>> three alternatives: umlaut, without umlaut marker, alternative spelling 
>>>>> with 'e'. This won't let you distinguish between the "Bar" (bar, the 
>>>>> place to get a drink) and "Bär" (bear, the one giving you a great, 
>>>>> dangerous hug). "Forderung" (demand) and "Förderung" (encouragement, 
>>>>> facilitation, promotion, extraction [geol.]) are also quite different, 
>>>>> just to give a few examples.
>>>>>
>>>>> For the proper recognition of those terms, you would normally use a 
>>>>> dictionary of German, including some frequent proper names as well. So, 
>>>>> if you look for "clown boll", you would not only get "Der Clown im Advent 
>>>>> - Evangelische Akademie Bad Boll", but also "Heinrich Böll, Ansichten 
>>>>> eines Clowns", because the query would be transformed into "clown AND 
>>>>> (boll OR boell OR böll)" as "boll" matches an umlaut candidate in your 
>>>>> dictionary. If you dare to normalize your indexed texts, so "Boell" would 
>>>>> already have been turned into "Böll", you could even do with a 
>>>>> disjunction of only the one correct form and the misspelling. Again, 
>>>>> however, you would make use of a dictionary to perform such 
>>>>> normalization. Ideally, you would even have a POS tagger in place, so you 
>>>>> would only make such replacements where the name Böll is referred to, not 
>>>>> the city of Bad Boll.
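
The dictionary-based expansion described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the tiny word list, `build_index` and `expand_query` are all hypothetical, and a real setup would use a proper German dictionary and the search engine's own query syntax.

```python
# Hypothetical mini-dictionary of German words that contain umlauts.
UMLAUT_WORDS = {"böll", "bär", "förderung"}

_E_FORMS = {"ä": "ae", "ö": "oe", "ü": "ue"}  # alternative spelling with 'e'
_PLAIN = {"ä": "a", "ö": "o", "ü": "u"}       # umlaut dots simply dropped


def _spell(word, table):
    """Rewrite every umlaut in word using the given replacement table."""
    return "".join(table.get(ch, ch) for ch in word)


def build_index(words):
    """Map each spelling of a dictionary word to all three of its variants."""
    index = {}
    for word in words:
        variants = (_spell(word, _PLAIN), _spell(word, _E_FORMS), word)
        for variant in variants:
            index[variant] = variants
    return index


def expand_query(query, index):
    """Turn a whitespace-separated query into an AND of OR-groups."""
    clauses = []
    for term in query.lower().split():
        variants = index.get(term)
        if variants:
            # The term matches an umlaut candidate: search all spellings.
            clauses.append("(" + " OR ".join(variants) + ")")
        else:
            clauses.append(term)
    return " AND ".join(clauses)


if __name__ == "__main__":
    index = build_index(UMLAUT_WORDS)
    print(expand_query("clown boll", index))
```

With this word list, the query "clown boll" expands to the same form quoted in the text. Note that the sketch cannot tell the name Böll from the town Bad Boll; that is the part that would need a POS tagger.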
>>>>>
>>>>> It's a question of how much effort makes sense for your application. 
>>>>> If you just want to index some German text, maybe you just want to turn 
>>>>> all umlauts into the plain vowels for the purpose of indexing, but still 
>>>>> keep the reference to the original for result display. Maybe that's 
>>>>> sufficient. For larger volumes of documents, a more precise approach is 
>>>>> recommended to avoid false positives.
>>>>>
>>>>> Cheers,
>>>>> --Jürgen
>>>>>
>>>>>
>>>>> On 29.11.2014 20:35, Krešimir Slugan wrote:
>>>>>  
>>>>> Because, as far as I understand, in German it's semantically the same 
>>>>> to write über or ueber (although ueber is less often used). I guess the 
>>>>> only exception is personal names. 
>>>>> Syntactically, "uber" is wrong, but users sometimes search for it as 
>>>>> well.
>>>>>  
>>>>>  
>>>>> -- 
>>>>>
>>>>> Mit freundlichen Grüßen/Kind regards/Cordialement vôtre/Atentamente/С 
>>>>> уважением
>>>>> *i.A. Jürgen Wagner*
>>>>> Head of Competence Center "Intelligence"
>>>>> & Senior Cloud Consultant 
>>>>>
>>>>> Devoteam GmbH, Industriestr. 3, 70565 Stuttgart, Germany
>>>>> Phone: +49 6151 868-8725, Fax: +49 711 13353-53, Mobile: +49 171 864 
>>>>> 1543
>>>>> E-Mail: [email protected], URL: www.devoteam.de
>>>>> ------------------------------
>>>>> Managing Board: Jürgen Hatzipantelis (CEO)
>>>>> Address of Record: 64331 Weiterstadt, Germany; Commercial Register: 
>>>>> Amtsgericht Darmstadt HRB 6450; Tax Number: DE 172 993 071 
>>>>>
>>>>>
>>>>>    -- 
>>>> You received this message because you are subscribed to the Google 
>>>> Groups "elasticsearch" group.
>>>> To unsubscribe from this group and stop receiving emails from it, send 
>>>> an email to [email protected].
>>>> To view this discussion on the web visit 
>>>> https://groups.google.com/d/msgid/elasticsearch/7592379b-0c07-4973-a705-f70a388c285d%40googlegroups.com.
>>>>
>>>> For more options, visit https://groups.google.com/d/optout.
>>>>
>>>
>>
>
>
