Thanks! I assume that "german_normalize" is also part of the Decompounder Analysis Plugin (https://github.com/jprante/elasticsearch-analysis-decompound), since that is the only analysis plugin we have installed?
Btw, "german_normalization" doesn't seem to be available for our ES version (1.2). Would you recommend upgrading instead of using "german_normalize"?

Best,
Kresimir

On Wednesday, March 11, 2015 at 5:31:40 PM UTC+1, Jörg Prante wrote:
>
> Use "german_normalization"
>
> "german_normalize" is the same filter I implemented in my plugin
> https://github.com/jprante/elasticsearch-analysis-german/blob/master/src/main/java/org/xbib/elasticsearch/index/analysis/german/GermanAnalysisBinderProcessor.java
> when it was not available in ES core.
>
> Jörg
>
> On Wed, Mar 11, 2015 at 3:11 PM, Krešimir Slugan <[email protected]> wrote:
>
>> Where is this "german_normalize" filter coming from? It solves my problem
>> completely and magically, but it's not documented anywhere (and it seems
>> it's not part of the ICU plugin either).
>>
>> What is also weird is that the filter cannot be used in a global context, e.g.
>> it's not possible to try something like this:
>>
>> curl -XGET 'localhost:9200/_analyze?tokenizer=whitespace&filters=lowercase,german_normalize' -d 'this is a test'
>>
>> but it is possible to use it in an index context:
>>
>> curl -XGET 'localhost:9200/test_index/_analyze?tokenizer=whitespace&filters=lowercase,german_normalize' -d 'this is a test'
>>
>> In the first case I get "ElasticsearchIllegalArgumentException[failed to
>> find global token filter under [german_normalize]]"
>>
>> On Sunday, November 30, 2014 at 5:20:16 PM UTC+1, Jörg Prante wrote:
>>
>>> Do not use regex, this will give wrong results.
>>>
>>> Elasticsearch comes with full support for German umlaut handling.
>>>
>>> If you install the ICU plugin, you can use something like this analysis
>>> setting
>>>
>>> {
>>>   "index" : {
>>>     "analysis" : {
>>>       "filter" : {
>>>         "german_normalize_stem" : {
>>>           "type" : "snowball",
>>>           "name" : "German2"
>>>         }
>>>       },
>>>       "analyzer" : {
>>>         "stemmed" : {
>>>           "type" : "custom",
>>>           "tokenizer" : "standard",
>>>           "filter" : [
>>>             "lowercase",
>>>             "icu_normalizer",
>>>             "icu_folding",
>>>             "german_normalize_stem"
>>>           ]
>>>         },
>>>         "unstemmed" : {
>>>           "type" : "custom",
>>>           "tokenizer" : "standard",
>>>           "filter" : [
>>>             "lowercase",
>>>             "icu_normalizer",
>>>             "icu_folding",
>>>             "german_normalize"
>>>           ]
>>>         }
>>>       }
>>>     }
>>>   }
>>> }
>>>
>>> ICU handles German umlauts, and also case folding like "ss" and "ß".
>>>
>>> Snowball handles umlaut expansions (ae, oe, ue) at the right places in
>>> words.
>>>
>>> You can choose between stemmed and unstemmed analysis. Snowball tends to
>>> overstem words. The "german_normalize" token filter is copied from Snowball
>>> but works without stemming.
>>>
>>> The effect of the combination is that all spellings of a German word, like
>>> Jörg, Joerg, Jorg, are reduced to jorg in the index.
>>>
>>> Best,
>>>
>>> Jörg
>>>
>>> On Sun, Nov 30, 2014 at 11:37 AM, Krešimir Slugan <[email protected]> wrote:
>>>
>>>> Hi Jürgen,
>>>>
>>>> Currently we don't have big volumes of data to index, so we would like
>>>> to yield more results in the hope that the proper ones would still be shown
>>>> at the top. In future, when we have more data, we'll have to sacrifice some
>>>> use cases in order to provide more precise results for the rest of the users.
>>>>
>>>> I think I will try the regexp token approach to replace umlauts with "e"
>>>> forms to solve this double expansion problem.
>>>>
>>>> Best,
>>>>
>>>> Krešimir
>>>>
>>>> On Saturday, November 29, 2014 11:23:47 PM UTC+1, Jürgen Wagner (DVT) wrote:
>>>>>
>>>>> Hi Krešimir,
>>>>> the correct term is "über" (over, above) or "hören" (hear) or
>>>>> "ändern" (change).
>>>>> When you cannot write umlauts, the correct alternative
>>>>> spelling in print is "ueber", "hoeren", "aendern". Everybody can write this
>>>>> in ASCII. However, non-speakers of German who still want to search for
>>>>> German terms are usually not aware of this and believe it's like with
>>>>> accents in French, where "á" is lexically treated like "a".
>>>>> Those users are wrong in spelling "uber", "horen", "andern" because "u" and
>>>>> "ü" are in fact different letters. It's like "ll" in Spanish. "ll" is ONE
>>>>> letter :-)
>>>>>
>>>>> However, in order to provide a convenience to those users as well,
>>>>> you could decide that, to yield at least some meaningful results, you
>>>>> will also consider the versions without the umlaut dots equivalent. In that
>>>>> case, you want to map any token containing an umlaut (ä, ö, ü) to three
>>>>> alternatives: umlaut, without umlaut marker, alternative spelling with 'e'.
>>>>> This won't let you distinguish between the "Bar" (bar, the place to get a
>>>>> drink) and "Bär" (bear, the one giving you a great, dangerous hug).
>>>>> "Forderung" (demand) and "Förderung" (encouragement, facilitation,
>>>>> promotion, extraction [geol.]) are also quite different, just to give a few
>>>>> examples.
>>>>>
>>>>> For the proper recognition of those terms, you would normally use a
>>>>> dictionary of German, including some frequent proper names as well. So, if
>>>>> you look for "clown boll", you would not only get "Der Clown im Advent -
>>>>> Evangelische Akademie Bad Boll", but also "Heinrich Böll, Ansichten eines
>>>>> Clowns", because the query would be transformed into "clown AND (boll OR
>>>>> boell OR böll)" as "boll" matches an umlaut candidate in your dictionary.
>>>>> If you dare to normalize your indexed texts, so that "Boell" would already
>>>>> have been turned into "Böll", you could even do with a disjunction of only
>>>>> the one correct form and the misspelling. Again, however, you would make
>>>>> use of a dictionary to perform such normalization. Ideally, you would even
>>>>> have a POS tagger in place, so you would only make such replacements where
>>>>> the name Böll is referred to, not the town of Bad Boll.
>>>>>
>>>>> It's a question of how much effort makes sense for your application.
>>>>> If you just want to index some German text, maybe you just want to turn all
>>>>> umlauts into the plain vowels for the purpose of indexing, but still keep
>>>>> the reference to the original for result display. Maybe that's sufficient.
>>>>> For larger volumes of documents, a more precise approach is recommended to
>>>>> avoid false positives.
>>>>>
>>>>> Cheers,
>>>>> --Jürgen
>>>>>
>>>>> On 29.11.2014 20:35, Krešimir Slugan wrote:
>>>>>
>>>>> Because, as far as I understand, in German it's semantically the same
>>>>> to write über or ueber (although ueber is less often used). I guess
>>>>> personal names are the only exception.
>>>>> Syntactically, "uber" is wrong but users sometimes search for this also.
>>>>>
>>>>> --
>>>>>
>>>>> Mit freundlichen Grüßen/Kind regards/Cordialement vôtre/Atentamente/С уважением
>>>>> i.A. Jürgen Wagner
>>>>> Head of Competence Center "Intelligence"
>>>>> & Senior Cloud Consultant
>>>>>
>>>>> Devoteam GmbH, Industriestr. 3, 70565 Stuttgart, Germany
>>>>> Phone: +49 6151 868-8725, Fax: +49 711 13353-53, Mobile: +49 171 864 1543
>>>>> E-Mail: [email protected], URL: www.devoteam.de
>>>>> ------------------------------
>>>>> Managing Board: Jürgen Hatzipantelis (CEO)
>>>>> Address of Record: 64331 Weiterstadt, Germany; Commercial Register:
>>>>> Amtsgericht Darmstadt HRB 6450; Tax Number: DE 172 993 071
>>>>>
>>>> --
>>>> You received this message because you are subscribed to the Google
>>>> Groups "elasticsearch" group.
>>>> To unsubscribe from this group and stop receiving emails from it, send
>>>> an email to [email protected].
>>>> To view this discussion on the web visit
>>>> https://groups.google.com/d/msgid/elasticsearch/7592379b-0c07-4973-a705-f70a388c285d%40googlegroups.com.
>>>> For more options, visit https://groups.google.com/d/optout.
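
As a rough illustration of the three-way mapping Jürgen describes in the quoted message (umlaut, alternative spelling with 'e', and the form without the umlaut dots), here is a minimal Python sketch. The names `umlaut_variants` and `or_clause` are made up for this example and are not part of Elasticsearch or any plugin. It only goes from the umlaut spelling to its variants; the reverse direction ("boll" to "böll") needs a dictionary, as noted in the thread, and "ß" is not handled here.

```python
from typing import List

# Hypothetical lookup tables for this sketch only.
UMLAUT_TO_E_FORM = {"ä": "ae", "ö": "oe", "ü": "ue"}
UMLAUT_TO_BASE = {"ä": "a", "ö": "o", "ü": "u"}


def umlaut_variants(token: str) -> List[str]:
    """Return the lowercased token, its e-form and its dot-less form, without duplicates."""
    lowered = token.lower()
    e_form = "".join(UMLAUT_TO_E_FORM.get(ch, ch) for ch in lowered)
    base = "".join(UMLAUT_TO_BASE.get(ch, ch) for ch in lowered)
    variants: List[str] = []
    for candidate in (lowered, e_form, base):
        if candidate not in variants:
            variants.append(candidate)
    return variants


def or_clause(token: str) -> str:
    """Build a query fragment that matches any spelling variant of the token."""
    variants = umlaut_variants(token)
    if len(variants) == 1:
        return variants[0]
    return "(" + " OR ".join(variants) + ")"


if __name__ == "__main__":
    print(umlaut_variants("Böll"))
    print("clown AND " + or_clause("Böll"))
```

Note that this kind of expansion has exactly the drawback mentioned in the thread: "Bär" and "Bar" end up matching each other, so it trades precision for recall.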
