Use "german_normalization".

"german_normalize" is the same filter I implemented in my plugin
https://github.com/jprante/elasticsearch-analysis-german/blob/master/src/main/java/org/xbib/elasticsearch/index/analysis/german/GermanAnalysisBinderProcessor.java
when it was not yet available in ES core.
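
For anyone following along, here is a minimal sketch of index settings that reference the built-in filter name. The analyzer name "unstemmed" is borrowed from the settings quoted below. Leaving out the ICU filters is an assumption made so the sketch needs no plugin, and the helper name build_unstemmed_settings is hypothetical. Python is used only to assemble and print the JSON body; nothing is sent to a cluster.

```python
import json


def build_unstemmed_settings(normalizer: str = "german_normalization") -> dict:
    """Return index settings with one custom analyzer.

    The analyzer lowercases tokens and then applies the German
    normalization token filter, referenced by its built-in name.
    """
    return {
        "index": {
            "analysis": {
                "analyzer": {
                    "unstemmed": {
                        "type": "custom",
                        "tokenizer": "standard",
                        # Assumption: ICU filters omitted to avoid a plugin dependency.
                        "filter": ["lowercase", normalizer],
                    }
                }
            }
        }
    }


if __name__ == "__main__":
    # Print the JSON body that would be supplied when creating the index.
    print(json.dumps(build_unstemmed_settings(), indent=2))
```

The printed JSON would be supplied as the settings when creating an index such as test_index. The analyzer can then be tried through that index's _analyze endpoint, in the same index context as the second curl call quoted below.
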
Jörg

On Wed, Mar 11, 2015 at 3:11 PM, Krešimir Slugan <[email protected]> wrote:

> Where is this "german_normalize" filter coming from? It solves my problem
> completely and magically, but it's not documented anywhere (and it does not
> seem to be part of the ICU plugin either).
>
> What is also weird is that the filter cannot be used in a global context,
> e.g. it's not possible to try something like this:
>
> curl -XGET 'localhost:9200/_analyze?tokenizer=whitespace&filters=lowercase,german_normalize' -d 'this is a test'
>
> but it is possible to use it in an index context:
>
> curl -XGET 'localhost:9200/test_index/_analyze?tokenizer=whitespace&filters=lowercase,german_normalize' -d 'this is a test'
>
> In the first case I get "*ElasticsearchIllegalArgumentException[failed to
> find global token filter under [german_normalize]]*"
>
> On Sunday, November 30, 2014 at 5:20:16 PM UTC+1, Jörg Prante wrote:
>
>> Do not use regex, this will give wrong results.
>>
>> Elasticsearch comes with full support for German umlaut handling.
>>
>> If you install the ICU plugin, you can use something like this analysis
>> setting:
>>
>> {
>>   "index" : {
>>     "analysis" : {
>>       "filter" : {
>>         "german_normalize_stem" : {
>>           "type" : "snowball",
>>           "name" : "German2"
>>         }
>>       },
>>       "analyzer" : {
>>         "stemmed" : {
>>           "type" : "custom",
>>           "tokenizer" : "standard",
>>           "filter" : [
>>             "lowercase",
>>             "icu_normalizer",
>>             "icu_folding",
>>             "german_normalize_stem"
>>           ]
>>         },
>>         "unstemmed" : {
>>           "type" : "custom",
>>           "tokenizer" : "standard",
>>           "filter" : [
>>             "lowercase",
>>             "icu_normalizer",
>>             "icu_folding",
>>             "german_normalize"
>>           ]
>>         }
>>       }
>>     }
>>   }
>> }
>>
>> ICU handles German umlauts, and also case folding like "ss" and "ß".
>>
>> Snowball handles umlaut expansions (ae, oe, ue) at the right places in
>> words.
>>
>> You can choose between stemmed and unstemmed analysis. Snowball tends to
>> overstem words.
>> The "german_normalize" token filter is copied from Snowball
>> but works without stemming.
>>
>> The effect of the combination is that all German words like Jörg, Joerg,
>> Jorg are reduced to jorg in the index.
>>
>> Best,
>>
>> Jörg
>>
>> On Sun, Nov 30, 2014 at 11:37 AM, Krešimir Slugan <[email protected]>
>> wrote:
>>
>>> Hi Jürgen,
>>>
>>> Currently we don't have big volumes of data to index, so we would like to
>>> yield more results in the hope that the proper ones would still be shown
>>> at the top. In the future, when we have more data, we'll have to sacrifice
>>> some use cases in order to provide more precise results for the rest of
>>> the users.
>>>
>>> I think I will try the regexp token approach to replace umlauts with "e"
>>> forms to solve this double expansion problem.
>>>
>>> Best,
>>>
>>> Krešimir
>>>
>>> On Saturday, November 29, 2014 11:23:47 PM UTC+1, Jürgen Wagner (DVT)
>>> wrote:
>>>>
>>>> Hi Krešimir,
>>>> the correct term is "über" (over, above) or "hören" (hear) or
>>>> "ändern" (change). When you cannot write umlauts, the correct alternative
>>>> spelling in print is "ueber", "hoeren", "aendern". Everybody can write
>>>> this in ASCII. However, people who do not speak German but still want to
>>>> search for German terms are usually not aware of this and believe it's
>>>> like with accents in French, where "á" is lexically treated like "a".
>>>> Those users are wrong in spelling "uber", "horen", "andern" because "u"
>>>> and "ü" are in fact different letters. It's like "ll" in Spanish. "ll" is
>>>> ONE letter :-)
>>>>
>>>> However, in order to provide a convenience to those users as well, you
>>>> could decide that - to yield at least some meaningful results - you will
>>>> also consider the versions without the umlaut dots equivalent. In that
>>>> case, you want to map any token containing an umlaut (ä, ö, ü) to three
>>>> alternatives: umlaut, without umlaut marker, alternative spelling with 'e'.
>>>> This won't let you distinguish between the "Bar" (bar, the place to get a
>>>> drink) and "Bär" (bear, the one giving you a great, dangerous hug).
>>>> "Forderung" (demand) and "Förderung" (encouragement, facilitation,
>>>> promotion, extraction [geol.]) are also quite different, just to give a
>>>> few examples.
>>>>
>>>> For the proper recognition of those terms, you would normally use a
>>>> dictionary of German, including some frequent proper names as well. So, if
>>>> you look for "clown boll", you would not only get "Der Clown im Advent -
>>>> Evangelische Akademie Bad Boll", but also "Heinrich Böll, Ansichten eines
>>>> Clowns", because the query would be transformed into "clown AND (boll OR
>>>> boell OR böll)" as "boll" matches an umlaut candidate in your dictionary.
>>>> If you dare to normalize your indexed texts, so that "Boell" would already
>>>> have been turned into "Böll", you could even make do with a disjunction of
>>>> only the one correct form and the misspelling. Again, however, you would
>>>> make use of a dictionary to perform such normalization. Ideally, you would
>>>> even have a POS tagger in place, so you would only make such replacements
>>>> where the name Böll is referred to, not the city of Bad Boll.
>>>>
>>>> It's a question of how much effort makes sense for your application. If
>>>> you just want to index some German text, maybe you just want to turn all
>>>> umlauts into the plain vowels for the purpose of indexing, but still keep
>>>> the reference to the original for result display. Maybe that's sufficient.
>>>> For larger volumes of documents, a more precise approach is recommended to
>>>> avoid false positives.
>>>>
>>>> Cheers,
>>>> --Jürgen
>>>>
>>>> On 29.11.2014 20:35, Krešimir Slugan wrote:
>>>>
>>>> Because, as far as I understand, in German it's semantically the same
>>>> to write über or ueber (although ueber is less often used). I guess this
>>>> is not true only for personal names.
>>>> Syntactically, "uber" is wrong but users sometimes search for this also.
>>>>
>>>> --
>>>>
>>>> Mit freundlichen Grüßen/Kind regards/Cordialement vôtre/Atentamente/С
>>>> уважением
>>>> *i.A. Jürgen Wagner*
>>>> Head of Competence Center "Intelligence"
>>>> & Senior Cloud Consultant
>>>>
>>>> Devoteam GmbH, Industriestr. 3, 70565 Stuttgart, Germany
>>>> Phone: +49 6151 868-8725, Fax: +49 711 13353-53, Mobile: +49 171 864 1543
>>>> E-Mail: [email protected], URL: www.devoteam.de
>>>> ------------------------------
>>>> Managing Board: Jürgen Hatzipantelis (CEO)
>>>> Address of Record: 64331 Weiterstadt, Germany; Commercial Register:
>>>> Amtsgericht Darmstadt HRB 6450; Tax Number: DE 172 993 071
>>>>
>>> --
>>> You received this message because you are subscribed to the Google
>>> Groups "elasticsearch" group.
>>> To unsubscribe from this group and stop receiving emails from it, send
>>> an email to [email protected].
>>> To view this discussion on the web visit
>>> https://groups.google.com/d/msgid/elasticsearch/7592379b-0c07-4973-a705-f70a388c285d%40googlegroups.com
>>>
>>> For more options, visit https://groups.google.com/d/optout.
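
As a footnote to the thread, the dictionary-driven query expansion Jürgen describes (umlaut form, form without the dots, and "e" spelling, joined into an OR group) can be sketched roughly as follows. This is only an illustration under stated assumptions: the three-word DICTIONARY is a hypothetical stand-in for a real German word list, the function names are made up for this example, and no POS tagging or Elasticsearch query syntax is involved.

```python
# Replacement spellings per umlaut: (without dots, "e" spelling).
UMLAUTS = {"ä": ("a", "ae"), "ö": ("o", "oe"), "ü": ("u", "ue")}

# Hypothetical stand-in for a real dictionary of German words and names.
DICTIONARY = ["Böll", "Bär", "Förderung"]


def spellings(word: str) -> list[str]:
    """Return [umlaut form, form without dots, 'e' spelling], lowercased."""
    lowered = word.lower()
    plain = "".join(UMLAUTS[c][0] if c in UMLAUTS else c for c in lowered)
    e_form = "".join(UMLAUTS[c][1] if c in UMLAUTS else c for c in lowered)
    return [lowered, plain, e_form]


def build_lookup(words: list[str]) -> dict[str, list[str]]:
    """Map every spelling of an umlaut word to all three alternatives."""
    lookup = {}
    for word in words:
        umlaut, plain, e_form = spellings(word)
        if plain == umlaut:
            continue  # no umlaut in this word, nothing to expand
        for key in (umlaut, plain, e_form):
            lookup[key] = [plain, e_form, umlaut]
    return lookup


def expand_query(query: str, lookup: dict[str, list[str]]) -> str:
    """AND the terms together, replacing umlaut candidates by an OR group."""
    parts = []
    for term in query.lower().split():
        alternatives = lookup.get(term)
        if alternatives:
            parts.append("(" + " OR ".join(alternatives) + ")")
        else:
            parts.append(term)
    return " AND ".join(parts)


if __name__ == "__main__":
    print(expand_query("clown boll", build_lookup(DICTIONARY)))
```

With this toy dictionary, the query "clown boll" expands to the form given in Jürgen's message, "clown AND (boll OR boell OR böll)". The same sketch also shows the cost he warns about: a search for "Bar" is widened to include "Bär".
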
