Hello Jörg,
Could you share the configuration for the german_normalize analyzer
without stemming? I only need the umlaut expansion. And what do you mean
by "at the right places in words" for Snowball?
Thanks!
Andrej
On Sunday, 30 November 2014 17:20:16 UTC+1, Jörg Prante wrote:
>
> Do not use regex; this will give wrong results.
>
> Elasticsearch comes with full support for German umlaut handling.
>
> If you install the ICU plugin, you can use something like this analysis
> setting:
>
> {
>   "index" : {
>     "analysis" : {
>       "filter" : {
>         "german_normalize_stem" : {
>           "type" : "snowball",
>           "name" : "German2"
>         }
>       },
>       "analyzer" : {
>         "stemmed" : {
>           "type" : "custom",
>           "tokenizer" : "standard",
>           "filter" : [
>             "lowercase",
>             "icu_normalizer",
>             "icu_folding",
>             "german_normalize_stem"
>           ]
>         },
>         "unstemmed" : {
>           "type" : "custom",
>           "tokenizer" : "standard",
>           "filter" : [
>             "lowercase",
>             "icu_normalizer",
>             "icu_folding",
>             "german_normalize"
>           ]
>         }
>       }
>     }
>   }
> }
>
> ICU handles German umlauts, and also case folding such as "ss" and "ß".
>
> Snowball handles umlaut expansions (ae, oe, ue) at the right places in
> words.
>
> You can choose between stemmed and unstemmed analysis. Snowball tends to
> overstem words. The "german_normalize" token filter is copied from Snowball
> but works without stemming.
>
> The effect of the combination is that German names such as Jörg, Joerg,
> and Jorg are all reduced to jorg in the index.
>
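To make the effect described above easier to follow, here is a minimal Python sketch of the idea. It is not the code of the ICU or `german_normalize` filters; the function names `fold`, `collapse_digraphs` and `normalize` are made up for illustration. The only place-sensitive rule it uses is the one documented for the Snowball German2 variant ("ue" is left alone after "q"), and assuming that `german_normalize` applies exactly this rule is a guess on my part.

```python
import unicodedata


def fold(token):
    """Rough stand-in for lowercase + ICU folding: lowercase, ß -> ss, drop diacritics."""
    token = token.lower().replace("ß", "ss")
    decomposed = unicodedata.normalize("NFKD", token)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def collapse_digraphs(token):
    """Collapse ae/oe/ue to a/o/u, except 'ue' directly after 'q' (assumed German2 rule)."""
    out = []
    i = 0
    while i < len(token):
        pair = token[i:i + 2]
        after_q = i > 0 and token[i - 1] == "q"
        if pair in ("ae", "oe") or (pair == "ue" and not after_q):
            out.append(pair[0])  # keep only the base vowel
            i += 2
        else:
            out.append(token[i])
            i += 1
    return "".join(out)


def normalize(token):
    """Hypothetical approximation of the 'unstemmed' analyzer chain for one token."""
    return collapse_digraphs(fold(token))


for name in ("Jörg", "Joerg", "Jorg", "Quelle"):
    print(name, "->", normalize(name))
```

Because the same function would run on indexed text and on queries, all three spellings meet at one term. The sketch also shows why a blind replacement is risky: without the "q" exception, "Quelle" would lose its "e".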
> Best,
>
> Jörg
>
>
> On Sun, Nov 30, 2014 at 11:37 AM, Krešimir Slugan wrote:
>
>> Hi Jürgen,
>>
>> Currently we don't have big volumes of data to index, so we would like to
>> return more results in the hope that the relevant ones will still appear
>> near the top. In the future, when we have more data, we'll have to
>> sacrifice some use cases in order to provide more precise results for the
>> rest of the users.
>>
>> I think I will try the regexp token approach to replace umlauts with their
>> "e" forms to solve this double expansion problem.
>>
>> Best,
>>
>> Krešimir
>>
>> On Saturday, November 29, 2014 11:23:47 PM UTC+1, Jürgen Wagner (DVT)
>> wrote:
>>>
>>> Hi Krešimir,
>>> the correct term is "über" (over, above) or "hören" (hear) or "ändern"
>>> (change). When you cannot write umlauts, the correct alternative spelling
>>> in print is "ueber", "hoeren", "aendern". Everybody can write this in
>>> ASCII. However, non-speakers of German who still want to search for
>>> German terms are usually not aware of this and believe it works like
>>> accents in French, where "à" is lexically treated like "a". Those users
>>> are wrong to spell "uber", "horen", "andern", because "u" and "ü" are in
>>> fact different letters. It's like "ll" in Spanish: "ll" is ONE
>>> letter :-)
>>>
>>> However, in order to provide a convenience to those users as well, you
>>> could decide that - to yield at least some meaningful results - you will
>>> also consider the versions without the umlaut dots equivalent. In that
>>> case, you want to map any token containing an umlaut (ä, ö, ü) to three
>>> alternatives: umlaut, without umlaut marker, alternative spelling with 'e'.
>>> This won't let you distinguish between the "Bar" (bar, the place to get a
>>> drink) and "Bär" (bear, the one giving you a great, dangerous hug).
>>> "Forderung" (demand) and "Förderung" (encouragement, facilitation,
>>> promotion, extraction [geol.]) are also quite different, just to give a few
>>> examples.
>>>
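The three-way mapping described above (umlaut, no umlaut marker, 'e' spelling) could be sketched in Python roughly like this. The helper name `umlaut_variants` and the two translation tables are hypothetical, and the sketch only covers ä, ö and ü in both cases.

```python
# Translation tables for the two alternative spellings (illustrative only).
UMLAUT_TO_PLAIN = str.maketrans(
    {"ä": "a", "ö": "o", "ü": "u", "Ä": "A", "Ö": "O", "Ü": "U"}
)
UMLAUT_TO_E = str.maketrans(
    {"ä": "ae", "ö": "oe", "ü": "ue", "Ä": "Ae", "Ö": "Oe", "Ü": "Ue"}
)


def umlaut_variants(token):
    """Return the token, its dot-less form and its 'e' form, without duplicates."""
    variants = [
        token,
        token.translate(UMLAUT_TO_PLAIN),
        token.translate(UMLAUT_TO_E),
    ]
    # dict.fromkeys keeps the first occurrence of each variant, in order.
    return list(dict.fromkeys(variants))


for word in ("Bär", "Förderung", "Bar"):
    print(word, "->", umlaut_variants(word))
```

Note that "Bär" produces "Bar" as one of its variants, which is exactly the collision between bear and bar mentioned above.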
>>> For the proper recognition of those terms, you would normally use a
>>> dictionary of German, including some frequent proper names as well. So, if
>>> you look for "clown boll", you would not only get "Der Clown im Advent -
>>> Evangelische Akademie Bad Boll", but also "Heinrich Böll, Ansichten eines
>>> Clowns", because the query would be transformed into "clown AND (boll OR
>>> boell OR böll)" as "boll" matches an umlaut candidate in your dictionary.
>>> If you dare to normalize your indexed texts, so "Boell" would already have
>>> been turned into "Böll", you could even do with a disjunction of only the
>>> one correct form and the misspelling. Again, however, you would make use of
>>> a dictionary to perform such normalization. Ideally, you would even have a
>>> POS tagger in place, so you would only make such replacements where the
>>> name Böll is referred to, not the city of Bad Boll.
>>>
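As a rough illustration of the dictionary-based query expansion described above, the following Python sketch rewrites "clown boll" into the boolean query from the example. The one-word dictionary, the function names `build_lookup` and `expand_query`, and the plain-string query syntax are all assumptions for illustration; a real setup would use a proper German word list and the search engine's own query DSL.

```python
PLAIN = str.maketrans({"ä": "a", "ö": "o", "ü": "u"})
E_FORM = str.maketrans({"ä": "ae", "ö": "oe", "ü": "ue"})


def build_lookup(dictionary):
    """Map each dot-less and 'e' spelling to the dictionary words it may stand for."""
    lookup = {}
    for word in dictionary:
        for key in (word.translate(PLAIN), word.translate(E_FORM)):
            lookup.setdefault(key, set()).add(word)
    return lookup


def expand_query(query, lookup):
    """Turn a whitespace-separated query into an AND of OR-groups."""
    clauses = []
    for term in query.lower().split():
        alternatives = [term]
        # Add the 'e' spelling and the umlaut form of every dictionary candidate.
        for word in sorted(lookup.get(term, ())):
            for form in (word.translate(E_FORM), word):
                if form not in alternatives:
                    alternatives.append(form)
        if len(alternatives) == 1:
            clauses.append(term)
        else:
            clauses.append("(" + " OR ".join(alternatives) + ")")
    return " AND ".join(clauses)


lookup = build_lookup({"böll"})  # hypothetical one-entry dictionary
print(expand_query("clown boll", lookup))
```

Terms that match no dictionary entry, such as "clown", pass through unchanged, so the expansion only widens the query where an umlaut candidate is actually known. Telling the author Böll apart from the town Bad Boll would still need the POS tagging mentioned above.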
>>> It's a question of how much effort makes sense for your application. If
>>> you just want to index some German text, maybe you just want to turn all
>>> umlauts into the plain vowels for the purpose of indexing, but still keep
>>> the reference to the original for result display. Maybe that's sufficient.
>>> For larger volumes of documents, a more precise approach is recommended to
>>> avoid false positives.
>>>
>>> Cheers,
>>> --Jürgen
>>>
>>>
>>> On 29.11.2014 20:35, Krešimir Slugan wrote:
>>>
>>> Because, as far as I understand, in German it's semantically the same to
>>> write über or ueber (although ueber is used less often). I guess the only
>>> exception is personal names.
>>> Strictly speaking, "uber" is wrong, but users sometimes search for it too.
>>>
>>>
>>> --
>> You received this message because you are subscribed to the Google Groups
>> "elasticsearch" group.
>> To unsubscribe from this group and stop receiving emails from it, send an
>> email to [email protected].
>> To view this discussion on the web visit
>> https://groups.google.com/d/msgid/elasticsearch/7592379b-0c07-4973-a705-f70a388c285d%40googlegroups.com
>>
>> For more options, visit https://groups.google.com/d/optout.
>>
>
>