Re: char_filter for German

[email protected] Wed, 11 Mar 2015 09:34:30 -0700

Use "german_normalization", which is part of ES core.

"german_normalize" is the same filter, which I implemented in my plugin
https://github.com/jprante/elasticsearch-analysis-german/blob/master/src/main/java/org/xbib/elasticsearch/index/analysis/german/GermanAnalysisBinderProcessor.java
when it was not yet available in ES core.
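
For reference, here is a minimal sketch of the "unstemmed" analyzer from the
quoted settings further down, with the core filter name swapped in for the
plugin one. It is written as a Python dict only so that the JSON can be
generated and checked. The analyzer name "unstemmed" comes from the quoted
message, and the two ICU filters assume the ICU plugin is installed.

```python
import json

# Index settings sketch: same filter chain as the quoted "unstemmed" analyzer,
# but ending in the core "german_normalization" token filter.
settings = {
    "index": {
        "analysis": {
            "analyzer": {
                "unstemmed": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": [
                        "lowercase",
                        "icu_normalizer",  # assumes the ICU plugin is installed
                        "icu_folding",     # assumes the ICU plugin is installed
                        "german_normalization",
                    ],
                }
            }
        }
    }
}

# Print the JSON body that would be sent when creating the index.
print(json.dumps(settings, indent=2))
```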


Jörg

On Wed, Mar 11, 2015 at 3:11 PM, Krešimir Slugan <[email protected]>
wrote:

>
> Where is this "german_normalize" filter coming from? It solves my problem
> completely and magically but it's not documented anywhere (and seems like
> it's not part of ICU plugin either).
>
>
>
> What is also weird is that the filter cannot be used in a global context,
> e.g. it's not possible to try something like this:
>
> curl -XGET \
>   'localhost:9200/_analyze?tokenizer=whitespace&filters=lowercase,german_normalize' \
>   -d 'this is a test'
>
> but it is possible to use it in index context:
>
> curl -XGET \
>   'localhost:9200/test_index/_analyze?tokenizer=whitespace&filters=lowercase,german_normalize' \
>   -d 'this is a test'
>
>
> In the first case I get "ElasticsearchIllegalArgumentException[failed to
> find global token filter under [german_normalize]]".
>
>
> On Sunday, November 30, 2014 at 5:20:16 PM UTC+1, Jörg Prante wrote:
>
>> Do not use regex; it will give wrong results.
>>
>> Elasticsearch comes with full support for German umlaut handling.
>>
>> If you install the ICU plugin, you can use something like this analysis
>> setting:
>>
>> {
>>     "index" : {
>>         "analysis" : {
>>             "filter" : {
>>                 "german_normalize_stem" : {
>>                     "type" : "snowball",
>>                     "name" : "German2"
>>                 }
>>             },
>>             "analyzer" : {
>>                 "stemmed" : {
>>                     "type" : "custom",
>>                     "tokenizer" : "standard",
>>                     "filter" : [
>>                         "lowercase",
>>                         "icu_normalizer",
>>                         "icu_folding",
>>                         "german_normalize_stem"
>>                     ]
>>                 },
>>                 "unstemmed" : {
>>                     "type" : "custom",
>>                     "tokenizer" : "standard",
>>                     "filter" : [
>>                         "lowercase",
>>                         "icu_normalizer",
>>                         "icu_folding",
>>                         "german_normalize"
>>                     ]
>>                 }
>>             }
>>         }
>>     }
>> }
>>
>> ICU handles German umlauts, and also case folding like "ss" and "ß".
>>
>> Snowball handles umlaut expansions (ae, oe, ue) at the right places in
>> words.
>>
>> You can choose between stemmed and unstemmed analysis. Snowball tends to
>> overstem words. The "german_normalize" token filter is copied from Snowball
>> but works without stemming.
>>
>> The effect of the combination is that German spelling variants such as
>> Jörg, Joerg, and Jorg are all reduced to jorg in the index.
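
To make the reduction above concrete, here is a simplified sketch of
German2-style normalization heuristics: "ß" becomes "ss"; "ä", "ö", "ü" become
"a", "o", "u"; "ae" and "oe" become "a" and "o"; and "ue" becomes "u" unless
it follows a vowel or "q". This is an illustration under those stated rules,
not the actual Lucene or Snowball code, and the function name
german_normalize_sketch is made up for this example. It expects tokens that
are already lowercased.

```python
def german_normalize_sketch(token: str) -> str:
    """Fold umlauts and ae/oe/ue spellings in an already lowercased token."""
    out = []
    # States: "N" = ordinary, "U" = a directly following 'e' is dropped,
    # "V" = a directly following 'e' is kept.
    state = "N"
    for ch in token:
        if ch in "ao":
            out.append(ch)
            state = "U"
        elif ch == "u":
            out.append(ch)
            # 'ue' is only folded when the 'u' does not follow a vowel or 'q'.
            state = "U" if state == "N" else "V"
        elif ch == "e":
            if state != "U":
                out.append(ch)
            state = "V"
        elif ch in "iqy":
            out.append(ch)
            state = "V"
        elif ch in "äöü":
            out.append({"ä": "a", "ö": "o", "ü": "u"}[ch])
            state = "V"
        elif ch == "ß":
            out.append("ss")
            state = "N"
        else:
            out.append(ch)
            state = "N"
    return "".join(out)


for word in ["jörg", "joerg", "jorg", "straße", "quelle"]:
    print(word, "->", german_normalize_sketch(word))
```

Note that "quelle" is left alone, because its "ue" follows a "q".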
>>
>> Best,
>>
>> Jörg
>>
>>
>> On Sun, Nov 30, 2014 at 11:37 AM, Krešimir Slugan <[email protected]>
>> wrote:
>>
>>> Hi Jürgen,
>>>
>>> Currently we don't have large volumes of data to index, so we would
>>> rather return more results, in the hope that the relevant ones still show
>>> up at the top. In the future, when we have more data, we'll have to
>>> sacrifice some use cases in order to provide more precise results for the
>>> rest of our users.
>>>
>>> I think I will try the regexp token approach and replace umlauts with
>>> their "e" forms to solve this double expansion problem.
>>>
>>> Best,
>>>
>>> Krešimir
>>>
>>> On Saturday, November 29, 2014 11:23:47 PM UTC+1, Jürgen Wagner (DVT)
>>> wrote:
>>>>
>>>>  Hi Krešimir,
>>>>   the correct term is "über" (over, above) or "hören" (hear) or
>>>> "ändern" (change). When you cannot write umlauts, the correct alternative
>>>> spelling in print is "ueber", "hoeren", "aendern". Everybody can write this
>>>> in ASCII. However, those who are possibly non-speakers of German who still
>>>> want to search for German terms are usually not aware of this and believe
>>>> it's like with accents in French, where "à" is lexically treated like "a".
>>>> Those users are wrong in spelling "uber", "horen", "andern" because "u" and
>>>> "ü" are in fact different letters. It's like "ll" in Spanish. "ll" is ONE
>>>> letter :-)
>>>>
>>>> However, in order to provide a convenience to those users as well,  you
>>>> could decide that - to yield at least some meaningful results - you will
>>>> also consider the versions without the umlaut dots equivalent. In that
>>>> case, you want to map any token containing an umlaut (ä, ö, ü) to three
>>>> alternatives: umlaut, without umlaut marker, alternative spelling with 'e'.
>>>> This won't let you distinguish between the "Bar" (bar, the place to get a
>>>> drink) and "Bär" (bear, the one giving you a great, dangerous hug).
>>>> "Forderung" (demand) and "Förderung" (encouragement, facilitation,
>>>> promotion, extraction [geol.]) are also quite different, just to give a few
>>>> examples.
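
The three-way mapping described above can be sketched as follows. This is
only an illustration: the function name umlaut_variants and the two lookup
tables are made up for this example, and it assumes tokens have already been
lowercased.

```python
# Lookup tables for the two alternative spellings of each umlaut.
UMLAUT_PLAIN = {"ä": "a", "ö": "o", "ü": "u"}
UMLAUT_E = {"ä": "ae", "ö": "oe", "ü": "ue"}


def umlaut_variants(token: str) -> list:
    """Return the token plus its plain-vowel and 'e'-spelling variants."""
    if not any(ch in UMLAUT_PLAIN for ch in token):
        # No umlaut in the token, so there is nothing to expand.
        return [token]
    plain = "".join(UMLAUT_PLAIN.get(ch, ch) for ch in token)
    with_e = "".join(UMLAUT_E.get(ch, ch) for ch in token)
    return [token, plain, with_e]


print(umlaut_variants("bär"))
print(umlaut_variants("förderung"))
```

Indexing all three variants is exactly what makes "Bär" and "Bar"
indistinguishable at search time.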
>>>>
>>>> For the proper recognition of those terms, you would normally use a
>>>> dictionary of German, including some frequent proper names as well. So, if
>>>> you look for "clown boll", you would not only get "Der Clown im Advent -
>>>> Evangelische Akademie Bad Boll", but also "Heinrich Böll, Ansichten eines
>>>> Clowns", because the query would be transformed into "clown AND (boll OR
>>>> boell OR böll)" as "boll" matches an umlaut candidate in your dictionary.
>>>> If you dare to normalize your indexed texts, so "Boell" would already have
>>>> been turned into "Böll", you could even do with a disjunction of only the
>>>> one correct form and the misspelling. Again, however, you would make use of
>>>> a dictionary to perform such normalization. Ideally, you would even have a
>>>> POS tagger in place, so you would only make such replacements where the
>>>> name Böll is referred to, not the city of Bad Boll.
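
A rough sketch of the dictionary-based query expansion described above. The
one-entry dictionary, the function names, and the plain AND/OR string syntax
are all assumptions made for illustration; a real setup would use a proper
German dictionary and the query DSL of the search engine.

```python
# Hypothetical mini-dictionary: plain spelling -> known umlaut form.
UMLAUT_DICTIONARY = {"boll": "böll"}
E_SPELLING = {"ä": "ae", "ö": "oe", "ü": "ue"}


def expand_term(term: str) -> str:
    """Expand a term to a disjunction if the dictionary has an umlaut candidate."""
    umlaut_form = UMLAUT_DICTIONARY.get(term)
    if umlaut_form is None:
        return term
    e_form = "".join(E_SPELLING.get(ch, ch) for ch in umlaut_form)
    return "(" + " OR ".join([term, e_form, umlaut_form]) + ")"


def expand_query(query: str) -> str:
    """Lowercase the query, expand each term, and join the terms with AND."""
    return " AND ".join(expand_term(t) for t in query.lower().split())


print(expand_query("clown boll"))
```

With the dictionary above, "clown" passes through unchanged and only "boll"
is expanded.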
>>>>
>>>> It's a question of how much effort makes sense for your application. If
>>>> you just want to index some German text, maybe you just want to turn all
>>>> umlauts into the plain vowels for the purpose of indexing, but still keep
>>>> the reference to the original for result display. Maybe that's sufficient.
>>>> For larger volumes of documents, a more precise approach is recommended to
>>>> avoid false positives.
>>>>
>>>> Cheers,
>>>> --Jürgen
>>>>
>>>>
>>>> On 29.11.2014 20:35, Krešimir Slugan wrote:
>>>>
>>>> Because, as far as I understand, in German it's semantically the same
>>>> to write über or ueber (although ueber is used less often). I guess the
>>>> only exception is personal names.
>>>> Orthographically, "uber" is wrong, but users sometimes search for it too.
>>>>
>>>>
>>>> --
>>>>
>>>> Mit freundlichen Grüßen/Kind regards/Cordialement vôtre/Atentamente/С
>>>> уважением
>>>> *i.A. Jürgen Wagner*
>>>> Head of Competence Center "Intelligence"
>>>> & Senior Cloud Consultant
>>>>
>>>> Devoteam GmbH, Industriestr. 3, 70565 Stuttgart, Germany
>>>> Phone: +49 6151 868-8725, Fax: +49 711 13353-53, Mobile: +49 171 864
>>>> 1543
>>>> E-Mail: [email protected], URL: www.devoteam.de
>>>> ------------------------------
>>>> Managing Board: Jürgen Hatzipantelis (CEO)
>>>> Address of Record: 64331 Weiterstadt, Germany; Commercial Register:
>>>> Amtsgericht Darmstadt HRB 6450; Tax Number: DE 172 993 071
>>>>
>>>>
>>>>    --
>>> You received this message because you are subscribed to the Google
>>> Groups "elasticsearch" group.
>>> To unsubscribe from this group and stop receiving emails from it, send
>>> an email to [email protected].
>>> To view this discussion on the web visit
>>> https://groups.google.com/d/msgid/elasticsearch/7592379b-0c07-4973-a705-f70a388c285d%40googlegroups.com.
>>>
>>> For more options, visit https://groups.google.com/d/optout.
>>>
>>

