Re: char_filter for German

Jürgen Wagner (DVT) Sat, 29 Nov 2014 14:23:58 -0800

Hi Krešimir,
  the correct spellings are "über" (over, above), "hören" (hear) and
"ändern" (change). When you cannot write umlauts, the correct
alternative spelling in print is "ueber", "hoeren", "aendern". Everybody
can write this in ASCII. However, people who do not speak German but
still want to search for German terms are usually not aware of this and
believe it works like accents in French, where "à" is lexically treated
like "a". Those users are wrong in spelling "uber", "horen", "andern"
because "u" and "ü" are in fact different letters. It's like "ll" in
Spanish, which was traditionally counted as ONE letter :-)

However, as a convenience to those users, you could decide that - to
yield at least some meaningful results - you will also treat the
versions without the umlaut dots as equivalent. In that case, you want
to map any token containing an umlaut (ä, ö, ü) to three alternatives:
the umlaut form, the form without the umlaut dots, and the alternative
spelling with 'e'. The price is that you can no longer distinguish
between "Bar" (bar, the place to get a drink) and "Bär" (bear, the one
giving you a great, dangerous hug). "Forderung" (demand) and "Förderung"
(encouragement, facilitation, promotion, extraction [geol.]) are also
quite different, just to give a few examples.
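
To make the three-way mapping concrete, here is a minimal Python sketch.
It is only an illustration, not an Elasticsearch analyzer: the name
umlaut_variants and the two lookup tables are made up for this example,
and ß is deliberately left alone.

```python
import unicodedata

# Illustrative tables (assumptions for this sketch, not an Elasticsearch API)
UMLAUT_TO_PLAIN = {"ä": "a", "ö": "o", "ü": "u", "Ä": "A", "Ö": "O", "Ü": "U"}
UMLAUT_TO_E_FORM = {"ä": "ae", "ö": "oe", "ü": "ue", "Ä": "Ae", "Ö": "Oe", "Ü": "Ue"}


def umlaut_variants(token):
    """Return the token, its dot-less form and its 'e' form, without duplicates."""
    # Compose "u" + combining diaeresis into a single "ü" before the lookup.
    token = unicodedata.normalize("NFC", token)
    plain = "".join(UMLAUT_TO_PLAIN.get(ch, ch) for ch in token)
    e_form = "".join(UMLAUT_TO_E_FORM.get(ch, ch) for ch in token)
    variants = []
    for variant in (token, plain, e_form):
        if variant not in variants:
            variants.append(variant)
    return variants


print(umlaut_variants("über"))   # ['über', 'uber', 'ueber']
print(umlaut_variants("Bär"))    # ['Bär', 'Bar', 'Baer']
print(umlaut_variants("clown"))  # ['clown']
```

Note that "Bär" now produces "Bar" as a variant, which is exactly the
ambiguity described above.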

For the proper recognition of those terms, you would normally use a
dictionary of German, including some frequent proper names as well. So,
if you look for "clown boll", you would not only get "Der Clown im
Advent - Evangelische Akademie Bad Boll", but also "Heinrich Böll,
Ansichten eines Clowns", because the query would be transformed into
"clown AND (boll OR boell OR böll)" as "boll" matches an umlaut
candidate in your dictionary. If you dare to normalize your indexed
texts, so that "Boell" has already been turned into "Böll", you could
even make do with a disjunction of only the one correct form and the
misspelling. Again, however, you would make use of a dictionary to
perform such normalization. Ideally, you would even have a POS tagger in
place, so you would only make such replacements where the name Böll is
referred to, not the city of Bad Boll.
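
The dictionary-driven query rewrite could be sketched in Python roughly
as follows. This is a toy under stated assumptions: UMLAUT_WORDS stands
in for a real German dictionary, build_lookup and expand_query are
hypothetical helper names, and there is no POS tagging, so "Bad Boll"
and "Böll" are not told apart.

```python
# Translation tables for the two ASCII spellings (lowercase only in this sketch)
PLAIN = str.maketrans({"ä": "a", "ö": "o", "ü": "u"})
E_FORM = str.maketrans({"ä": "ae", "ö": "oe", "ü": "ue"})

# Hypothetical mini dictionary of lowercase words and names that contain umlauts
UMLAUT_WORDS = {"böll", "bär", "förderung"}


def build_lookup(words):
    """Map every spelling of a dictionary word to the list of all its spellings."""
    lookup = {}
    for word in words:
        forms = [word.translate(PLAIN), word.translate(E_FORM), word]
        for form in forms:
            # Simplification: a later word with the same folded form overwrites
            # an earlier one.
            lookup[form] = forms
    return lookup


def expand_query(query, lookup):
    """Turn whitespace-separated terms into an AND query with OR groups."""
    parts = []
    for term in query.lower().split():
        forms = lookup.get(term)
        if forms:
            parts.append("(" + " OR ".join(forms) + ")")
        else:
            parts.append(term)
    return " AND ".join(parts)


lookup = build_lookup(UMLAUT_WORDS)
print(expand_query("clown boll", lookup))
# clown AND (boll OR boell OR böll)
```

Only terms that match an umlaut candidate in the dictionary are
expanded; "clown" passes through unchanged.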

It's a question of how much effort makes sense for your application. If
you just want to index some German text, maybe you just want to turn all
umlauts into the plain vowels for the purpose of indexing, but still
keep the reference to the original for result display. Maybe that's
sufficient. For larger volumes of documents, a more precise approach is
recommended to avoid false positives.
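
As a rough Python illustration of the simple approach (fold umlauts for
the index key, keep the original text for display), assuming a toy
in-memory index; fold, build_index and search are names invented for
this sketch and have nothing to do with the Elasticsearch API:

```python
from collections import defaultdict

# Fold umlauts to plain vowels; applied after lowercasing, so Ä/Ö/Ü are covered too
FOLD = str.maketrans({"ä": "a", "ö": "o", "ü": "u"})


def fold(token):
    """Lowercase a token and replace umlauts with their plain vowels."""
    return token.lower().translate(FOLD)


def build_index(documents):
    """Map each folded token to the ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(documents):
        for token in text.split():
            index[fold(token.strip('.,;:!?"'))].add(doc_id)
    return index


def search(term, index, documents):
    """Return the original, unfolded documents that match the folded term."""
    return [documents[i] for i in sorted(index.get(fold(term), ()))]


docs = ["Der Bär schläft.", "Die Bar ist offen."]
index = build_index(docs)
print(search("bar", index, docs))
# ['Der Bär schläft.', 'Die Bar ist offen.']
```

The search for "bar" returns both documents with their original
spelling intact, which also shows the false positive you buy with this
shortcut.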

Cheers,
--Jürgen

On 29.11.2014 20:35, Krešimir Slugan wrote:
> Because, as far as I understand, in German it's semantically the same
> to write über or ueber (although ueber is less often used). I guess
> this is not true only for personal names.
> Syntactically, "uber" is wrong but users sometimes search for this also.
>
