Hi,

I published my work on GitHub:

https://github.com/uschindler/german-decompounder

Have fun. I am not yet 100% sure about the license of the data file. The
original author (Björn Jacke) did not publish any license, but LibreOffice
publishes his files under LGPL. So to be safe, I applied the same license to
my own work.

Uwe

-----
Uwe Schindler
Achterdiek 19, D-28357 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de

> -----Original Message-----
> From: Uwe Schindler [mailto:u...@thetaphi.de]
> Sent: Saturday, September 16, 2017 9:49 AM
> To: java-user@lucene.apache.org
> Subject: RE: German decompounding/tokenization with Lucene?
> 
> Hi Michael,
> 
> I dealt with exactly this issue just yesterday. I have done this several times
> and built a good dictionary in the meantime.
> 
> I have an example for Solr or Elasticsearch with the same data. It uses the
> HyphenationCompoundWordTokenFilter, but with the hyphenation rules file *and* a
> dictionary (it's important to have both). The purely dictionary-based approach
> is just too slow and produces wrong matches, too.
> 
> The rules file is the one from the OpenOffice hyphenation files. Just take it
> as is (keep in mind that you need to use the "old" version ZIP file, not the
> latest version, as the XML format was changed). The dictionary is more
> important: it should only contain the "single words", no compounds at all. This
> is hard to get, but there is a ngerman98.zip file available with an ispell
> dictionary (https://www.j3e.de/ispell/igerman98/). This dictionary has several
> variants, one of which only contains the single non-compound words (about
> 17,000 items). This works for most cases. I converted the dictionary a bit,
> merged some files, and finally lowercased it, and now I have a working
> solution.
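> 
> Roughly, that conversion boils down to something like the following (just a
> sketch, not the actual script I used; it assumes the dictionary variants have
> already been expanded to plain one-word-per-line text files, and the file
> names are placeholders):
> 
>     // Sketch only: merge several plain-text word lists, lowercase, dedupe,
>     // and write a single dictionary file for the decompounder.
>     import java.io.IOException;
>     import java.nio.charset.StandardCharsets;
>     import java.nio.file.Files;
>     import java.nio.file.Paths;
>     import java.util.Locale;
>     import java.util.TreeSet;
> 
>     public class BuildDictionary {
>       public static void main(String[] args) throws IOException {
>         TreeSet<String> words = new TreeSet<>();
>         for (String arg : args) { // e.g. the expanded igerman98 word lists
>           for (String line : Files.readAllLines(Paths.get(arg), StandardCharsets.UTF_8)) {
>             String w = line.trim();
>             if (!w.isEmpty() && !w.startsWith("#")) {
>               words.add(w.toLowerCase(Locale.GERMAN)); // dictionary must be lowercased
>             }
>           }
>         }
>         Files.write(Paths.get("dictionary-de.txt"), words, StandardCharsets.UTF_8);
>       }
>     }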
> 
> The settings for the hyphenation decompounder filter are (Elasticsearch):
> 
>             "german_decompounder": {
>                "type": "hyphenation_decompounder",
>                "word_list_path": "analysis/dictionary-de.txt",
>                "hyphenation_patterns_path": "analysis/de_DR.xml",
>                "only_longest_match": true,
>                "min_subword_size": 4
>             },
> 
> The "only_longest_match" setting is important, because our dictionary is
> guaranteed to contain only "single words" (plus some words that look like
> compounds but aren't, because they have grown together into words of their own;
> compare English: "policeman" is not written "police man" because it is a word
> in its own right). So the longest match is always safe, as we have a
> well-maintained dictionary.
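> 
> For plain Lucene (outside Solr/Elasticsearch), the equivalent chain looks
> roughly like this (just a sketch against the HyphenationCompoundWordTokenFilter
> API in analyzers-common, using the same file names and settings as above; the
> analyzer class itself is made up for illustration):
> 
>     // Sketch: Lucene analyzer with hyphenation-based decompounding.
>     // Assumes lucene-core and lucene-analyzers-common (6.x/7.x) on the classpath.
>     import java.io.BufferedReader;
>     import java.io.IOException;
>     import java.nio.charset.StandardCharsets;
>     import java.nio.file.Files;
>     import java.nio.file.Paths;
> 
>     import org.apache.lucene.analysis.Analyzer;
>     import org.apache.lucene.analysis.CharArraySet;
>     import org.apache.lucene.analysis.TokenStream;
>     import org.apache.lucene.analysis.Tokenizer;
>     import org.apache.lucene.analysis.compound.HyphenationCompoundWordTokenFilter;
>     import org.apache.lucene.analysis.compound.hyphenation.HyphenationTree;
>     import org.apache.lucene.analysis.core.LowerCaseFilter;
>     import org.apache.lucene.analysis.standard.StandardTokenizer;
> 
>     public class GermanDecompoundAnalyzer extends Analyzer {
>       private final HyphenationTree hyphenator;
>       private final CharArraySet dictionary;
> 
>       public GermanDecompoundAnalyzer() throws IOException {
>         // de_DR.xml: the "old"-format OpenOffice hyphenation grammar
>         hyphenator = HyphenationCompoundWordTokenFilter.getHyphenationTree("analysis/de_DR.xml");
>         // dictionary-de.txt: the lowercased list of non-compound words
>         dictionary = new CharArraySet(16, true);
>         try (BufferedReader r = Files.newBufferedReader(
>             Paths.get("analysis/dictionary-de.txt"), StandardCharsets.UTF_8)) {
>           String line;
>           while ((line = r.readLine()) != null) {
>             if (!line.trim().isEmpty()) {
>               dictionary.add(line.trim());
>             }
>           }
>         }
>       }
> 
>       @Override
>       protected TokenStreamComponents createComponents(String fieldName) {
>         Tokenizer source = new StandardTokenizer();
>         TokenStream result = new LowerCaseFilter(source);
>         // minWordSize=5, minSubwordSize=4, maxSubwordSize=15, onlyLongestMatch=true,
>         // mirroring the Elasticsearch settings above
>         result = new HyphenationCompoundWordTokenFilter(result, hyphenator, dictionary,
>             5, 4, 15, true);
>         return new TokenStreamComponents(source, result);
>       }
>     }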
> 
> If you are interested, I can send you a ZIP file with both files. Maybe I
> should check them into GitHub, but I have to check the licenses first.
> 
> Uwe
> 
> -----
> Uwe Schindler
> Achterdiek 19, D-28357 Bremen
> http://www.thetaphi.de
> eMail: u...@thetaphi.de
> 
> > -----Original Message-----
> > From: Michael McCandless [mailto:luc...@mikemccandless.com]
> > Sent: Saturday, September 16, 2017 12:58 AM
> > To: Lucene Users <java-user@lucene.apache.org>
> > Subject: German decompounding/tokenization with Lucene?
> >
> > Hello,
> >
> > I need to index documents with German text in Lucene, and I'm wondering how
> > people have done this in the past?
> >
> > Lucene already has a DictionaryCompoundWordTokenFilter ... is this what
> > people use?  Are there good, open-source friendly German dictionaries
> > available?
> >
> > Thanks,
> >
> > Mike McCandless
> >
> > http://blog.mikemccandless.com
> 


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org
