Hello Uwe,

Thanks for getting rid of the compounds. The dictionary could still be smaller, though: it has about 1,500 duplicates, and it is also unsorted.
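For what it's worth, sorting and deduplicating the word list is a quick one-off job. Here is a rough sketch in Java (the input path is just the one from your Elasticsearch settings below; the output file name is made up):

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.TreeSet;

public class DedupeWordList {
  public static void main(String[] args) throws IOException {
    Path in = Paths.get("analysis/dictionary-de.txt");          // word list from the settings below
    Path out = Paths.get("analysis/dictionary-de.sorted.txt");  // hypothetical output file
    // TreeSet sorts lexicographically and drops exact duplicates in one pass
    TreeSet<String> unique = new TreeSet<>();
    for (String line : Files.readAllLines(in, StandardCharsets.UTF_8)) {
      String word = line.trim();
      if (!word.isEmpty()) {
        unique.add(word);
      }
    }
    Files.write(out, unique, StandardCharsets.UTF_8);
  }
}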
Regards,
Markus

-----Original message-----
> From: Uwe Schindler <u...@thetaphi.de>
> Sent: Saturday 16th September 2017 12:16
> To: java-user@lucene.apache.org
> Subject: RE: German decompounding/tokenization with Lucene?
>
> Hi,
>
> I published my work on GitHub:
>
> https://github.com/uschindler/german-decompounder
>
> Have fun. I am not yet 100% sure about the license of the data file. The original
> author (Björn Jacke) did not publish any license, but LibreOffice publishes his
> files under the LGPL. So to be safe, I applied the same license to my own work.
>
> Uwe
>
> -----
> Uwe Schindler
> Achterdiek 19, D-28357 Bremen
> http://www.thetaphi.de
> eMail: u...@thetaphi.de
>
> > -----Original Message-----
> > From: Uwe Schindler [mailto:u...@thetaphi.de]
> > Sent: Saturday, September 16, 2017 9:49 AM
> > To: java-user@lucene.apache.org
> > Subject: RE: German decompounding/tokenization with Lucene?
> >
> > Hi Michael,
> >
> > I had this issue just yesterday. I have done this several times and have built a
> > good dictionary in the meantime.
> >
> > I have an example for Solr or Elasticsearch with the same data. It uses the
> > HyphenationCompoundWordTokenFilter, but with the hyphenation ZIP file *and* the
> > dictionary (it's important to have both). The dictionary-only approach is just
> > too slow and creates wrong matches, too.
> >
> > The rules file is the one from the OpenOffice hyphenation files. Just take it as
> > is (keep in mind that you need to use the "old" version of the ZIP file, not the
> > latest version, as the XML format was changed). The dictionary is more important:
> > it should contain only the "single words", no compounds at all. This is hard to
> > get, but there is a ngerman98.zip file available with an ispell dictionary
> > (https://www.j3e.de/ispell/igerman98/). This dictionary has several variants,
> > one of which contains only the single non-compound words (about 17,000 items).
> > This works for most cases. I converted the dictionary a bit, merged some files,
> > and finally lowercased it, and now I have a working solution.
> >
> > The settings for the hyphenation compound filter are (Elasticsearch):
> >
> > "german_decompounder": {
> >   "type": "hyphenation_decompounder",
> >   "word_list_path": "analysis/dictionary-de.txt",
> >   "hyphenation_patterns_path": "analysis/de_DR.xml",
> >   "only_longest_match": true,
> >   "min_subword_size": 4
> > },
> >
> > The "only_longest_match" setting is important, because our dictionary only
> > contains "single words" (plus some words that look like compounds but aren't,
> > because they were glued together; compare English, where "policeman" is not
> > written "police man" because it is a word of its own). So the longest match is
> > always safe, as we have a well-maintained dictionary.
> >
> > If you are interested, I can send you a ZIP file with both files. Maybe I should
> > check them into GitHub, but I have to check the licenses first.
> >
> > Uwe
> >
> > -----
> > Uwe Schindler
> > Achterdiek 19, D-28357 Bremen
> > http://www.thetaphi.de
> > eMail: u...@thetaphi.de
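For plain Lucene (Mike's original question is quoted below), the same setup can be wired up directly with HyphenationCompoundWordTokenFilter. The following is only a rough sketch, not the exact code from this thread: the file names are taken from the settings above, the analyzer class is invented for illustration, and some package locations differ slightly between Lucene versions.

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.CharArraySet;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.compound.CompoundWordTokenFilterBase;
import org.apache.lucene.analysis.compound.HyphenationCompoundWordTokenFilter;
import org.apache.lucene.analysis.compound.hyphenation.HyphenationTree;
import org.apache.lucene.analysis.standard.StandardTokenizer;

public final class GermanDecompoundAnalyzer extends Analyzer {

  private final HyphenationTree hyphenator;
  private final CharArraySet dictionary;

  public GermanDecompoundAnalyzer(HyphenationTree hyphenator, CharArraySet dictionary) {
    this.hyphenator = hyphenator;
    this.dictionary = dictionary;
  }

  @Override
  protected TokenStreamComponents createComponents(String fieldName) {
    Tokenizer source = new StandardTokenizer();
    TokenStream result = new LowerCaseFilter(source);
    // decompound against the hyphenation grammar, keeping only parts found in the dictionary
    result = new HyphenationCompoundWordTokenFilter(
        result, hyphenator, dictionary,
        CompoundWordTokenFilterBase.DEFAULT_MIN_WORD_SIZE,
        4,     // minSubwordSize, as in "min_subword_size": 4 above
        CompoundWordTokenFilterBase.DEFAULT_MAX_SUBWORD_SIZE,
        true); // onlyLongestMatch, as in "only_longest_match": true above
    return new TokenStreamComponents(source, result);
  }

  public static GermanDecompoundAnalyzer load(String patternsFile, String dictionaryFile) throws IOException {
    // the hyphenation grammar XML (e.g. de_DR.xml) and the lowercased word list
    HyphenationTree tree = HyphenationCompoundWordTokenFilter.getHyphenationTree(patternsFile);
    CharArraySet words = new CharArraySet(
        Files.readAllLines(Paths.get(dictionaryFile), StandardCharsets.UTF_8), true);
    return new GermanDecompoundAnalyzer(tree, words);
  }
}

GermanDecompoundAnalyzer.load("analysis/de_DR.xml", "analysis/dictionary-de.txt") then yields an Analyzer that can be handed to an IndexWriterConfig as usual.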
> > > -----Original Message-----
> > > From: Michael McCandless [mailto:luc...@mikemccandless.com]
> > > Sent: Saturday, September 16, 2017 12:58 AM
> > > To: Lucene Users <java-user@lucene.apache.org>
> > > Subject: German decompounding/tokenization with Lucene?
> > >
> > > Hello,
> > >
> > > I need to index documents with German text in Lucene, and I'm wondering
> > > how people have done this in the past?
> > >
> > > Lucene already has a DictionaryCompoundWordTokenFilter ... is this what
> > > people use? Are there good, open-source-friendly German dictionaries
> > > available?
> > >
> > > Thanks,
> > >
> > > Mike McCandless
> > >
> > > http://blog.mikemccandless.com
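Continuing the sketch above, a quick way to check what such an analyzer emits for a sample compound is to consume its TokenStream directly. Which subwords actually appear depends entirely on the dictionary and the hyphenation grammar in use; the file names and the sample word are only illustrative.

import java.io.IOException;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class DecompoundDemo {
  public static void main(String[] args) throws IOException {
    Analyzer analyzer = GermanDecompoundAnalyzer.load(
        "analysis/de_DR.xml", "analysis/dictionary-de.txt");
    // feed a sample compound through the analysis chain and print every token it produces
    try (TokenStream ts = analyzer.tokenStream("body", "Versicherungsgesellschaft")) {
      CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
      ts.reset();
      while (ts.incrementToken()) {
        System.out.println(term.toString());
      }
      ts.end();
    }
    analyzer.close();
  }
}

The compound filter keeps the original token and adds the dictionary-confirmed parts at the same position, so both the full compound and its parts become searchable.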