Hi, I deduped it. Thanks for the hint!
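Such a sort-and-dedup pass is only a few lines; a minimal Java sketch of
what it amounts to (file names are placeholders):

    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.TreeSet;

    public class DedupWordList {
      public static void main(String[] args) throws IOException {
        // A TreeSet keeps the entries sorted and drops duplicates on insert.
        TreeSet<String> words = new TreeSet<>();
        for (String line : Files.readAllLines(
            Paths.get("dictionary-de.txt"), StandardCharsets.UTF_8)) {
          String word = line.trim();
          if (!word.isEmpty()) {
            words.add(word);
          }
        }
        Files.write(Paths.get("dictionary-de.dedup.txt"), words,
            StandardCharsets.UTF_8);
      }
    }
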
Uwe

-----
Uwe Schindler
Achterdiek 19, D-28357 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de

> -----Original Message-----
> From: Uwe Schindler [mailto:u...@thetaphi.de]
> Sent: Saturday, September 16, 2017 12:51 PM
> To: java-user@lucene.apache.org
> Subject: RE: German decompounding/tokenization with Lucene?
>
> OK, sorting and deduping should be easy with a simple command line. The
> reason is that the list was created from two of Björn Jacke's data files.
> I thought that I had deduped it...
>
> Uwe
>
> On 16 September 2017 at 12:46:29 CEST, Markus Jelsma
> <markus.jel...@openindex.io> wrote:
> >Sorry, I would if I were on GitHub, but I am not.
> >
> >Thanks again!
> >Markus
> >
> >-----Original message-----
> >> From: Uwe Schindler <u...@thetaphi.de>
> >> Sent: Saturday 16th September 2017 12:45
> >> To: java-user@lucene.apache.org
> >> Subject: RE: German decompounding/tokenization with Lucene?
> >>
> >> Send a pull request. :)
> >>
> >> Uwe
> >>
> >> On 16 September 2017 at 12:42:30 CEST, Markus Jelsma
> >> <markus.jel...@openindex.io> wrote:
> >> >Hello Uwe,
> >> >
> >> >Thanks for getting rid of the compounds. The dictionary could still be
> >> >smaller: it has about 1,500 duplicates. It is also unsorted.
> >> >
> >> >Regards,
> >> >Markus
> >> >
> >> >-----Original message-----
> >> >> From: Uwe Schindler <u...@thetaphi.de>
> >> >> Sent: Saturday 16th September 2017 12:16
> >> >> To: java-user@lucene.apache.org
> >> >> Subject: RE: German decompounding/tokenization with Lucene?
> >> >>
> >> >> Hi,
> >> >>
> >> >> I published my work on GitHub:
> >> >>
> >> >> https://github.com/uschindler/german-decompounder
> >> >>
> >> >> Have fun. I am not yet 100% sure about the license of the data file.
> >> >> The original author (Björn Jacke) did not publish any license, but
> >> >> LibreOffice publishes his files under the LGPL. So, to be safe, I
> >> >> applied the same license to my own work.
> >> >>
> >> >> Uwe
> >> >>
> >> >> -----
> >> >> Uwe Schindler
> >> >> Achterdiek 19, D-28357 Bremen
> >> >> http://www.thetaphi.de
> >> >> eMail: u...@thetaphi.de
> >> >>
> >> >> > -----Original Message-----
> >> >> > From: Uwe Schindler [mailto:u...@thetaphi.de]
> >> >> > Sent: Saturday, September 16, 2017 9:49 AM
> >> >> > To: java-user@lucene.apache.org
> >> >> > Subject: RE: German decompounding/tokenization with Lucene?
> >> >> >
> >> >> > Hi Michael,
> >> >> >
> >> >> > I had this issue just yesterday. I have done this several times and
> >> >> > have built a good dictionary in the meantime.
> >> >> >
> >> >> > I have an example for Solr or Elasticsearch with the same data. It
> >> >> > uses the HyphenationCompoundWordTokenFilter, but with the hyphenation
> >> >> > ZIP file *and* a dictionary (it's important to have both). The
> >> >> > dictionary-only approach is just too slow, and it creates wrong
> >> >> > matches, too.
> >> >> >
> >> >> > The rules file is the one from the OpenOffice hyphenation files. Just
> >> >> > take it as is (keep in mind that you need to use the "old" version of
> >> >> > the ZIP file, not the latest version, as the XML format was changed).
> >> >> > The dictionary is more important: it should contain only the "single
> >> >> > words", no compounds at all. This is hard to get, but there is an
> >> >> > ngerman98.zip file available with an ispell dictionary
> >> >> > (https://www.j3e.de/ispell/igerman98/). This dictionary comes in
> >> >> > several variants, one of which contains only the simple, non-compound
> >> >> > words (about 17,000 items).
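> >> >> >
> >> >> > In plain Lucene, the two resources can be loaded roughly like this
> >> >> > (a minimal, untested sketch for recent Lucene versions; the file
> >> >> > paths are placeholders):
> >> >> >
> >> >> >     import java.io.BufferedReader;
> >> >> >     import java.nio.charset.StandardCharsets;
> >> >> >     import java.nio.file.Files;
> >> >> >     import java.nio.file.Paths;
> >> >> >
> >> >> >     import org.apache.lucene.analysis.CharArraySet;
> >> >> >     import org.apache.lucene.analysis.WordlistLoader;
> >> >> >     import org.apache.lucene.analysis.compound.HyphenationCompoundWordTokenFilter;
> >> >> >     import org.apache.lucene.analysis.compound.hyphenation.HyphenationTree;
> >> >> >
> >> >> >     // de_DR.xml is the hyphenation grammar extracted from the OpenOffice ZIP.
> >> >> >     HyphenationTree hyphenator =
> >> >> >         HyphenationCompoundWordTokenFilter.getHyphenationTree("analysis/de_DR.xml");
> >> >> >
> >> >> >     // One word per line; only non-compound words, lowercased.
> >> >> >     CharArraySet dictionary;
> >> >> >     try (BufferedReader reader = Files.newBufferedReader(
> >> >> >         Paths.get("analysis/dictionary-de.txt"), StandardCharsets.UTF_8)) {
> >> >> >       dictionary = WordlistLoader.getWordSet(reader);
> >> >> >     }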
> >> >> >
> >> >> > This dictionary works for most cases. I converted it a bit, merged
> >> >> > some files, and finally lowercased it, and now I have a working
> >> >> > solution.
> >> >> >
> >> >> > The settings for the hyphenation compound filter are (Elasticsearch):
> >> >> >
> >> >> >     "german_decompounder": {
> >> >> >         "type": "hyphenation_decompounder",
> >> >> >         "word_list_path": "analysis/dictionary-de.txt",
> >> >> >         "hyphenation_patterns_path": "analysis/de_DR.xml",
> >> >> >         "only_longest_match": true,
> >> >> >         "min_subword_size": 4
> >> >> >     },
> >> >> >
> >> >> > The "only_longest_match" setting is important, because our dictionary
> >> >> > deliberately contains only "single words" (including some words that
> >> >> > look like compounds but aren't, because they were glued together long
> >> >> > ago; compare English, where "policeman" is not written "police man",
> >> >> > because it is a word in its own right). So the longest match is always
> >> >> > safe, as we have a well-maintained dictionary.
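> >> >> >
> >> >> > In plain Lucene (rather than Elasticsearch), the equivalent chain can
> >> >> > be sketched like this, reusing the hyphenator and dictionary loaded
> >> >> > above (again a minimal, untested sketch; the surrounding analysis
> >> >> > chain is illustrative):
> >> >> >
> >> >> >     import org.apache.lucene.analysis.Analyzer;
> >> >> >     import org.apache.lucene.analysis.LowerCaseFilter;
> >> >> >     import org.apache.lucene.analysis.TokenStream;
> >> >> >     import org.apache.lucene.analysis.standard.StandardTokenizer;
> >> >> >
> >> >> >     Analyzer analyzer = new Analyzer() {
> >> >> >       @Override
> >> >> >       protected TokenStreamComponents createComponents(String fieldName) {
> >> >> >         StandardTokenizer source = new StandardTokenizer();
> >> >> >         // Lowercase first, because the dictionary is lowercased.
> >> >> >         TokenStream result = new LowerCaseFilter(source);
> >> >> >         result = new HyphenationCompoundWordTokenFilter(result,
> >> >> >             hyphenator, dictionary,
> >> >> >             5,     // minWordSize (Lucene's default)
> >> >> >             4,     // minSubwordSize, as in the config above
> >> >> >             15,    // maxSubwordSize (Lucene's default)
> >> >> >             true); // onlyLongestMatch
> >> >> >         return new TokenStreamComponents(source, result);
> >> >> >       }
> >> >> >     };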
> >> >> >
> >> >> > If you are interested, I can send you a ZIP file with both files.
> >> >> > Maybe I should check them into GitHub, but I have to check the
> >> >> > licenses first.
> >> >> >
> >> >> > Uwe
> >> >> >
> >> >> > -----
> >> >> > Uwe Schindler
> >> >> > Achterdiek 19, D-28357 Bremen
> >> >> > http://www.thetaphi.de
> >> >> > eMail: u...@thetaphi.de
> >> >> >
> >> >> > > -----Original Message-----
> >> >> > > From: Michael McCandless [mailto:luc...@mikemccandless.com]
> >> >> > > Sent: Saturday, September 16, 2017 12:58 AM
> >> >> > > To: Lucene Users <java-user@lucene.apache.org>
> >> >> > > Subject: German decompounding/tokenization with Lucene?
> >> >> > >
> >> >> > > Hello,
> >> >> > >
> >> >> > > I need to index documents with German text in Lucene, and I'm
> >> >> > > wondering how people have done this in the past.
> >> >> > >
> >> >> > > Lucene already has a DictionaryCompoundWordTokenFilter ... is this
> >> >> > > what people use? Are there good, open-source-friendly German
> >> >> > > dictionaries available?
> >> >> > >
> >> >> > > Thanks,
> >> >> > >
> >> >> > > Mike McCandless
> >> >> > >
> >> >> > > http://blog.mikemccandless.com
> >>
> >> --
> >> Uwe Schindler
> >> Achterdiek 19, 28357 Bremen
> >> https://www.thetaphi.de
>
> --
> Uwe Schindler
> Achterdiek 19, 28357 Bremen
> https://www.thetaphi.de

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org