Hello Uwe,

Thanks for getting rid of the compounds. The dictionary could still be smaller, though: it has about 1,500 duplicates, and it is also unsorted.
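For what it's worth, sorting and deduplicating the word list is a quick one-off job. Here is a rough sketch in Java (the input path is just the one from your Elasticsearch settings below; the output file name is made up):

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.TreeSet;

public class DedupeWordList {
  public static void main(String[] args) throws IOException {
    Path in = Paths.get("analysis/dictionary-de.txt");          // word list from the settings below
    Path out = Paths.get("analysis/dictionary-de.sorted.txt");  // hypothetical output file
    // TreeSet sorts lexicographically and drops exact duplicates in one pass
    TreeSet<String> unique = new TreeSet<>();
    for (String line : Files.readAllLines(in, StandardCharsets.UTF_8)) {
      String word = line.trim();
      if (!word.isEmpty()) {
        unique.add(word);
      }
    }
    Files.write(out, unique, StandardCharsets.UTF_8);
  }
}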
Regards,
Markus

-----Original message-----
> From: Uwe Schindler <u...@thetaphi.de>
> Sent: Saturday 16th September 2017 12:16
> To: java-user@lucene.apache.org
> Subject: RE: German decompounding/tokenization with Lucene?
>
> Hi,
>
> I published my work on GitHub:
>
> https://github.com/uschindler/german-decompounder
>
> Have fun. I am not yet 100% sure about the license of the data file. The original
> author (Björn Jacke) did not publish any license, but LibreOffice publishes his
> files under the LGPL. So to be safe, I applied the same license to my own work.
>
> Uwe
>
> -----
> Uwe Schindler
> Achterdiek 19, D-28357 Bremen
> http://www.thetaphi.de
> eMail: u...@thetaphi.de
>
> > -----Original Message-----
> > From: Uwe Schindler [mailto:u...@thetaphi.de]
> > Sent: Saturday, September 16, 2017 9:49 AM
> > To: java-user@lucene.apache.org
> > Subject: RE: German decompounding/tokenization with Lucene?
> >
> > Hi Michael,
> >
> > I had this issue just yesterday. I have done this several times and have built a
> > good dictionary in the meantime.
> >
> > I have an example for Solr or Elasticsearch with the same data. It uses the
> > HyphenationCompoundWordTokenFilter, but with the hyphenation ZIP file *and* the
> > dictionary (it's important to have both). The dictionary-only approach is just
> > too slow and creates wrong matches, too.
> >
> > The rules file is the one from the OpenOffice hyphenation files. Just take it as
> > is (keep in mind that you need to use the "old" version of the ZIP file, not the
> > latest version, as the XML format was changed). The dictionary is more important:
> > it should contain only the "single words", no compounds at all. This is hard to
> > get, but there is a ngerman98.zip file available with an ispell dictionary
> > (https://www.j3e.de/ispell/igerman98/). This dictionary has several variants,
> > one of which contains only the single non-compound words (about 17,000 items).
> > This works for most cases. I converted the dictionary a bit, merged some files,
> > and finally lowercased it, and now I have a working solution.
> >
> > The settings for the hyphenation compound filter are (Elasticsearch):
> >
> > "german_decompounder": {
> >   "type": "hyphenation_decompounder",
> >   "word_list_path": "analysis/dictionary-de.txt",
> >   "hyphenation_patterns_path": "analysis/de_DR.xml",
> >   "only_longest_match": true,
> >   "min_subword_size": 4
> > },
> >
> > The "only_longest_match" setting is important, because our dictionary only
> > contains "single words" (plus some words that look like compounds but aren't,
> > because they were glued together; compare English, where "policeman" is not
> > written "police man" because it is a word of its own). So the longest match is
> > always safe, as we have a well-maintained dictionary.
> >
> > If you are interested, I can send you a ZIP file with both files. Maybe I should
> > check them into GitHub, but I have to check the licenses first.
> >
> > Uwe
> >
> > -----
> > Uwe Schindler
> > Achterdiek 19, D-28357 Bremen
> > http://www.thetaphi.de
> > eMail: u...@thetaphi.de
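For plain Lucene (Mike's original question is quoted below), the same setup can be wired up directly with HyphenationCompoundWordTokenFilter. The following is only a rough sketch, not the exact code from this thread: the file names are taken from the settings above, the analyzer class is invented for illustration, and some package locations differ slightly between Lucene versions.

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.CharArraySet;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.compound.CompoundWordTokenFilterBase;
import org.apache.lucene.analysis.compound.HyphenationCompoundWordTokenFilter;
import org.apache.lucene.analysis.compound.hyphenation.HyphenationTree;
import org.apache.lucene.analysis.standard.StandardTokenizer;

public final class GermanDecompoundAnalyzer extends Analyzer {

  private final HyphenationTree hyphenator;
  private final CharArraySet dictionary;

  public GermanDecompoundAnalyzer(HyphenationTree hyphenator, CharArraySet dictionary) {
    this.hyphenator = hyphenator;
    this.dictionary = dictionary;
  }

  @Override
  protected TokenStreamComponents createComponents(String fieldName) {
    Tokenizer source = new StandardTokenizer();
    TokenStream result = new LowerCaseFilter(source);
    // decompound against the hyphenation grammar, keeping only parts found in the dictionary
    result = new HyphenationCompoundWordTokenFilter(
        result, hyphenator, dictionary,
        CompoundWordTokenFilterBase.DEFAULT_MIN_WORD_SIZE,
        4,     // minSubwordSize, as in "min_subword_size": 4 above
        CompoundWordTokenFilterBase.DEFAULT_MAX_SUBWORD_SIZE,
        true); // onlyLongestMatch, as in "only_longest_match": true above
    return new TokenStreamComponents(source, result);
  }

  public static GermanDecompoundAnalyzer load(String patternsFile, String dictionaryFile) throws IOException {
    // the hyphenation grammar XML (e.g. de_DR.xml) and the lowercased word list
    HyphenationTree tree = HyphenationCompoundWordTokenFilter.getHyphenationTree(patternsFile);
    CharArraySet words = new CharArraySet(
        Files.readAllLines(Paths.get(dictionaryFile), StandardCharsets.UTF_8), true);
    return new GermanDecompoundAnalyzer(tree, words);
  }
}

GermanDecompoundAnalyzer.load("analysis/de_DR.xml", "analysis/dictionary-de.txt") then yields an Analyzer that can be handed to an IndexWriterConfig as usual.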
> > > -----Original Message-----
> > > From: Michael McCandless [mailto:luc...@mikemccandless.com]
> > > Sent: Saturday, September 16, 2017 12:58 AM
> > > To: Lucene Users <java-user@lucene.apache.org>
> > > Subject: German decompounding/tokenization with Lucene?
> > >
> > > Hello,
> > >
> > > I need to index documents with German text in Lucene, and I'm wondering
> > > how people have done this in the past?
> > >
> > > Lucene already has a DictionaryCompoundWordTokenFilter ... is this what
> > > people use? Are there good, open-source-friendly German dictionaries
> > > available?
> > >
> > > Thanks,
> > >
> > > Mike McCandless
> > >
> > > http://blog.mikemccandless.com
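Continuing the sketch above, a quick way to check what such an analyzer emits for a sample compound is to consume its TokenStream directly. Which subwords actually appear depends entirely on the dictionary and the hyphenation grammar in use; the file names and the sample word are only illustrative.

import java.io.IOException;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class DecompoundDemo {
  public static void main(String[] args) throws IOException {
    Analyzer analyzer = GermanDecompoundAnalyzer.load(
        "analysis/de_DR.xml", "analysis/dictionary-de.txt");
    // feed a sample compound through the analysis chain and print every token it produces
    try (TokenStream ts = analyzer.tokenStream("body", "Versicherungsgesellschaft")) {
      CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
      ts.reset();
      while (ts.incrementToken()) {
        System.out.println(term.toString());
      }
      ts.end();
    }
    analyzer.close();
  }
}

The compound filter keeps the original token and adds the dictionary-confirmed parts at the same position, so both the full compound and its parts become searchable.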