Hi Jörn,

[Sent again without the picture, since Apache rejects those, unfortunately...]
You just need monolingual text, so I suggest downloading either the tokenized or untokenized versions. Unfortunately, OPUS doesn't make it easy to provide direct links to individual languages. But do this:

1. Go to http://opus.lingfil.uu.se
2. Choose de → en (or some other language pair)
3. In the "mono" or "raw" columns (depending on whether you want tokenized or untokenized text), click the language file for the dataset you want.

Re: training the sentence detector -- I've put a rough code sketch in a P.S. at the bottom of this message.

matt

> On Jan 12, 2017, at 6:07 AM, Joern Kottmann <kottm...@gmail.com> wrote:
>
> Do you have a pointer to an actual file? Or a download package?
>
> Jörn
>
> On Wed, Jan 11, 2017 at 11:33 AM, Tommaso Teofili <tommaso.teof...@gmail.com> wrote:
>
>> I think the parallel corpora are taken from [1], so we could start with
>> training sentdetect for the language packs at [2].
>>
>> Regards,
>> Tommaso
>>
>> [1] : http://opus.lingfil.uu.se/
>> [2] : https://cwiki.apache.org/confluence/display/JOSHUA/Language+Packs
>>
>> On Mon, Jan 9, 2017 at 11:39 AM, Joern Kottmann <kottm...@gmail.com> wrote:
>>
>>> Sorry for the late reply -- can you point me to a link for the parallel
>>> corpus? We might just want to add format support for it to OpenNLP.
>>>
>>> Do you use tokenize.pl for all languages, or do you have
>>> language-specific heuristics? It would be great to have an additional,
>>> more capable rule-based tokenizer in OpenNLP.
>>>
>>> The sentence splitter can be trained on a few thousand sentences or so;
>>> I think that will work out nicely.
>>>
>>> Jörn
>>>
>>> On Wed, Dec 21, 2016 at 7:24 PM, Matt Post <p...@cs.jhu.edu> wrote:
>>>
>>>>
>>>>> On Dec 21, 2016, at 10:36 AM, Joern Kottmann <kottm...@gmail.com> wrote:
>>>>>
>>>>> I am happy to support a bit with this; we can also see if things in
>>>>> OpenNLP need to be changed to make this work smoothly.
>>>>
>>>> Great!
>>>>
>>>>
>>>>> One challenge is to train OpenNLP on all the languages you support.
>>>>> Do you have training data that could be used to train the tokenizer
>>>>> and sentence detector?
>>>>
>>>> For the sentence splitter, I imagine you could make use of the source
>>>> side of our parallel corpus, which has thousands to millions of
>>>> sentences, one per line.
>>>>
>>>> For tokenization (and normalization), we don't typically train models
>>>> but instead use a set of manually developed heuristics, which may or
>>>> may not be language-specific. See
>>>>
>>>> https://github.com/apache/incubator-joshua/blob/master/scripts/preparation/tokenize.pl
>>>>
>>>> How much training data do you generally need for each task?
>>>>
>>>>
>>>>>
>>>>> Jörn
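
P.S. Re: training the sentence detector on one-sentence-per-line text like the OPUS source side -- here's a minimal, untested sketch against the OpenNLP 1.6+ Java API. The file names ("corpus.de", "de-sent.bin") are placeholders:

import java.io.File;
import java.io.FileOutputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

import opennlp.tools.sentdetect.SentenceDetectorFactory;
import opennlp.tools.sentdetect.SentenceDetectorME;
import opennlp.tools.sentdetect.SentenceModel;
import opennlp.tools.sentdetect.SentenceSample;
import opennlp.tools.sentdetect.SentenceSampleStream;
import opennlp.tools.util.MarkableFileInputStreamFactory;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;
import opennlp.tools.util.TrainingParameters;

public class TrainSentenceDetector {
    public static void main(String[] args) throws Exception {
        // One sentence per line, e.g. the German side of a parallel corpus
        // downloaded from OPUS ("corpus.de" is a placeholder path).
        ObjectStream<String> lines = new PlainTextByLineStream(
            new MarkableFileInputStreamFactory(new File("corpus.de")),
            StandardCharsets.UTF_8);

        // SentenceSampleStream turns consecutive lines into training samples;
        // a blank line marks a document boundary.
        try (ObjectStream<SentenceSample> samples = new SentenceSampleStream(lines)) {
            SentenceModel model = SentenceDetectorME.train(
                "de",
                samples,
                new SentenceDetectorFactory("de", true, null, null),
                TrainingParameters.defaultParams());

            // Write the trained model to disk.
            try (OutputStream out = new FileOutputStream("de-sent.bin")) {
                model.serialize(out);
            }
        }
    }
}

The command-line equivalent should be something like this (also untested on my end, so treat it as a starting point):

    bin/opennlp SentenceDetectorTrainer -lang de -data corpus.de -encoding UTF-8 -model de-sent.bin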
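
P.P.S. On the rule-based tokenizer question: tokenize.pl is essentially a long list of regex substitutions. Just to illustrate the general shape in Java -- these three rules are invented for the example and are nowhere near the real rule set:

import java.util.regex.Pattern;

public class SimpleRuleTokenizer {
    // Illustrative Moses/tokenize.pl-style rules, NOT the actual heuristics.
    private static final Pattern PUNCT = Pattern.compile("([,;:!?\"()\\[\\]])");
    private static final Pattern FINAL_PERIOD = Pattern.compile("\\.$");
    private static final Pattern SPACES = Pattern.compile("\\s+");

    public static String tokenize(String line) {
        String s = PUNCT.matcher(line).replaceAll(" $1 ");   // pad punctuation with spaces
        s = FINAL_PERIOD.matcher(s.trim()).replaceAll(" ."); // split off sentence-final period
        return SPACES.matcher(s).replaceAll(" ").trim();     // normalize whitespace
    }

    public static void main(String[] args) {
        System.out.println(tokenize("Hello, world (again)."));
        // -> "Hello , world ( again ) ."
    }
}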