Sorry for the late reply. Can you point me to a link for the parallel corpus? We might just want to add format support for it to OpenNLP.
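For the sentence detector, a one-sentence-per-line file should drop almost straight into the existing training API, so the format work might be small. A rough, untested sketch of what that could look like (the file name, language code and factory settings are just placeholders):

import java.io.File;
import java.io.FileOutputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

import opennlp.tools.sentdetect.SentenceDetectorFactory;
import opennlp.tools.sentdetect.SentenceDetectorME;
import opennlp.tools.sentdetect.SentenceModel;
import opennlp.tools.sentdetect.SentenceSample;
import opennlp.tools.sentdetect.SentenceSampleStream;
import opennlp.tools.util.MarkableFileInputStreamFactory;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;
import opennlp.tools.util.TrainingParameters;

public class SentenceDetectorTrainingSketch {

    public static void main(String[] args) throws Exception {
        // "corpus.en" is a placeholder for the source side of the
        // parallel corpus, one sentence per line.
        ObjectStream<String> lines = new PlainTextByLineStream(
                new MarkableFileInputStreamFactory(new File("corpus.en")),
                StandardCharsets.UTF_8);

        // Wrap the raw lines as sentence detector training samples.
        ObjectStream<SentenceSample> samples = new SentenceSampleStream(lines);

        SentenceModel model = SentenceDetectorME.train("en", samples,
                new SentenceDetectorFactory("en", true, null, null),
                TrainingParameters.defaultParams());

        try (OutputStream out = new FileOutputStream("en-sent.bin")) {
            model.serialize(out);
        }
        samples.close();
    }
}

The same thing is available from the command line via the SentenceDetectorTrainer tool.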
Do you use tokenize.pl for all languages, or do you have language-specific heuristics? It would be great to have an additional, more capable rule-based tokenizer in OpenNLP. The sentence splitter can be trained on a few thousand sentences or so; I think that will work out nicely.

Jörn

On Wed, Dec 21, 2016 at 7:24 PM, Matt Post <p...@cs.jhu.edu> wrote:
>
> On Dec 21, 2016, at 10:36 AM, Joern Kottmann <kottm...@gmail.com> wrote:
> >
> > I am happy to support a bit with this, we can also see if things in
> > OpenNLP need to be changed to make this work smoothly.
>
> Great!
>
> > One challenge is to train OpenNLP on all the languages you support. Do
> > you have training data that could be used to train the tokenizer and
> > sentence detector?
>
> For the sentence-splitter, I imagine you could make use of the source side
> of our parallel corpus, which has thousands to millions of sentences, one
> per line.
>
> For tokenization (and normalization), we don't typically train models but
> instead use a set of manually developed heuristics, which may or may not be
> sentence-specific. See
>
> https://github.com/apache/incubator-joshua/blob/master/scripts/preparation/tokenize.pl
>
> How much training data do you generally need for each task?
>
> > Jörn
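P.S. For completeness, the tokenizer can also be trained through the API instead of relying on rules; a rough, untested sketch, assuming training data in OpenNLP's <SPLIT>-annotated format (file names and settings are placeholders):

import java.io.File;
import java.io.FileOutputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

import opennlp.tools.tokenize.TokenSample;
import opennlp.tools.tokenize.TokenSampleStream;
import opennlp.tools.tokenize.TokenizerFactory;
import opennlp.tools.tokenize.TokenizerME;
import opennlp.tools.tokenize.TokenizerModel;
import opennlp.tools.util.MarkableFileInputStreamFactory;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;
import opennlp.tools.util.TrainingParameters;

public class TokenizerTrainingSketch {

    public static void main(String[] args) throws Exception {
        // "en-token.train" is a placeholder: one sentence per line, with
        // token boundaries that are not whitespace marked by <SPLIT>, e.g.
        //   Pierre Vinken<SPLIT>, 61 years old<SPLIT>, will join the board<SPLIT>.
        ObjectStream<String> lines = new PlainTextByLineStream(
                new MarkableFileInputStreamFactory(new File("en-token.train")),
                StandardCharsets.UTF_8);

        // Parse the <SPLIT>-annotated lines into tokenizer training samples.
        ObjectStream<TokenSample> samples = new TokenSampleStream(lines);

        TokenizerModel model = TokenizerME.train(samples,
                new TokenizerFactory("en", null, false, null),
                TrainingParameters.defaultParams());

        try (OutputStream out = new FileOutputStream("en-token.bin")) {
            model.serialize(out);
        }
        samples.close();
    }
}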