Hi Jörn,

[Sent again without the picture since Apache rejects those, unfortunately...]

You just need monolingual text, so I suggest downloading either the tokenized 
or untokenized versions. Unfortunately, OPUS doesn't make it easy to provide 
direct links to individual languages. But do this:

1. Go to http://opus.lingfil.uu.se/

2. Choose de → en (or some other language pair)

3. In the "mono" or "raw" columns (depending on whether you want tokenized or 
untokenized text), click the language file for the dataset you want.
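
For example, once you've copied the actual file link from the page, fetching 
and unpacking it might look like this (an untested Java sketch; the URL below 
is a made-up placeholder):

        import java.io.InputStream;
        import java.net.URL;
        import java.nio.file.*;
        import java.util.zip.GZIPInputStream;

        public class FetchMono {
            public static void main(String[] args) throws Exception {
                // Placeholder -- substitute the real link copied from the OPUS page.
                URL url = new URL("http://opus.lingfil.uu.se/SOME_CORPUS/mono/de.txt.gz");
                try (InputStream in = new GZIPInputStream(url.openStream())) {
                    // Decompress on the fly and write plain text to disk.
                    Files.copy(in, Paths.get("de.txt"), StandardCopyOption.REPLACE_EXISTING);
                }
            }
        }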

matt


> On Jan 12, 2017, at 6:07 AM, Joern Kottmann <kottm...@gmail.com> wrote:
> 
> Do you have a pointer to an actual file? Or a download package?
> 
> Jörn
> 
> On Wed, Jan 11, 2017 at 11:33 AM, Tommaso Teofili
> <tommaso.teof...@gmail.com> wrote:
> 
>> I think the parallel corpora are taken from [1], so we could start with
>> training sentdetect for the language packs at [2].
>> 
>> Regards,
>> Tommaso
>> 
>> [1] : http://opus.lingfil.uu.se/
>> [2] : https://cwiki.apache.org/confluence/display/JOSHUA/Language+Packs
>> 
>> On Mon, Jan 9, 2017 at 11:39 AM, Joern Kottmann <kottm...@gmail.com>
>> wrote:
>> 
>>> Sorry for the late reply. Can you point me to a link for the parallel
>>> corpus?
>>> We might just want to add support for its format to OpenNLP.
>>> 
>>> Do you use tokenize.pl for all languages, or do you have
>>> language-specific heuristics?
>>> It would be great to have an additional, more capable rule-based
>>> tokenizer in OpenNLP.
>>> 
>>> The sentence splitter can be trained on a few thousand sentences or so;
>>> I think that will work out nicely.
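>>> 
>>> With one sentence per line, training might look roughly like this (an
>>> untested sketch against the OpenNLP API; file names and the language
>>> code are placeholders):
>>> 
>>>     import java.io.*;
>>>     import java.nio.charset.StandardCharsets;
>>>     import opennlp.tools.sentdetect.*;
>>>     import opennlp.tools.util.*;
>>> 
>>>     public class TrainSentDetect {
>>>         public static void main(String[] args) throws Exception {
>>>             // One sentence per line; blank lines separate documents.
>>>             ObjectStream<String> lines = new PlainTextByLineStream(
>>>                 new MarkableFileInputStreamFactory(new File("sentences.txt")),
>>>                 StandardCharsets.UTF_8);
>>>             ObjectStream<SentenceSample> samples = new SentenceSampleStream(lines);
>>> 
>>>             // Train a maxent sentence detector and save the model.
>>>             SentenceModel model = SentenceDetectorME.train("de", samples,
>>>                 new SentenceDetectorFactory("de", true, null, null),
>>>                 TrainingParameters.defaultParams());
>>>             try (OutputStream out = new BufferedOutputStream(
>>>                     new FileOutputStream("de-sent.bin"))) {
>>>                 model.serialize(out);
>>>             }
>>>         }
>>>     }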
>>> 
>>> Jörn
>>> 
>>> On Wed, Dec 21, 2016 at 7:24 PM, Matt Post <p...@cs.jhu.edu> wrote:
>>> 
>>>> 
>>>>> On Dec 21, 2016, at 10:36 AM, Joern Kottmann <kottm...@gmail.com>
>>>>> wrote:
>>>>> 
>>>>> I am happy to help out a bit with this; we can also see if things in
>>>>> OpenNLP need to be changed to make this work smoothly.
>>>> 
>>>> Great!
>>>> 
>>>> 
>>>>> One challenge is to train OpenNLP on all the languages you support.
>>>>> Do you have training data that could be used to train the tokenizer
>>>>> and sentence detector?
>>>> 
>>>> For the sentence-splitter, I imagine you could make use of the source
>>>> side of our parallel corpus, which has thousands to millions of
>>>> sentences, one per line.
>>>> 
>>>> For tokenization (and normalization), we don't typically train models
>>>> but instead use a set of manually developed heuristics, which may or
>>>> may not be language-specific. See
>>>> 
>>>>        https://github.com/apache/incubator-joshua/blob/master/scripts/preparation/tokenize.pl
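>>>> 
>>>> To give a flavor of the rules it applies, here's an illustrative Java
>>>> sketch (not the actual script; see the link above for the real thing):
>>>> 
>>>>        public class RoughTokenizer {
>>>>            public static String tokenize(String line) {
>>>>                String s = line.trim();
>>>>                // Pad common punctuation with spaces.
>>>>                s = s.replaceAll("([.,!?:;\"()\\[\\]])", " $1 ");
>>>>                // Undo splits inside numbers like "3.14".
>>>>                s = s.replaceAll("(\\d) \\. (\\d)", "$1.$2");
>>>>                // Collapse runs of whitespace.
>>>>                return s.replaceAll("\\s+", " ").trim();
>>>>            }
>>>> 
>>>>            public static void main(String[] args) {
>>>>                // Prints: Hello , world ( test ) .
>>>>                System.out.println(tokenize("Hello, world (test)."));
>>>>            }
>>>>        }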
>>>> 
>>>> How much training data do you generally need for each task?
>>>> 
>>>> 
>>>>> 
>>>>> Jörn
