Re: [Moses-support] Europarl source release and processing scripts

Philipp Koehn Wed, 16 May 2012 08:16:42 -0700

Hi,

yes, that's correct.


-phi

On Wed, May 16, 2012 at 7:57 AM, Lane Schwartz <[email protected]> wrote:
> Philipp,
>
> So was my guess about script order correct, then?
>
>
> * Download europarl.tgz
>
> * Untar europarl.tgz
>
> * Run ./sentence-align-source-corpus.perl ${src} ${tgt}
>
> * Run ./tools/split-sentences.perl -l ${src} < aligned/${src}/*.txt >
> corpus.${src}
> * Run ./tools/split-sentences.perl -l ${tgt} < aligned/${tgt}/*.txt >
> corpus.${tgt}
>
> * Run ./tools/de-xml.perl corpus ${src} ${tgt} corpus.noxml
>
> Thanks,
> Lane
>
>
> On Wed, May 16, 2012 at 10:13 AM, Philipp Koehn <[email protected]> wrote:
>> Hi,
>>
>> The XML tags help with the alignment, so they need to be stripped after
>> sentence alignment, not before it.
>>
>> -phi
>>
>> On 16 May 2012 03:58, "Tom Hoar" <[email protected]>
>> wrote:
>>>
>>> Interesting question, Lane. I would have expected an almost opposite
>>> order, although I haven't looked at the commands for a while and using them
>>> is on my list of things to do. Can someone who has used them shed some
>>> light?
>>>
>>> * Download europarl.tgz
>>> * Untar europarl.tgz
>>> * Run ./tools/de-xml.perl < ${tgt}.txt > ${tgt}.noxml.txt
>>> * Run ./tools/de-xml.perl < ${src}.txt > ${src}.noxml.txt
>>> * Run ./tools/split-sentences.perl -l ${src} < ${src}.noxml.txt >
>>> ${src}.noxml.split.txt
>>> * Run ./tools/split-sentences.perl -l ${tgt} < ${tgt}.noxml.txt >
>>> ${tgt}.noxml.split.txt
>>> * Run ./sentence-align-source-corpus.perl ${src}.noxml.split.txt
>>> ${tgt}.noxml.split.txt
>>>
>>>
>>> On Fri, 11 May 2012 15:44:05 -0400, Lane Schwartz <[email protected]>
>>> wrote:
>>>
>>> I'm interested in being able to go from the Europarl v6 source tarball
>>> (http://www.statmt.org/europarl/v6/europarl.tgz) to a sentence-aligned
>>> parallel corpus for any given language pair.
>>>
>>> After extracting the contents of the tarball, there is a txt directory
>>> with one subdirectory per language (e.g. txt/da). There are also three
>>> scripts: sentence-align-corpus.perl, tools/split-sentences.perl, and
>>> tools/de-xml.perl.
>>>
>>> I assume the order of running these scripts is as follows. If someone more
>>> familiar with the Europarl release process could confirm or correct this,
>>> that would be appreciated.
>>>
>>> * Download europarl.tgz
>>>
>>> * Untar europarl.tgz
>>>
>>> * Run ./sentence-align-source-corpus.perl ${src} ${tgt}
>>>
>>> * Run ./tools/split-sentences.perl -l ${src} < aligned/${src}/*.txt >
>>> corpus.${src}
>>> * Run ./tools/split-sentences.perl -l ${tgt} < aligned/${tgt}/*.txt >
>>> corpus.${tgt}
>>>
>>> * Run ./tools/de-xml.perl corpus ${src} ${tgt} corpus.noxml
>>>
>>>
>>> Is this the recommended process for deriving a sentence-aligned parallel
>>> corpus for a language pair from the Europarl source release?
>>>
>>> Thanks,
>>> Lane Schwartz
>>>
>>>
>>>
>>> _______________________________________________
>>> Moses-support mailing list
>>> [email protected]
>>> http://mailman.mit.edu/mailman/listinfo/moses-support
>>>
>>
>
>
>
> --
> When a place gets crowded enough to require ID's, social collapse is not
> far away.  It is time to go elsewhere.  The best thing about space travel
> is that it made it possible to go elsewhere.
>                 -- R.A. Heinlein, "Time Enough For Love"
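
For reference, the order confirmed in this thread (align first, then split sentences, then strip the XML tags) could be sketched as a small dry-run script. This is only an illustration: the function name europarl_steps is made up, the script names, arguments, and aligned/ layout are copied from the messages above and have not been checked against the tarball, and the script prints the commands rather than running them. It uses cat instead of "< aligned/.../*.txt" because a shell redirection cannot read from a glob that matches several files.

```shell
#!/bin/sh
# Dry-run sketch: prints the commands in the order confirmed in this thread
# instead of executing them. Script names, arguments, and the aligned/ layout
# are taken from the messages above and are assumptions, not verified facts.

europarl_steps() {
    src=$1
    tgt=$2
    # 1. Sentence-align while the XML tags are still present; they help the aligner.
    printf '%s\n' "./sentence-align-source-corpus.perl $src $tgt"
    # 2. Split sentences, giving each side its own language code.
    printf '%s\n' "cat aligned/$src/*.txt | ./tools/split-sentences.perl -l $src > corpus.$src"
    printf '%s\n' "cat aligned/$tgt/*.txt | ./tools/split-sentences.perl -l $tgt > corpus.$tgt"
    # 3. Strip the XML tags last, after alignment.
    printf '%s\n' "./tools/de-xml.perl corpus $src $tgt corpus.noxml"
}

# Default to Danish-English, the example language directory mentioned in the thread.
europarl_steps "${1:-da}" "${2:-en}"
```

Once the printed commands look right for a given language pair, they can be piped to sh from the directory where the tarball was extracted.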
