Hi,

The XML tags help with the alignment, so they need to be stripped after
sentence alignment, not before it.
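
To illustrate why the tags matter, here is a minimal Python sketch. It is
not the Europarl aligner itself. It only shows tag lines being used as
anchors to pair up blocks of text from the two languages, with the markup
discarded afterwards. The tag shapes (<CHAPTER ID=1>, <SPEAKER ID=1
NAME="...">) and the function names blocks_by_anchor and pair_blocks are
assumptions made for illustration.

```python
import re

# Assumed tag shapes: <CHAPTER ID=1> and <SPEAKER ID=2 NAME="...">.
ANCHOR = re.compile(r'^<(CHAPTER|SPEAKER)\s+ID="?(\d+)"?')
ANY_TAG = re.compile(r"^<[^>]*>\s*$")


def blocks_by_anchor(lines):
    """Group text lines under the most recent (chapter, speaker) anchor."""
    blocks = {}
    chapter = speaker = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        m = ANCHOR.match(line)
        if m:
            if m.group(1) == "CHAPTER":
                # A new chapter resets the current speaker.
                chapter, speaker = m.group(2), None
            else:
                speaker = m.group(2)
            blocks.setdefault((chapter, speaker), [])
        elif ANY_TAG.match(line):
            # Other markup such as <P>: neither an anchor nor text.
            continue
        elif (chapter, speaker) in blocks:
            blocks[(chapter, speaker)].append(line)
    return blocks


def pair_blocks(src_lines, tgt_lines):
    """Pair blocks whose anchors occur on both sides; tags are dropped here."""
    src = blocks_by_anchor(src_lines)
    tgt = blocks_by_anchor(tgt_lines)
    return [(key, src[key], tgt[key]) for key in src if key in tgt]
```

If the tags were stripped first, both files would be flat lists of lines,
and a speaker turn present in only one language would shift everything
after it.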

-phi

On 16 May 2012 03:58, "Tom Hoar" <[email protected]>
wrote:

> Interesting question, Lane. I would have expected an almost opposite
> order, although I haven't looked at the commands for a while and using them
> is on my list of things to do. Can someone who has used them shed some
> light?
>
> * Download europarl.tgz
> * Untar europarl.tgz
> * Run ./tools/de-xml.perl < ${tgt}.txt > ${tgt}.noxml.txt
> * Run ./tools/de-xml.perl < ${src}.txt > ${src}.noxml.txt
> * Run ./tools/split-sentences.perl -l ${src} < ${src}.noxml.txt >
> ${src}.noxml.split.txt
> * Run ./tools/split-sentences.perl -l ${tgt} < ${tgt}.noxml.txt >
> ${tgt}.noxml.split.txt
> * Run ./sentence-align-source-corpus.perl ${src}.noxml.split.txt
> ${tgt}.noxml.split.txt
>
>
> On Fri, 11 May 2012 15:44:05 -0400, Lane Schwartz <[email protected]>
> wrote:
>
> I'm interested in being able to go from the Europarl v6 source tarball (
> http://www.statmt.org/europarl/v6/europarl.tgz) to a sentence-aligned
> parallel corpus for any given language pair.
>
> After extracting the contents of the tarball, there is one directory per
> language stored in a txt dir (e.g. txt/da). There are also three scripts:
> sentence-align-corpus.perl, tools/split-sentences.perl, and
> tools/de-xml.perl.
>
> I assume the order of running these scripts is as follows. If someone more
> familiar with the Europarl release process could confirm or correct this,
> that would be appreciated.
>
> * Download europarl.tgz
>
> * Untar europarl.tgz
>
> * Run ./sentence-align-source-corpus.perl ${src} ${tgt}
>
> * Run ./tools/split-sentences.perl -l ${src} < aligned/${src}/*.txt >
> corpus.${src}
> * Run ./tools/split-sentences.perl -l ${tgt} < aligned/${tgt}/*.txt >
> corpus.${tgt}
>
> * Run ./tools/de-xml.perl corpus ${src} ${tgt} corpus.noxml
>
>
> Is this the recommended process for deriving a sentence-aligned parallel
> corpus for a language pair from the Europarl source release?
>
> Thanks,
> Lane Schwartz
>
>
>
> _______________________________________________
> Moses-support mailing list
> [email protected]
> http://mailman.mit.edu/mailman/listinfo/moses-support
>
>