Re: [Moses-support] Europarl source release and processing scripts

Lane Schwartz Fri, 27 Jul 2012 13:54:37 -0700

So I finally got around to trying the Europarl processing scripts,
using the Europarl v7 data since it is now available. I've encountered
an oddity in the behavior of sentence-align-corpus.perl with respect
to its calls to tools/split-sentences.perl.


The problem is that in some cases, tools/split-sentences.perl appears to
split fewer sentences when it is called from sentence-align-corpus.perl
than when it is run on its own.

I noticed this problem in the French text of ep-99-12-17.txt. If you
take a look at line 9 of that file, there are 3 sentences on the one
line.

After running sentence-align-corpus.perl fr en, I took a look at
aligned/fr-en/fr/ep-99-12-17.txt. The first two sentences of the old
line 9 are still together, but the third sentence has been broken off
onto its own line.

However, if I run

    tools/split-sentences.perl -l fr < txt/fr/ep-99-12-17.txt

all three sentences from line 9 are split onto their own lines.
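
For anyone who wants to reproduce the comparison, here is a minimal
sketch. The helper names nth_line and compare_split and the /tmp output
path are hypothetical and not part of the Europarl tools; the only
commands and paths taken from this message are the ones shown in the
usage comments. The aligned output may also differ from the standalone
output for reasons other than sentence splitting, so treat the diff as
a starting point only.

```shell
#!/bin/sh
# Sketch only: these helpers are hypothetical, not part of the Europarl release.

# Print line N of FILE, e.g. line 9 of ep-99-12-17.txt.
nth_line() {
    sed -n "${2}p" "$1"
}

# Print the line counts of two splitter outputs, then the first part of their diff.
compare_split() {
    a_count=$(wc -l < "$1" | tr -d ' ')
    b_count=$(wc -l < "$2" | tr -d ' ')
    echo "lines: $a_count vs $b_count"
    # diff exits non-zero when the files differ; that is expected here.
    diff "$1" "$2" | head -n 20
    return 0
}

# Possible usage, run from the root of the unpacked release:
#   nth_line txt/fr/ep-99-12-17.txt 9
#   tools/split-sentences.perl -l fr < txt/fr/ep-99-12-17.txt > /tmp/standalone.fr
#   compare_split /tmp/standalone.fr aligned/fr-en/fr/ep-99-12-17.txt
```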

I've tried debugging this, to no avail. Does anyone have any idea why
this might be?

Thanks,
Lane



On Wed, May 16, 2012 at 10:57 AM, Lane Schwartz <[email protected]> wrote:
> Philipp,
>
> So was my guess about script order correct, then?
>
>
> * Download europarl.tgz
>
> * Untar europarl.tgz
>
> * Run ./sentence-align-source-corpus.perl ${src} ${tgt}
>
> * Run ./tools/split-sentences.perl -l ${src} < aligned/${src}/*.txt >
> corpus.${src}
> * Run ./tools/split-sentences.perl -l ${tgt} < aligned/${tgt}/*.txt >
> corpus.${tgt}
>
> * Run ./tools/de-xml.perl corpus ${src} ${tgt} corpus.noxml
>
> Thanks,
> Lane
>
>
> On Wed, May 16, 2012 at 10:13 AM, Philipp Koehn <[email protected]> wrote:
>> Hi,
>>
>> The XML tags help with the alignment, so they need to be stripped after
>> sentence alignment, not before it.
>>
>> -phi
>>
>> On 16 May 2012 03:58, "Tom Hoar" <[email protected]>
>> wrote:
>>>
>>> Interesting question, Lane. I would have expected an almost opposite
>>> order, although I haven't looked at the commands for a while and using them
>>> is on my list of things to do. Can someone who has used them shed some
>>> light?
>>>
>>> * Download europarl.tgz
>>> * Untar europarl.tgz
>>> * Run ./tools/de-xml.perl < ${tgt}.txt > ${tgt}.noxml.txt
>>> * Run ./tools/de-xml.perl < ${src}.txt > ${src}.noxml.txt
>>> * Run ./tools/split-sentences.perl -l ${src} < ${src}.noxml.txt >
>>> ${src}.noxml.split.txt
>>> * Run ./tools/split-sentences.perl -l ${tgt} < ${tgt}.noxml.txt >
>>> ${tgt}.noxml.split.txt
>>> * Run ./sentence-align-source-corpus.perl ${src}.noxml.split.txt
>>> ${tgt}.noxml.split.txt
>>>
>>>
>>> On Fri, 11 May 2012 15:44:05 -0400, Lane Schwartz <[email protected]>
>>> wrote:
>>>
>>> I'm interested in being able to go from the Europarl v6 source tarball
>>> (http://www.statmt.org/europarl/v6/europarl.tgz) to a sentence-aligned
>>> parallel corpus for any given language pair.
>>>
>>> After extracting the contents of the tarball, there is one directory per
>>> language stored in a txt dir (ex: txt/da). There is
>>> sentence-align-corpus.perl, tools/split-sentences.perl, and
>>> tools/de-xml.perl.
>>>
>>> I assume the order of running these scripts is as follows. If someone more
>>> familiar with the Europarl release process could confirm or correct this,
>>> that would be appreciated.
>>>
>>> * Download europarl.tgz
>>>
>>> * Untar europarl.tgz
>>>
>>> * Run ./sentence-align-source-corpus.perl ${src} ${tgt}
>>>
>>> * Run ./tools/split-sentences.perl -l ${src} < aligned/${src}/*.txt >
>>> corpus.${src}
>>> * Run ./tools/split-sentences.perl -l ${tgt} < aligned/${tgt}/*.txt >
>>> corpus.${tgt}
>>>
>>> * Run ./tools/de-xml.perl corpus ${src} ${tgt} corpus.noxml
>>>
>>>
>>> Is this the recommended process for deriving a sentence-aligned parallel
>>> corpus for a language pair from the Europarl source release?
>>>
>>> Thanks,
>>> Lane Schwartz
>>>
>>>
>>>
>>> _______________________________________________
>>> Moses-support mailing list
>>> [email protected]
>>> http://mailman.mit.edu/mailman/listinfo/moses-support
>>>
>>
>
>
>
> --
> When a place gets crowded enough to require ID's, social collapse is not
> far away.  It is time to go elsewhere.  The best thing about space travel
> is that it made it possible to go elsewhere.
>                 -- R.A. Heinlein, "Time Enough For Love"



