Hi,

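Here's a fun bug with the sentence splitter: it does not split when two
sentences are separated by a double space. First with two spaces after the
period, then with one: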

./split-sentences.perl
Sentence Splitter v0
Language: en
Please fix the sentence splitter.  It does not handle double spaces.
Please fix the sentence splitter. It does not handle double spaces.


./split-sentences.perl
Sentence Splitter v0
Language: en
Please fix the sentence splitter. It does not handle double spaces.
Please fix the sentence splitter.
It does not handle double spaces.
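
Until the script itself is fixed, one possible stopgap is to squeeze runs of
spaces before the text reaches the splitter, since the single-space case above
splits correctly. A minimal sketch (the helper name squeeze_spaces is
hypothetical, and it only touches plain spaces, not tabs or other whitespace):

```shell
# Hypothetical helper: collapse every run of spaces into a single space.
# Reads stdin, writes stdout; tabs and newlines are left untouched.
squeeze_spaces() {
    tr -s ' '
}

# Intended use (assumes split-sentences.perl is in the current directory):
#   squeeze_spaces < input.txt | ./split-sentences.perl -l en
```

This only works around the symptom; text where the double space is meaningful
would need a fix inside the splitter instead.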

Kenneth

On 05/11/2012 03:44 PM, Lane Schwartz wrote:
> I'm interested in being able to go from the Europarl v6 source tarball
> (http://www.statmt.org/europarl/v6/europarl.tgz) to a sentence-aligned
> parallel corpus for any given language pair.
>
> After extracting the contents of the tarball, there is one directory per
> language stored in a txt dir (e.g. txt/da). There are also three scripts:
> sentence-align-corpus.perl, tools/split-sentences.perl, and
> tools/de-xml.perl.
>
> I assume the order of running these scripts is as follows. If someone
> more familiar with the Europarl release process could confirm or correct
> this, that would be appreciated.
>
> * Download europarl.tgz
>
> * Untar europarl.tgz
>
> * Run ./sentence-align-corpus.perl ${src} ${tgt}
>
> * Run cat aligned/${src}/*.txt | ./tools/split-sentences.perl -l ${src} > corpus.${src}
>
> * Run cat aligned/${tgt}/*.txt | ./tools/split-sentences.perl -l ${tgt} > corpus.${tgt}
>
> * Run ./tools/de-xml.perl corpus ${src} ${tgt} corpus.noxml
>
>
> Is this the recommended process for deriving a sentence-aligned parallel
> corpus for a language pair from the Europarl source release?
>
> Thanks,
> Lane Schwartz
>
>
>
>
> _______________________________________________
> Moses-support mailing list
> [email protected]
> http://mailman.mit.edu/mailman/listinfo/moses-support