Hi,

the Perl script that was used to build this corpus is:

#!/usr/bin/perl -w

use strict;
my ($l) = @ARGV;

my $data = "/home/pkoehn/statmt/data/europarl-v7";
my $tools = "/home/pkoehn/statmt/data/europarl-v7/tools";
my $preprocessor = "$tools/split-sentences.perl -q";

die("ERROR: no data for language $l") unless -e "$data/txt/$l";
open(SPLIT,"cat $data/txt/$l/ep-00-0* $data/txt/$l/ep-0[123456789]* "
          ."$data/txt/$l/ep-[19]* | $preprocessor -l $l |");
while(<SPLIT>) {
  next if /^\s*$/;
  next if /^</;
  print $_;
}
close(SPLIT);


The sentence splitting code is in the tools package that comes
with the Europarl source release.
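
To make the filtering step concrete, here is a minimal sketch of what the
while loop keeps and drops. It assumes the sentence splitter passes markup
lines through on their own lines, which is why the tags do not need to be
stripped beforehand. The helper name keep_line and the sample lines are
made up for illustration; they are not part of the script or the corpus.

```perl
#!/usr/bin/perl -w
use strict;

# Hypothetical helper (not in the original script): returns true for
# lines that the while loop in the script would print.
sub keep_line {
  my ($line) = @_;
  return 0 if $line =~ /^\s*$/;  # blank line
  return 0 if $line =~ /^</;     # markup line such as <P> or <SPEAKER ...>
  return 1;
}

# Made-up sample of splitter output, for illustration only.
my @sample = (
  "<CHAPTER ID=1>\n",
  "Resumption of the session\n",
  "<SPEAKER ID=1 NAME=\"President\">\n",
  "I declare the session resumed.\n",
  "<P>\n",
  "\n",
  "Thank you.\n",
);

# Print only the lines that survive the two "next if" tests.
print grep { keep_line($_) } @sample;
```

On this sample only the three text lines are printed; every line that
starts with "<" and every blank line is dropped.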

-phi

On Tue, Jan 6, 2015 at 11:09 AM, Kenneth Heafield <[email protected]>
wrote:

> Dear Moses,
>
>         Where does this data come from?
>
> http://www.statmt.org/wmt13/training-monolingual-europarl-v7.tgz
>
> Specifically, if I wanted non-WMT languages, then I can download
> Europarl from http://www.statmt.org/europarl/ .
>
>         There are some tools, like a perl script to strip XML, but that
> also strips out <P> tags which are meant to be preserved for
> split-sentences.perl.  And I don't think split-sentences.perl was
> designed to run before stripping XML but could be wrong.
>
>         Does one write a custom XML strip program to remove all the tags
> except <P> then pass it to split-sentences.perl?
>
> Kenneth
> _______________________________________________
> Moses-support mailing list
> [email protected]
> http://mailman.mit.edu/mailman/listinfo/moses-support
>