Hi,

The Perl script used to build this corpus is:
#!/usr/bin/perl -w
use strict;

my ($l) = @ARGV;
my $data = "/home/pkoehn/statmt/data/europarl-v7";
my $tools = "/home/pkoehn/statmt/data/europarl-v7/tools";
my $preprocessor = "$tools/split-sentences.perl -q";

die("ERROR: no data for language $l") unless -e "$data/txt/$l";

# Concatenate the raw Europarl files and pipe them through the sentence splitter.
open(SPLIT, "cat $data/txt/$l/ep-00-0* $data/txt/$l/ep-0[123456789]* "
          . "$data/txt/$l/ep-[19]* | $preprocessor -l $l |");
while(<SPLIT>) {
  next if /^\s*$/;  # skip blank lines
  next if /^</;     # skip markup lines (those starting with '<')
  print $_;
}
close(SPLIT);
The sentence splitting code is in the tools package that comes
with the Europarl source release.
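
Note that the script above does not strip XML before sentence splitting: the raw files go through split-sentences.perl first, and lines starting with "<" are dropped afterwards. As a rough illustration only, the post-processing loop could be sketched in Python as below. The function name filter_split_output is hypothetical and not part of the Europarl tools; it only mirrors the two "next if" tests in the Perl loop.

```python
def filter_split_output(lines):
    """Yield sentence lines from the splitter output.

    Mirrors the Perl loop: skip blank lines and skip any line
    that starts with '<' (markup such as <P> or <SPEAKER ...>).
    """
    for line in lines:
        if not line.strip():      # same effect as /^\s*$/
            continue
        if line.startswith("<"):  # same effect as /^</
            continue
        yield line

# Hypothetical usage, reading splitter output from standard input:
#   import sys
#   sys.stdout.writelines(filter_split_output(sys.stdin))
```
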
-phi
On Tue, Jan 6, 2015 at 11:09 AM, Kenneth Heafield <[email protected]>
wrote:
> Dear Moses,
>
> Where does this data come from?
>
> http://www.statmt.org/wmt13/training-monolingual-europarl-v7.tgz
>
> Specifically, if I wanted non-WMT languages, then I can download
> Europarl from http://www.statmt.org/europarl/ .
>
> There are some tools, like a perl script to strip XML, but that also
> strips out <P> tags which are meant to be preserved for
> split-sentences.perl. And I don't think split-sentences.perl was
> designed to run before stripping XML but could be wrong.
>
> Does one write a custom XML strip program to remove all the tags except
> <P> then pass it to split-sentences.perl?
>
> Kenneth
> _______________________________________________
> Moses-support mailing list
> [email protected]
> http://mailman.mit.edu/mailman/listinfo/moses-support
>