Hi,
It seems that the WMT release is missing data. For example, line 4 of
en/ep-10-02.txt contains "21 September 2000", but that string does not
appear in the europarl-v7.en file from the WMT site. Why is that?
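
One thing worth checking is which source files the three globs in the
script quoted below actually pick up. Here is a minimal Python sketch,
not the script used for the release. It assumes the source files follow
the ep-YY-MM-DD.txt naming scheme, and the sample filenames are
hypothetical. Python's fnmatch only approximates shell globbing, which
should be close enough for these simple patterns.

```python
from fnmatch import fnmatchcase

# The three shell globs from the quoted build script (directory prefix dropped).
GLOBS = ["ep-00-0*", "ep-0[123456789]*", "ep-[19]*"]


def covered(filename):
    """Return True if any of the build script's globs would match filename."""
    return any(fnmatchcase(filename, pattern) for pattern in GLOBS)


# Hypothetical filenames, assuming the ep-YY-MM-DD.txt naming scheme.
samples = [
    "ep-96-04-15.txt",
    "ep-00-09-21.txt",
    "ep-00-10-02.txt",
    "ep-05-06-07.txt",
    "ep-10-02-08.txt",
]

for name in samples:
    print(name, "matched" if covered(name) else "NOT matched")
```

If the naming assumption holds, a file whose name begins with ep-00-1
(October to December 2000) is matched by none of the three patterns,
which could explain text missing from the WMT file.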
Kenneth
On 01/06/15 14:24, Philipp Koehn wrote:
> Hi,
>
> the Perl script that was used to build this corpus is:
>
> #!/usr/bin/perl -w
>
> use strict;
> my ($l) = @ARGV;
>
> my $data = "/home/pkoehn/statmt/data/europarl-v7";
> my $tools = "/home/pkoehn/statmt/data/europarl-v7/tools";
> my $preprocessor = "$tools/split-sentences.perl -q";
>
> die("ERROR: no data for language $l") unless -e "$data/txt/$l";
> open(SPLIT,"cat $data/txt/$l/ep-00-0* $data/txt/$l/ep-0[123456789]* $data/txt/$l/ep-[19]* | $preprocessor -l $l |");
> while(<SPLIT>) {
> next if /^\s*$/;
> next if /^</;
> print $_;
> }
> close(SPLIT);
>
>
> The sentence splitting code is in the tools package that comes
> with the Europarl source release.
>
> -phi
>
> On Tue, Jan 6, 2015 at 11:09 AM, Kenneth Heafield <[email protected]> wrote:
>
> Dear Moses,
>
> Where does this data come from?
>
> http://www.statmt.org/wmt13/training-monolingual-europarl-v7.tgz
>
> Specifically, if I wanted non-WMT languages, then I can download
> Europarl from http://www.statmt.org/europarl/ .
>
> There are some tools, like a Perl script to strip XML, but that also
> strips out <P> tags, which are meant to be preserved for
> split-sentences.perl. And I don't think split-sentences.perl was
> designed to run before stripping XML, but I could be wrong.
>
> Does one write a custom XML strip program to remove all the tags
> except <P> then pass it to split-sentences.perl?
>
> Kenneth
> _______________________________________________
> Moses-support mailing list
> [email protected]
> http://mailman.mit.edu/mailman/listinfo/moses-support
>
>