Hi, this is done on purpose: the Q4 2000 data is used for test sets, so it is excluded from both the parallel and the monolingual training corpora.
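
As a rough illustration (not part of the build script itself), the three file globs passed to `cat` in the script quoted below can be checked against a few file names with a shell `case` statement. The `is_selected` helper and the sample file names are made up for this sketch; only the three patterns are taken from the script.

```shell
#!/bin/sh
# Sketch: classify a Europarl file name using the three glob patterns
# from the quoted build script. The helper name and the sample file
# names are hypothetical; only the patterns come from the script.
is_selected() {
    case "$1" in
        ep-00-0*|ep-0[123456789]*|ep-[19]*) echo "included" ;;
        *) echo "excluded" ;;
    esac
}

# Print one "name status" pair per sample file name.
for f in ep-00-09-21.txt ep-00-10-02.txt ep-00-12-15.txt \
         ep-01-01-15.txt ep-10-02-08.txt; do
    printf '%s %s\n' "$f" "$(is_selected "$f")"
done
```

With these patterns, names starting with `ep-00-1` (October to December 2000) match none of the three globs, so they fall through to `excluded`, while the other sample names are `included`.
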
-phi

On Tue, Jan 6, 2015 at 2:20 PM, Kenneth Heafield <[email protected]> wrote:
> Hi,
>
> It seems that the WMT release is missing data. For example, why does
> en/ep-10-02.txt line 4 contain "21 September 2000" but this does not
> appear in the WMT europarl-v7.en file from the WMT site?
>
> Kenneth
>
> On 01/06/15 14:24, Philipp Koehn wrote:
>> Hi,
>>
>> the Perl script that was used to build this corpus is:
>>
>> #!/usr/bin/perl -w
>>
>> use strict;
>> my ($l) = @ARGV;
>>
>> my $data = "/home/pkoehn/statmt/data/europarl-v7";
>> my $tools = "/home/pkoehn/statmt/data/europarl-v7/tools";
>> my $preprocessor = "$tools/split-sentences.perl -q";
>>
>> die("ERROR: no data for language $l") unless -e "$data/txt/$l";
>> open(SPLIT,"cat $data/txt/$l/ep-00-0* $data/txt/$l/ep-0[123456789]* $data/txt/$l/ep-[19]* | $preprocessor -l $l |");
>> while(<SPLIT>) {
>>   next if /^\s*$/;
>>   next if /^</;
>>   print $_;
>> }
>> close(SPLIT);
>>
>>
>> The sentence splitting code is in the tools package that comes
>> with the Europarl source release.
>>
>> -phi
>>
>> On Tue, Jan 6, 2015 at 11:09 AM, Kenneth Heafield <[email protected]> wrote:
>>
>>     Dear Moses,
>>
>>     Where does this data come from?
>>
>>     http://www.statmt.org/wmt13/training-monolingual-europarl-v7.tgz
>>
>>     Specifically, if I wanted non-WMT languages, then I can download
>>     Europarl from http://www.statmt.org/europarl/ .
>>
>>     There are some tools, like a perl script to strip XML, but that also
>>     strips out <P> tags which are meant to be preserved for
>>     split-sentences.perl. And I don't think split-sentences.perl was
>>     designed to run before stripping XML but could be wrong.
>>
>>     Does one write a custom XML strip program to remove all the tags except
>>     <P> then pass it to split-sentences.perl?
>>
>>     Kenneth
