Hi again,

        Sorry, never mind!  That period is held out for testing:

"We recommend using the last quarter of 2000 for testing (2000-10 until
2000-12) for consistency in reporting research results on this data."
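
For anyone comparing file lists, here is a minimal Python sketch of which
file names the three shell globs in the quoted build script would pick up.
The function name in_wmt_release and the sample file names are made up for
illustration (I am assuming an ep-YY-MM-DD.txt naming pattern), and fnmatch
only approximates shell globbing.

```python
from fnmatch import fnmatchcase

# The three glob patterns from the quoted Perl script (directory part omitted).
PATTERNS = ["ep-00-0*", "ep-0[123456789]*", "ep-[19]*"]


def in_wmt_release(filename):
    """Return True if a file name matches any of the build script's globs."""
    return any(fnmatchcase(filename, pattern) for pattern in PATTERNS)


# Hypothetical file names, assuming an ep-YY-MM-DD.txt naming pattern.
SAMPLES = [
    "ep-00-09-21.txt",
    "ep-00-10-02.txt",
    "ep-01-01-15.txt",
    "ep-96-04-15.txt",
]

for name in SAMPLES:
    status = "included" if in_wmt_release(name) else "skipped"
    print(name, status)
```

None of the patterns can match a name starting with ep-00-1, so anything
from 2000-10 through 2000-12 is skipped.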

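The quoted script also covers my original XML question: it pipes the txt
files through the sentence splitter first and only afterwards drops blank
lines and lines that start with "<".  A rough Python equivalent of that
final loop, where filter_split_output is a hypothetical name and the sample
lines are invented:

```python
import re


def filter_split_output(lines):
    """Yield lines that are neither blank nor start with '<'."""
    for line in lines:
        if re.match(r"^\s*$", line):  # Perl: next if /^\s*$/;
            continue
        if line.startswith("<"):      # Perl: next if /^</;
            continue
        yield line


# Invented sample of splitter output, for illustration only.
sample = [
    "<CHAPTER ID=1>\n",
    "Resumption of the session\n",
    "\n",
    "<P>\n",
    "I declare the session resumed.\n",
]

for kept in filter_split_output(sample):
    print(kept, end="")
```

So no custom tag stripper seems to be needed as long as every tag sits on
its own line, which is an assumption about the input on my part.
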
Kenneth

On 01/06/15 17:20, Kenneth Heafield wrote:
> Hi,
> 
>       It seems that the WMT release is missing data.  For example, why does
> en/ep-10-02.txt line 4 contain "21 September 2000" when that date does not
> appear in the europarl-v7.en file from the WMT site?
> 
> Kenneth
> 
> On 01/06/15 14:24, Philipp Koehn wrote:
>> Hi,
>>
>> the Perl script that was used to build this corpus is:
>>
>> #!/usr/bin/perl -w
>>
>> use strict;
>> my ($l) = @ARGV;
>>
>> my $data = "/home/pkoehn/statmt/data/europarl-v7";
>> my $tools = "/home/pkoehn/statmt/data/europarl-v7/tools";
>> my $preprocessor = "$tools/split-sentences.perl -q";
>>
>> die("ERROR: no data for language $l") unless -e "$data/txt/$l";
>> # Note: these globs skip ep-00-1* (2000-10 through 2000-12).
>> open(SPLIT,"cat $data/txt/$l/ep-00-0* $data/txt/$l/ep-0[123456789]* "
>>   . "$data/txt/$l/ep-[19]* | $preprocessor -l $l |");
>> while(<SPLIT>) {
>>   next if /^\s*$/;
>>   next if /^</;
>>   print $_;
>> }
>> close(SPLIT);
>>
>>
>> The sentence splitting code is in the tools package that comes
>> with the Europarl source release.
>>
>> -phi
>>
>> On Tue, Jan 6, 2015 at 11:09 AM, Kenneth Heafield wrote:
>>
>>     Dear Moses,
>>
>>             Where does this data come from?
>>
>>     http://www.statmt.org/wmt13/training-monolingual-europarl-v7.tgz
>>
>>     Specifically, if I want non-WMT languages, I can download
>>     Europarl from http://www.statmt.org/europarl/ .
>>
>>             There are some tools, like a Perl script to strip XML, but
>>     that also strips out <P> tags, which are meant to be preserved for
>>     split-sentences.perl.  And I don't think split-sentences.perl was
>>     designed to run before stripping XML, but I could be wrong.
>>
>>             Does one write a custom XML strip program to remove all the
>>     tags except <P> and then pass the result to split-sentences.perl?
>>
>>     Kenneth
>>     _______________________________________________
>>     Moses-support mailing list
>>     [email protected]
>>     http://mailman.mit.edu/mailman/listinfo/moses-support
