Re: [Moses-support] Europarl monolingual pipeline

Kenneth Heafield Tue, 06 Jan 2015 14:22:33 -0800

Hi,

        It seems that the WMT release is missing data.  For example, why does
line 4 of en/ep-10-02.txt contain "21 September 2000" when that string does
not appear in the europarl-v7.en file from the WMT site?
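
A minimal sketch of how this kind of check can be reproduced, assuming
Python 3 and UTF-8 text. The function names are made up for illustration
and are not part of the Europarl tools. It applies the same two filters as
the loop in the quoted script (skip blank lines, skip lines starting with
"<") and reports the line numbers where a phrase occurs:

```python
def kept_lines(lines):
    """Yield the lines the quoted Perl loop would print.

    Mirrors `next if /^\\s*$/` and `next if /^</`: blank lines and
    lines that start with '<' (markup lines) are dropped.
    """
    for line in lines:
        if not line.strip():
            continue
        if line.startswith("<"):
            continue
        yield line


def find_phrase(lines, phrase):
    """Return the 1-based numbers of the lines that contain `phrase`."""
    return [n for n, line in enumerate(lines, start=1) if phrase in line]


def phrase_survives(source_path, release_path, phrase):
    """Compare a source file with a released file for one phrase.

    Returns (hits in the filtered source, hits in the release). Both
    paths are supplied by the caller; nothing is assumed about where
    the Europarl data lives.
    """
    with open(source_path, encoding="utf-8") as src:
        source_hits = find_phrase(list(kept_lines(src)), phrase)
    with open(release_path, encoding="utf-8") as rel:
        release_hits = find_phrase(rel, phrase)
    return source_hits, release_hits
```

If the first list is non-empty and the second is empty, the phrase passed
the filters but did not reach the release. This only checks exact
substrings, so a phrase that the sentence splitter broke across two lines
would be reported as absent.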


Kenneth

On 01/06/15 14:24, Philipp Koehn wrote:
> Hi,
> 
> the Perl script that was used to build this corpus is:
> 
> #!/usr/bin/perl -w
> 
> use strict;
> my ($l) = @ARGV;
> 
> my $data = "/home/pkoehn/statmt/data/europarl-v7";
> my $tools = "/home/pkoehn/statmt/data/europarl-v7/tools";
> my $preprocessor = "$tools/split-sentences.perl -q";
> 
> die("ERROR: no data for language $l") unless -e "$data/txt/$l";
> open(SPLIT,"cat $data/txt/$l/ep-00-0* $data/txt/$l/ep-0[123456789]* $data/txt/$l/ep-[19]* | $preprocessor -l $l |");
> while(<SPLIT>) {
>   next if /^\s*$/;
>   next if /^</;
>   print $_;
> }
> close(SPLIT);
> 
> 
> The sentence splitting code is in the tools package that comes
> with the Europarl source release.
> 
> -phi
> 
> On Tue, Jan 6, 2015 at 11:09 AM, Kenneth Heafield <[email protected]> wrote:
> 
>     Dear Moses,
> 
>             Where does this data come from?
> 
>     http://www.statmt.org/wmt13/training-monolingual-europarl-v7.tgz
> 
>     Specifically, if I wanted non-WMT languages, then I can download
>     Europarl from http://www.statmt.org/europarl/ .
> 
>             There are some tools, like a Perl script to strip XML, but
>     that also strips out <P> tags, which are meant to be preserved for
>     split-sentences.perl.  And I don't think split-sentences.perl was
>     designed to run before stripping XML, but I could be wrong.
> 
>             Does one write a custom XML strip program to remove all the
>     tags except <P>, then pass the output to split-sentences.perl?
> 
>     Kenneth
>     _______________________________________________
>     Moses-support mailing list
>     [email protected]
>     http://mailman.mit.edu/mailman/listinfo/moses-support
> 
> 
