I already answered this question in another post. Apologies for double posting.
Here is the code I used for filtering. I filtered based on the fourth score
only.
#!/usr/bin/perl
#
# Program filters a phrase table to leave only the phrase pairs
# whose fourth score is above a threshold
#
use strict;
use warnings;
use Getopt::Long;

my $min;
my $phrase_table;
my $filtered_table;

GetOptions( 'min=f' => \$min,
            'out=s' => \$filtered_table,
            'in=s'  => \$phrase_table );

# use defined() so that a threshold of 0 is accepted
die "ERROR: must give threshold and phrase table input file and output file\n"
    unless (defined $min && defined $phrase_table && defined $filtered_table);
die "ERROR: file $phrase_table does not exist\n" unless (-e $phrase_table);

open (my $in_fh, '<', $phrase_table)
    or die "FATAL: Could not open phrase table $phrase_table: $!\n";
open (my $out_fh, '>', $filtered_table)
    or die "FATAL: Could not open output file $filtered_table: $!\n";

while (my $line = <$in_fh>)
{
    chomp $line;
    my @columns = split /\|\|\|/, $line;

    # check that the file is a well-formatted phrase table
    if (scalar @columns < 4)
    {
        die "ERROR: input file is not a well-formatted phrase table. A "
          . "phrase table must have at least four columns, each column "
          . "separated by |||\n";
    }

    # get the fourth score and keep the pair if it is above the threshold;
    # split ' ' skips the leading space of the score column, so index 3
    # really is the fourth score
    my @scores = split ' ', $columns[2];
    if ($scores[3] > $min)
    {
        print $out_fh $line, "\n";
    }
}

close ($in_fh);
close ($out_fh) or die "FATAL: Could not close $filtered_table: $!\n";
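
Since the script picks a score by its index after splitting the score column on whitespace, here is a minimal sketch of how the split pattern affects which score ends up at index 3. The phrase-table line and its scores are made up for illustration only.

```perl
use strict;
use warnings;

# Hypothetical phrase-table line; the phrases and scores are made up.
my $line = 'das haus ||| the house ||| 0.6 0.5 0.8 0.7 ||| 0-0 1-1 ||| 10 8 8';

my @columns = split /\|\|\|/, $line;

# split ' ' ignores leading whitespace, so the first score is at index 0.
my @scores = split ' ', $columns[2];

# split /\s+/ keeps an empty leading field, shifting every score by one.
my @shifted = split /\s+/, $columns[2];

print "index 3 with split ' ':   $scores[3]\n";    # prints 0.7 (fourth score)
print "index 3 with split /\\s+/: $shifted[3]\n";   # prints 0.8 (third score)
```

The score column starts with a space after the ||| separator, which is why the two patterns disagree.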
________________________________________
From: [email protected] <[email protected]> on behalf
of Rico Sennrich <[email protected]>
Sent: Wednesday, June 17, 2015 7:17 PM
To: [email protected]
Subject: Re: [Moses-support] Major bug found in Moses
Read, James C <jcread@...> writes:
>
> Actually the approximation I expect to be:
>
> p(e|f)=p(f|e)
>
> Why would you expect this to give poor results if the TM is well trained?
> Surely the results of my filtering experiments prove otherwise.
>
> James
I recommend you read the following:
https://en.wikipedia.org/wiki/Confusion_of_the_inverse
You don't explain which score you use for filtering (do you take one of the
scores, their sum, their product, or something else?), but I expect you
(mostly) keep the phrase pairs with a high p(e|f), which is the best thing
to do when you don't have a language model.
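
As a rough illustration of why p(e|f) and p(f|e) cannot be assumed equal, here is a minimal sketch that estimates both by relative frequency. The co-occurrence counts and the phrase pairs are invented for this example and do not come from any real phrase table.

```perl
use strict;
use warnings;

# Made-up co-occurrence counts for (source f, target e) phrase pairs.
my %count = (
    'maison' => { 'house' => 90, 'home' => 10 },
);

# Marginal counts for each source phrase and each target phrase.
my (%f_total, %e_total);
for my $f (keys %count) {
    for my $e (keys %{ $count{$f} }) {
        $f_total{$f} += $count{$f}{$e};
        $e_total{$e} += $count{$f}{$e};
    }
}

# Relative-frequency estimates of the two conditional probabilities.
sub p_e_given_f { my ($e, $f) = @_; return $count{$f}{$e} / $f_total{$f}; }
sub p_f_given_e { my ($f, $e) = @_; return $count{$f}{$e} / $e_total{$e}; }

printf "p(home|maison) = %.2f\n", p_e_given_f('home', 'maison');  # 0.10
printf "p(maison|home) = %.2f\n", p_f_given_e('maison', 'home');  # 1.00
```

With these toy counts the same phrase pair scores 0.10 in one direction and 1.00 in the other, so filtering on one score is not interchangeable with filtering on the other.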
_______________________________________________
Moses-support mailing list
[email protected]
http://mailman.mit.edu/mailman/listinfo/moses-support