hey arda
fyi, below
-------- Original Message --------
Subject: Re: R: R: Alignment information in phrase table?
Date: Sun, 18 Jul 2010 21:08:26 +0100
From: Hieu Hoang <[email protected]>
To: Christian Hardmeier <[email protected]>
CC: [email protected] <[email protected]>, Yu Chen <[email protected]>,
Philipp Koehn <[email protected]>, Andreas Eisele <[email protected]>,
Nicola Bertoldi <[email protected]>, Philip Williams
<[email protected]>, [email protected], [email protected]
hi guys
christian & i were talking at ACL & thought it would be a good idea to
put the alignment info back into the phrase table. This time, we've
thought a little about it and will try to support it so it doesn't fall out again.
when you run the ./score part of the phrase extraction, add the argument
--WordAlignment
and it'll copy the alignment info from the giza++ files to the phrase table.
The format of the final phrase table has changed a bit, so apologies if
that messes up your scripts. The old format was meant to do some other
things, and we didn't think about memory consumption or speed. This new
format is simpler & should be ok, and is also the same as the
moses-chart decoding format so you can swap in the hiero/syntax stuff
with little effort.
It's now:
source ||| target ||| alignment ||| scores ||| counts
eg.
Mushariff letzer Act ? ||| Mushariff 's last act ? ||| 0-0 0-1 1-2 2-3 3-4 ||| 0.5 0.414 0.2343 0.2354 2.718 ||| 14 12
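A minimal sketch of how a script might read one line of the new format. It assumes the five fields are separated by "|||" exactly as shown above and that each alignment point is written source-target (which fits the example, where source word 0 links to target words 0 and 1). The names PhraseTableEntry and parse_phrase_table_line are made up for illustration and are not part of Moses.

```python
from typing import NamedTuple


class PhraseTableEntry(NamedTuple):
    """One phrase table line, split into its five fields (hypothetical type)."""
    source: list[str]
    target: list[str]
    alignment: list[tuple[int, int]]  # assumed to be (source_index, target_index)
    scores: list[float]
    counts: list[float]


def parse_phrase_table_line(line: str) -> PhraseTableEntry:
    """Parse 'source ||| target ||| alignment ||| scores ||| counts'."""
    fields = [field.strip() for field in line.split("|||")]
    if len(fields) != 5:
        raise ValueError(f"expected 5 fields, found {len(fields)}")
    source, target, alignment, scores, counts = fields

    points = []
    for point in alignment.split():
        src_idx, tgt_idx = point.split("-")
        points.append((int(src_idx), int(tgt_idx)))

    return PhraseTableEntry(
        source=source.split(),
        target=target.split(),
        alignment=points,
        scores=[float(x) for x in scores.split()],
        counts=[float(x) for x in counts.split()],
    )


entry = parse_phrase_table_line(
    "Mushariff letzer Act ? ||| Mushariff 's last act ? ||| "
    "0-0 0-1 1-2 2-3 3-4 ||| 0.5 0.414 0.2343 0.2354 2.718 ||| 14 12"
)
print(entry.alignment)
```

Scripts that split on whitespace and expect the scores in the third field would need a change along these lines.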
let's see what happens. we should let the wider audience know once you
guys kick the tires on it.
On 09/07/2010 13:47, Christian Hardmeier wrote:
You're right about that thread, strange. I wonder if this wasn't fixed at some
point. I hope it was, at least, because I never noticed anything like that. :-)
Anyhow I'm sure that memscore doesn't do this.
I regularly use the word alignments, so if there's a serious problem, I hope
I'll notice, and I will need to fix it, I suppose... Would be cool if you could
add it back in!
I'm coming to ACL, so we can talk about details next week.
Thanks,
Christian
________________________________________
From: Hieu Hoang [[email protected]]
Sent: Friday, 9 July 2010 14:28
To: Christian Hardmeier
Cc: [email protected]; Yu Chen; Philipp Koehn; Andreas Eisele; Nicola Bertoldi
Subject: Re: R: Alignment information in phrase table?
i think you have a point, it's a popular & useful feature that should be
put back in.
btw, i found a thread about differences in training from 2 yrs ago:
http://article.gmane.org/gmane.comp.nlp.moses.user/1267
i'm not sure if the problem still exists, i was surprised by it as well,
but i'm reluctant to do anything that reduces performance.
i think adding it into the training as an option would be easy and i can
add it in. However, i don't use it so any problems would slip under my
radar. You guys want to look after it once it's in?
Are you coming to ACL? can talk about it then, or by skype afterwards if
we want to go ahead with it.
On 09/07/2010 10:03, Christian Hardmeier wrote:
Hi Hieu
I don't have much to say about word alignments in the decoder - since I've
found out that it's quite easy to obtain word alignments by putting the
alignment info in a second factor in the phrase table, I don't need special
code in the decoder to deal with this.
However, in my opinion removing the alignments from the training scripts was a
serious mistake. At the very least, they should be made optional. Why do you
want to remove working functionality that many people want to use (witness the
frequent requests for this feature on the mailing list) just because it may
produce slightly inferior probability estimates in a few cases? I'm completely
dependent on the word alignments for my current work, and if they're not output
any more, this means I can't upgrade to the latest trunk, which is a hassle.
By the way, I don't even think the problem with conflicting alignments really
exists. I'm sure you don't get duplicate entries in the phrase table if you use
memscore, and I would be rather surprised to find out you do with the classical
training code. In fact, this problem is discussed in my paper for the last MT
Marathon: http://www.mt-archive.info/MTMarathon-2010-Hardmeier.pdf
The second paragraph of section 2.2 tells you what memscore does when faced
with conflicting alignments:
"When a phrase pair occurs with different alignments in the input, the most
frequent alignment is output. Ties are broken arbitrarily."
I don't remember exactly what Philipp's scripts do, but I believe it's
something similar.
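The rule quoted from the paper can be sketched roughly as follows. This is not memscore's code, only an illustration of "most frequent alignment wins"; the function name and the toy input are assumptions.

```python
from collections import Counter, defaultdict


def most_frequent_alignments(extracted):
    """Map each (source, target) pair to its most frequently seen alignment.

    `extracted` is an iterable of (source, target, alignment) string triples,
    one per extracted phrase pair occurrence (hypothetical input layout).
    """
    alignment_counts = defaultdict(Counter)
    for source, target, alignment in extracted:
        alignment_counts[(source, target)][alignment] += 1

    # Counter.most_common breaks ties by first occurrence, which is one
    # possible reading of "ties are broken arbitrarily".
    return {
        pair: counts.most_common(1)[0][0]
        for pair, counts in alignment_counts.items()
    }


extracted = [
    ("a b", "A B", "0-0 1-1"),
    ("a b", "A B", "0-0 0-1 1-1"),
    ("a b", "A B", "0-0 1-1"),
]
best = most_frequent_alignments(extracted)
print(best)
```

Because the result is keyed on the phrase pair alone, each pair gets exactly one entry however many alignments it was seen with.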
The last paragraph of section 2.1 contains a discussion about the computation
of lexical weight scores in the presence of conflicting alignments. Here,
memscore behaves slightly differently from Philipp's scripts, but neither of
them outputs duplicate entries.
Couldn't you ask whoever removed word alignments from the training to roll
back this change please? If they absolutely don't want to have the alignments
for whatever reason, they should add a switch, but not just delete code some
people are using.
Cheers,
Christian
________________________________________
From: Hieu Hoang [[email protected]]
Sent: Thursday, 8 July 2010 13:05
To: [email protected]
Cc: Yu Chen; Philipp Koehn; Christian Hardmeier; Andreas Eisele; Nicola Bertoldi
Subject: Re: Alignment information in phrase table?
Hi Tracey
there were problems with memory consumption & slowness in the decoder.
Josh tried to contain that about 2 yrs ago by only loading alignment
when it was needed.
http://mosesdecoder.svn.sourceforge.net/viewvc/mosesdecoder?view=revision&sortby=file&revision=1941
however, the implementation was still unnecessarily memory hungry & slow.
We also noticed that there were small differences in the training
routine with the alignment info. I can't remember the details & i can't
find the emails, but it goes something like:
If there are 2 phrase pairs in the training corpus that have exactly
the same source & target, but only differ in the alignment, eg
a b ||| A B ||| 0-0 1-1
a b ||| A B ||| 0-0 0-1 1-1
then the training routines will create 2 entries in the phrase table.
This makes decoding slightly worse so we rolled it back too.
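A toy illustration of how such duplicates could come about, assuming (the details are not recalled above) that the alignment string ended up as part of the key under which phrase pairs were counted. None of this is the actual training code.

```python
from collections import Counter

# The two extracted occurrences from the example above.
extracted = [
    ("a b", "A B", "0-0 1-1"),
    ("a b", "A B", "0-0 0-1 1-1"),
]

# Keyed on (source, target, alignment): the pair is split into two entries,
# each holding only part of the pair's total count.
with_alignment = Counter(extracted)

# Keyed on (source, target) only: one entry with the full count.
without_alignment = Counter((source, target) for source, target, _ in extracted)

print(len(with_alignment), len(without_alignment))
```

With the alignment in the key, the count mass for the pair is split across the two entries, which would shift any probability estimated from those counts.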
the main problem with it was that nicola & i had a hand in it but
neither of us really looked after the code. When bugs were found, it was
easier to roll back than fix the problem.
the current decoder once again carries alignment info, which is used to
store the co-index for the hiero/syntax system, but it can store word
level alignment too. It's built into the new on-disk pt format, and
isn't too memory hungry.
i think it'll be nice to have the alignment info back in. But someone
has to take charge and be prepared to fix it if bugs get found
On 08/07/2010 10:57, Yu Chen wrote:
Dear Philipp and Hieu,
I just noticed the model training script in moses no longer outputs the
best alignment information for non-hierarchical phrase pairs in the
phrase table. (line 447 in
$MOSESTRUNK/scripts/training/phrase-extract/score.cpp) Besides, the
options for the decoder to print out the word alignment information
have been disabled for a while.
(http://www.statmt.org/moses/?n=Moses.AdvancedFeatures#ntoc6)
Is there a particular reason for doing so? I figured it would be better to
ask you before sending this question to the mailing list. This function
is fairly essential in our hybrid setup. Although we still have the
previous version, it would be problematic for us to try out new
features in moses. Looking forward to your answers! :)
Best regards,
Yu
_______________________________________________
Moses-support mailing list
[email protected]
http://mailman.mit.edu/mailman/listinfo/moses-support