After GIZA++ is done iterating, I normally use the phrase tables for various
purposes. I am currently working on a small translation project and I'd like
to retrieve parallel sentences for translations with low probabilities. So
I look up in the phrase table translation equivalents with low - but still
valid - probability, and my aim is to retrieve those sentence pairs in which
the source-language sentence includes the source phrase and the respective
target-language sentence contains the lower-probability translation.
As an example, consider the preposition 'fii' in Arabic, which
normally translates to 'in', then to NULL, then to 'on', 'at', 'by', 'into'
and more (trained on the UN corpus). How would you retrieve lines where the
source sentences include 'fii' and the target ones have NULL or 'at' but not
'in'? Since 'in' is very common in general, searching for lines which *do
not* include 'in' but do include 'at' or NULL yields results with very low
recall. It is also not necessarily precise, since these low-probability
equivalents can just as well be translations of something else (and they
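
To make the naive search concrete, this is roughly the filter I have in
mind, as a minimal Python sketch. The function name, the placeholder source
tokens and the toy English sentences are made up for illustration, and it
assumes tokenized, line-aligned source and target text. NULL has no surface
form, so it cannot be searched for this way at all.

```python
def naive_filter(src_lines, tgt_lines, src_word, wanted, unwanted):
    """Yield (line_no, src, tgt) for sentence pairs where the source
    contains src_word and the target contains at least one wanted token
    and none of the unwanted tokens. Pure co-occurrence, no alignment."""
    wanted = set(wanted)
    unwanted = set(unwanted)
    for line_no, (src, tgt) in enumerate(zip(src_lines, tgt_lines), start=1):
        if src_word not in src.split():
            continue
        tgt_tokens = set(tgt.lower().split())
        if tgt_tokens & unwanted:
            continue  # 'in' anywhere in the target sentence rejects the pair
        if tgt_tokens & wanted:
            yield line_no, src.strip(), tgt.strip()


# Hypothetical toy data: placeholder source tokens around 'fii'.
src = ["a fii b", "c fii d", "e f g", "h fii i"]
tgt = ["he is in the house", "they met at noon", "look at this", "at home in time"]

hits = list(naive_filter(src, tgt, "fii", wanted={"at"}, unwanted={"in"}))
for hit in hits:
    print(hit)
```

Pair 4 is rejected only because 'in' occurs somewhere in the target
sentence, which is the recall problem, and nothing in the filter checks
that 'at' in pair 2 actually corresponds to 'fii', which is the precision
problem.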
I was wondering:
(a) whether in one of the d-output files there is information about the
most recent probability update and which sentences these updates were taken
(b) whether using this method to retrieve the desired sentences is viable?
(c) are there better ways to go about it?
(d) is anyone interested in collaborating on this project, which I can elaborate in
Thanks a lot,
Moses-support mailing list