[Moses-support] GIZA++ word alignment with paragraphs

John Thompson Wed, 08 Jul 2020 18:22:22 -0700

Hi,

I'm using a 7162-line paragraph-aligned corpus. Unfortunately, the
translations within a paragraph don't always have their sentences aligned:
in one language a passage could be one long sentence, and in the other
language the same passage could have its clauses broken up into multiple
sentences. Hence I'm running GIZA++ on paragraphs.


It works partially, but the word alignment is often wrong, or it misses
matches that should have been made.

I set the "maxsentencelength" configuration file parameter to 350, though
most of the paragraphs are around 100 or fewer words.

Q1: What difference do you estimate I should expect between using
paragraphs vs. sentences?

Q2: Are there GIZA++ parameters I could tune to improve the alignment?

Q3: If I concatenated multiple corpora, would the alignment output likely
improve?

I could preprocess the corpus, breaking up the paragraphs where the number
of sentences matches, but there may be some cases where the sentences still
don't align: multiple sentences within the paragraph may have been joined
or split differently, such that the sentence count of the paragraph is the
same on both sides even though the individual sentences don't correspond.
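
To make the preprocessing idea concrete, here is a rough sketch of what I
have in mind. The function names are hypothetical, and the sentence splitter
is a naive assumption (split after ".", "!" or "?" followed by whitespace),
so it would need adjusting for abbreviations and for Marshallese
punctuation. Paragraphs whose sentence counts differ are kept whole.

```python
import re

# Assumption: a sentence ends at ".", "!" or "?" followed by whitespace.
_SENTENCE_END = re.compile(r'(?<=[.!?])\s+')


def split_sentences(paragraph):
    """Naively split a paragraph into sentences."""
    return [s for s in _SENTENCE_END.split(paragraph.strip()) if s]


def split_matching_paragraphs(src_paragraphs, tgt_paragraphs):
    """Return (source, target) pairs for the aligner.

    A paragraph pair is broken into sentence pairs only when both sides
    have the same number of sentences; otherwise it is kept as one pair.
    Equal counts do not guarantee that the sentences really correspond.
    """
    pairs = []
    for src, tgt in zip(src_paragraphs, tgt_paragraphs):
        src_sents = split_sentences(src)
        tgt_sents = split_sentences(tgt)
        if src_sents and len(src_sents) == len(tgt_sents):
            pairs.extend(zip(src_sents, tgt_sents))
        else:
            pairs.append((src.strip(), tgt.strip()))
    return pairs
```

This only checks the counts, so the mismatched cases described above would
still slip through as bad sentence pairs.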

Q4: How big an effect would these bad sentence alignments have on the rest
of the alignments?

Q5: Any ideas for how to get better word alignment with these corpora that
I have, either with GIZA++ or a different tool?

I'm using the word alignment in a language study tool. For example, I have
the text for a book in both English and Marshallese, but language resources
for Marshallese are scarce. In my tool I associate the alignment
information with the text, and also generate a dictionary using the
alignment output (or optionally the *.dict.actual.ti.final dictionary list
output). On one page I show the aligned sentences, and on another you can
click on words to get both the alignment definition and the dictionary
definitions. For each source word dictionary entry, I sort the target
definitions by descending frequency (or probability if using the
*.dict.actual.ti.final dictionary list output), and then chop off the list
after a certain number, as otherwise there will be a lot of bad or spurious
definitions included.
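
Roughly, the dictionary step looks like the sketch below. This is a
simplified illustration, not my actual tool code: build_dictionary and
top_n are hypothetical names, and the input is assumed to already be
(source word, target word, score) triples, where the score is either an
alignment count or a probability from the dictionary list output.

```python
from collections import defaultdict


def build_dictionary(entries, top_n=5):
    """Group candidate definitions by source word and keep the best few.

    entries: iterable of (source_word, target_word, score) triples.
    top_n:   assumed cutoff; lower-ranked candidates are dropped because
             they are mostly bad or spurious.
    """
    by_source = defaultdict(list)
    for source, target, score in entries:
        by_source[source].append((target, score))

    return {
        source: sorted(cands, key=lambda c: c[1], reverse=True)[:top_n]
        for source, cands in by_source.items()
    }
```

A fixed cutoff is the simplest choice; a minimum score threshold would be
another way to trim the list.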

Thanks!

-John

-- 
John Thompson
[email protected]
https://www.jtlanguage.com
1-909-283-4364 (home)
1-909-283-5642 (cell)

_______________________________________________
Moses-support mailing list
[email protected]
http://mailman.mit.edu/mailman/listinfo/moses-support
