Hi, I'm using a 7162-line paragraph-aligned corpus. Unfortunately, the translations within a paragraph are not always aligned sentence by sentence: one language may use a single long sentence where the other breaks the clauses up into several sentences. Hence I'm running GIZA++ on whole paragraphs.
It works partially, but the word alignment is often wrong, or it misses matches that should have been made. I set the "maxsentencelength" configuration file parameter to 350, though most of the paragraphs are around 100 words or fewer.

Q1: What difference do you estimate I should expect between using paragraphs vs. sentences?

Q2: Are there GIZA++ parameters I could tune to improve the alignment?

Q3: If I concatenated multiple corpora, would the alignment output likely improve?

I could preprocess the corpus, breaking up the paragraphs where the number of sentences matches on both sides. However, there may be cases where sentences within a paragraph were joined or split differently in each language, so that the sentence counts are the same but the sentences still don't align.

Q4: How big an effect would these bad sentence alignments have on the rest of the alignments?

Q5: Any ideas for how to get better word alignment with the corpora I have, either with GIZA++ or a different tool?

I'm using the word alignment in a language study tool. For example, I have the text of a book in both English and Marshallese, but language resources for Marshallese are scarce. In my tool I associate the alignment information with the text, and also generate a dictionary from the alignment output (or optionally from the *.dict.actual.ti.final dictionary list output). On one page I show the aligned sentences, and on another you can click on words to get both the alignment definition and the dictionary definitions. For each source word dictionary entry, I sort the target definitions by descending frequency (or by probability if using the *.dict.actual.ti.final dictionary list output), and then cut the list off after a certain number of entries, as otherwise a lot of bad or spurious definitions would be included.

Thanks!

-John

--
John Thompson
[email protected]
https://www.jtlanguage.com
1-909-283-4364 (home)
1-909-283-5642 (cell)
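P.S. To make the preprocessing and dictionary steps described above more concrete, here is a minimal Python sketch of the idea. It is an illustration only, not the actual tool code: the function names, the naive punctuation-based sentence splitter, and the assumption that the alignment output has already been parsed into (source_word, target_word) pairs are all hypothetical.

```python
import re
from collections import Counter, defaultdict

# Naive splitter (an assumption): break on whitespace that follows ., ! or ?.
# A real corpus may need language-specific sentence segmentation.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def split_sentences(paragraph):
    """Return the non-empty sentences of a paragraph."""
    return [s for s in _SENTENCE_END.split(paragraph.strip()) if s]


def split_matching_paragraphs(paragraph_pairs):
    """Split a paragraph pair into sentence pairs only when both sides
    have the same number of sentences; otherwise keep the pair whole.

    Equal counts do not guarantee the sentences really correspond
    (that is exactly the worry behind Q4)."""
    result = []
    for src, tgt in paragraph_pairs:
        src_sents = split_sentences(src)
        tgt_sents = split_sentences(tgt)
        if len(src_sents) == len(tgt_sents):
            result.extend(zip(src_sents, tgt_sents))
        else:
            result.append((src, tgt))
    return result


def build_dictionary(aligned_word_pairs, max_entries=3):
    """Map each source word to its most frequently aligned target words,
    sorted by descending count and cut off after max_entries."""
    counts = defaultdict(Counter)
    for src_word, tgt_word in aligned_word_pairs:
        counts[src_word][tgt_word] += 1
    return {
        src: [tgt for tgt, _ in counter.most_common(max_entries)]
        for src, counter in counts.items()
    }
```

The cutoff in build_dictionary is a fixed count here; a probability threshold could be used instead when working from the *.dict.actual.ti.final list, but that variant is not shown.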
_______________________________________________
Moses-support mailing list
[email protected]
http://mailman.mit.edu/mailman/listinfo/moses-support
