[Moses-support] Pre- and post-processing of corpus files: Alignment

Daniel Schaut Wed, 26 Oct 2011 15:14:19 -0700

Hi all,

I've got two quick questions regarding the data structure of a prepared
parallel corpus before and after alignment. I'm a bit confused about the
term "alignment" and about how the data should be organized before calling
train-model.perl. Here is an example of my pre-processed corpus (markup
removed, character count limited, sentence-split, lowercased and tokenized)
to illustrate my situation:


http://www.statmt.org/moses/?n=FactoredTraining.PrepareTraining reads
"Training data has to be provided sentence aligned (one sentence per line),
in two files, one for the foreign sentences, one for the English sentences."

followed by an example that looks like example A.

Example A: Data structure of a sentence-split corpus

File src                            File tgt
abc def ghi , jkl mno pqr .         abc def ghi , jkl mno pqr .
dfg fgd dfdf kuki i.                fgfdg fgfg zuz ycvb .
trtrt jjkhkj uzu dhfg jgjgfj .      Fbfgjgj gjhgjg jkhkh hkjl .
...                                 ...
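
As a minimal sketch of how I read the layout of example A: line i of the
source file is taken to correspond to line i of the target file, so both
files must have the same number of lines. The function name pair_lines is
hypothetical and not part of Moses; it only illustrates the assumed
line-by-line correspondence.

```python
from itertools import zip_longest


def pair_lines(src_lines, tgt_lines):
    """Pair line i of the source with line i of the target.

    Hypothetical helper, not part of Moses. Both arguments are iterables
    of strings, e.g. open file objects or lists of lines.
    """
    pairs = []
    for number, (src, tgt) in enumerate(zip_longest(src_lines, tgt_lines), start=1):
        # zip_longest pads the shorter input with None, which signals
        # that the two sides do not have the same number of lines.
        if src is None or tgt is None:
            raise ValueError(f"line count mismatch at line {number}")
        pairs.append((src.rstrip("\n"), tgt.rstrip("\n")))
    return pairs
```

With two open files (the file names are up to you), this could be called as
pair_lines(open(src_path), open(tgt_path)); a ValueError would then point to
the first line where one side is missing.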

That's perfectly clear, but reading on, I stumbled over

http://www.statmt.org/moses/?n=FactoredTraining.PrepareData which reads
"The sentence-aligned corpus now looks like this:"

followed by an example that is similar to example B.

Example B: Data structure of a sentence-aligned file

Aligned file    
SEN ID 1        
23 343 4343 34343 3434 12       
656 65654 3243 565 12   
SEN ID 2        
454 5656 89898 5454 12  
435325 5646 878 12      

Furthermore, the section "Sentence splitter" of the README downloaded from
www.statmt.org/europarl/v6/tools.tgz reads
"Uses punctuation and Capitalization clues to split paragraphs of 
sentences into files with one sentence per line. For example:

This is a paragraph. It contains several sentences. "But why," you ask?

goes to:

This is a paragraph.
It contains several sentences.
"But why," you ask?"
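
To make the splitting step concrete, here is a rough sketch of a splitter
based on punctuation and capitalization clues. It is not the Europarl
script itself; the regular expression and the name split_sentences are my
own assumptions, and it will mis-split abbreviations such as "Dr. Smith".

```python
import re

# Assumed boundary rule: sentence-final punctuation, then whitespace,
# then a capital letter or an opening quote.
_BOUNDARY = re.compile(r'(?<=[.!?])\s+(?=["\'A-Z])')


def split_sentences(paragraph):
    """Split a paragraph into a list of sentences, one per item.

    Hypothetical helper for illustration only.
    """
    return [part.strip() for part in _BOUNDARY.split(paragraph) if part.strip()]


if __name__ == "__main__":
    text = 'This is a paragraph. It contains several sentences. "But why," you ask?'
    # Print one sentence per line, as in the README example.
    for sentence in split_sentences(text):
        print(sentence)
```

Note that this only produces one sentence per line within a single file; it
says nothing about which target sentence belongs to which source sentence.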

To conclude, does "... sentence aligned (one sentence per line), in two
files, ..." refer to another concept, namely sentence splitting? So, when
you speak of aligning a corpus at sentence level in order to train a
translation model with train-model.perl, are you referring to sentence
splitting (the data structure of example A) or to actual alignment at
sentence level (example B)?

Thanks a lot,

Daniel

_______________________________________________
Moses-support mailing list
[email protected]
http://mailman.mit.edu/mailman/listinfo/moses-support
