Daniel: The http://www.statmt.org/moses/?n=FactoredTraining.PrepareTraining page refers to preparing the parallel corpus files that become the input to the train-model.perl script before training. The "Sentence splitter" tool creates this type of data.
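To make the expected layout concrete, here is a minimal Python sketch of a sanity check for that input format: two plain-text files, one sentence per line, where line N of the source file pairs with line N of the target file. The function name read_parallel and the file names in the usage note are made up for illustration; nothing here is part of Moses itself.

```python
from itertools import zip_longest


def read_parallel(src_path, tgt_path):
    """Return (source, target) sentence pairs from two one-sentence-per-line files.

    Line N of src_path is paired with line N of tgt_path. A ValueError is
    raised if the two files do not have the same number of lines.
    """
    pairs = []
    with open(src_path, encoding="utf-8") as src, \
            open(tgt_path, encoding="utf-8") as tgt:
        # zip_longest pads the shorter file with None, which exposes a mismatch.
        for n, (s, t) in enumerate(zip_longest(src, tgt), start=1):
            if s is None or t is None:
                raise ValueError(
                    f"line counts differ: one file ends before line {n}"
                )
            pairs.append((s.strip(), t.strip()))
    return pairs
```

Calling read_parallel("corpus.src", "corpus.tgt") on two such files (hypothetical names) returns the sentence pairs in order, or fails early if the files have drifted out of step.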
The http://www.statmt.org/moses/?n=FactoredTraining.PrepareData page refers to sub-step one within the train-model.perl script during training. The term "sentence" in the context of a Moses parallel corpus corresponds to a segment whose <tu> tag in a TMX file carries the attribute segtype="sentence" or segtype="phrase".

Tom

On Thu, 27 Oct 2011 00:13:24 +0200, Daniel Schaut <[email protected]> wrote:
> Hi all,
>
> I've got two quick questions regarding the data structure of a prepared
> parallel corpus before and after an alignment process. I'm a bit confused
> on my side here regarding the term alignment and how the data structure
> should be organized accordingly to call train-model.perl. I'll put an
> example of my pre-processed corpus (without markup, limited char count,
> sentence-splitted, lowercased and tokenized) to illustrate my situation:
>
> http://www.statmt.org/moses/?n=FactoredTraining.PrepareTraining reads
> "Training data has to be provided sentence aligned (one sentence per
> line), in two files, one for the foreign sentences, one for the English
> sentences."
>
> followed by an example that looks like example A.
>
> Example A: Data structure of a sentence-splitted corpus
>
> File src                         File tgt
> abc def ghi , jkl mno pqr .      abc def ghi , jkl mno pqr .
> dfg fgd dfdf kuki i.             fgfdg fgfg zuz ycvb .
> trtrt jjkhkj uzu dhfg jgjgfj .   Fbfgjgj gjhgjg jkhkh hkjl .
> .                                .
>
> That's perfectly clear, but when continuing reading, I stumbled over
>
> http://www.statmt.org/moses/?n=FactoredTraining.PrepareData which reads
> "The sentence-aligned corpus now looks like this:"
>
> followed by an example that is similar to example B.
>
> Example B: Data structure of a sentence-aligned file
>
> Aligned file
> SEN ID 1
> 23 343 4343 34343 3434 12
> 656 65654 3243 565 12
> SEN ID 2
> 454 5656 89898 5454 12
> 435325 5646 878 12
>
> Furthermore, section "Sentence splitter" of README downloaded from
> www.statmt.org/europarl/v6/tools.tgz reads
> "Uses punctuation and Capitalization clues to split paragraphs of
> sentences into files with one sentence per line. For example:
>
> This is a paragraph. It contains several sentences. "But why," you ask?
>
> goes to:
>
> This is a paragraph.
> It contains several sentences.
> "But why," you ask?"
>
> To conclude, "... sentence aligned (one sentence per line), in two
> files, ..." refers to another concept, namely, sentence splitting???
> So, when speaking of aligning a corpus at sentence level in order to
> train a translation model with train-model.perl; are you referring to
> sentence splitting (data structure of example A) or actual alignment at
> sentence level (example B)???
>
> Thanks a lot,
>
> Daniel
>
> _______________________________________________
> Moses-support mailing list
> [email protected]
> http://mailman.mit.edu/mailman/listinfo/moses-support
