Re: [Moses-support] Pre- and post-processing of corpus files: Alignment

Tom Hoar Wed, 26 Oct 2011 17:40:58 -0700

 Daniel:

 The http://www.statmt.org/moses/?n=FactoredTraining.PrepareTraining
 page refers to preparing the parallel corpus files that become the
 input to the train-model.perl script before training. The "Sentence
 splitter" tool creates this type of data.
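
 As a rough illustration of what "two files, one sentence per line"
 means in practice, here is a minimal Python sketch that pairs the two
 files line by line and refuses to continue if their line counts
 differ. The function names and file paths are hypothetical and are not
 part of Moses; train-model.perl does its own processing.

```python
def read_lines(path):
    """Read a corpus file, returning one string per line without the newline."""
    with open(path, encoding="utf-8") as handle:
        return [line.rstrip("\n") for line in handle]


def read_parallel(src_path, tgt_path):
    """Pair line N of the source file with line N of the target file.

    The files are only usable as a parallel corpus if they have the same
    number of lines, so a mismatch is reported as an error.
    """
    src_lines = read_lines(src_path)
    tgt_lines = read_lines(tgt_path)
    if len(src_lines) != len(tgt_lines):
        raise ValueError(
            f"line counts differ: {len(src_lines)} vs {len(tgt_lines)}"
        )
    return list(zip(src_lines, tgt_lines))
```

 Calling read_parallel("corpus.src", "corpus.tgt") (placeholder file
 names) returns a list of (source, target) sentence pairs.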


 The http://www.statmt.org/moses/?n=FactoredTraining.PrepareData page
 refers to sub-step one within the train-model.perl script during
 training.
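
 To make the difference concrete, the following Python sketch shows
 roughly what such a preparation step does: it replaces every word by
 an integer ID and writes each sentence pair as a small block of lines.
 This is a simplified illustration, not the actual train-model.perl
 code. The three-line block with a leading "1", the reservation of ID 1
 for unknown words, and the order of the source and target lines are
 assumptions made for this example and should be checked against the
 real files.

```python
from collections import Counter


def build_vocab(sentences):
    """Assign integer IDs to tokens, most frequent first, starting at 2.

    ID 1 is kept free for unknown words (an assumption of this sketch).
    """
    counts = Counter(tok for sent in sentences for tok in sent.split())
    # Sort by descending frequency, then alphabetically for determinism.
    ordered = sorted(counts, key=lambda tok: (-counts[tok], tok))
    return {tok: idx for idx, tok in enumerate(ordered, start=2)}


def numberize(pairs, src_vocab, tgt_vocab):
    """Render sentence pairs as blocks of: count line, source IDs, target IDs."""
    blocks = []
    for src, tgt in pairs:
        src_ids = " ".join(str(src_vocab.get(tok, 1)) for tok in src.split())
        tgt_ids = " ".join(str(tgt_vocab.get(tok, 1)) for tok in tgt.split())
        blocks.append(f"1\n{src_ids}\n{tgt_ids}")
    return "\n".join(blocks)
```

 The input to this step is still the plain two-file corpus; the
 numbered form is produced by the training script and is not something
 you prepare by hand.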

 The term "sentence" in the context of a Moses parallel corpus equates
 to the segtype="sentence" and segtype="phrase" attribute values of a
 TMX file's <tu> tag.
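
 If your data starts out as TMX, the mapping can be sketched as
 follows: each <tu> contributes one line to each of the two corpus
 files. This is a minimal example using Python's standard
 xml.etree.ElementTree module; the function name and the language codes
 passed to it are hypothetical, and real TMX files may need extra
 handling (inline markup, more than two languages, language-code
 variants such as "en-US").

```python
import xml.etree.ElementTree as ET

# ElementTree expands the predefined xml: prefix to this namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def tmx_to_pairs(tmx_text, src_lang, tgt_lang):
    """Return (source, target) segment pairs, one per <tu> that has both languages."""
    root = ET.fromstring(tmx_text)
    pairs = []
    for tu in root.iter("tu"):
        segments = {}
        for tuv in tu.findall("tuv"):
            seg = tuv.find("seg")
            if seg is None:
                continue
            # Older TMX versions use a plain "lang" attribute instead of xml:lang.
            lang = tuv.get(XML_LANG) or tuv.get("lang")
            # Collapse whitespace so every segment fits on exactly one line.
            segments[lang] = " ".join("".join(seg.itertext()).split())
        if src_lang in segments and tgt_lang in segments:
            pairs.append((segments[src_lang], segments[tgt_lang]))
    return pairs
```

 Writing the first element of every pair to one file and the second to
 another, in the same order, gives the two line-aligned files.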


 Tom


 On Thu, 27 Oct 2011 00:13:24 +0200, Daniel Schaut 
 <[email protected]> wrote:
> Hi all,
>
> I've got two quick questions regarding the data structure of a
> prepared parallel corpus before and after an alignment process. I'm a
> bit confused about the term "alignment" and how the data should be
> organized in order to call train-model.perl. I'll give an example of my
> pre-processed corpus (markup removed, character count limited,
> sentence-split, lowercased and tokenized) to illustrate my situation:
>
> http://www.statmt.org/moses/?n=FactoredTraining.PrepareTraining reads
> "Training data has to be provided sentence aligned (one sentence per
> line), in two files, one for the foreign sentences, one for the English
> sentences."
>
> followed by an example that looks like example A.
>
> Example A: Data structure of a sentence-split corpus
> File src                            File tgt
> abc def ghi , jkl mno pqr .         abc def ghi , jkl mno pqr .
> dfg fgd dfdf kuki i.                fgfdg fgfg zuz ycvb .
> trtrt jjkhkj uzu dhfg jgjgfj .      Fbfgjgj gjhgjg jkhkh hkjl .
> ...                                 ...
>
> That's perfectly clear, but as I read on, I stumbled over
>
> http://www.statmt.org/moses/?n=FactoredTraining.PrepareData which reads
> "The sentence-aligned corpus now looks like this:"
>
> followed by an example that is similar to example B.
>
> Example B: Data structure of a sentence-aligned file
>
> Aligned file
> SEN ID 1
> 23 343 4343 34343 3434 12
> 656 65654 3243 565 12
> SEN ID 2
> 454 5656 89898 5454 12
> 435325 5646 878 12
>
> Furthermore, the "Sentence splitter" section of the README downloaded
> from www.statmt.org/europarl/v6/tools.tgz reads
> "Uses punctuation and Capitalization clues to split paragraphs of
> sentences into files with one sentence per line. For example:
>
> This is a paragraph. It contains several sentences. "But why," you ask?
>
> goes to:
>
> This is a paragraph.
> It contains several sentences.
> "But why," you ask?"
>
> To conclude, "... sentence aligned (one sentence per line), in two
> files, ..." refers to another concept, namely sentence splitting? So,
> when speaking of aligning a corpus at sentence level in order to train
> a translation model with train-model.perl, are you referring to
> sentence splitting (data structure of example A) or actual alignment
> at sentence level (example B)?
>
> Thanks a lot,
>
> Daniel
>
> _______________________________________________
> Moses-support mailing list
> [email protected]
> http://mailman.mit.edu/mailman/listinfo/moses-support
