Tom, Uli, thank you both. Very clear. Now, how important is it to have an "end of sentence" delimiter in the language model (I am not talking about the LF stuff)? Should each line in the LM training text end with a "." or an equivalent (exclamation mark, question mark, ...)? I have seen some LMs (especially for ASR) where the training text ends each line with a specific delimiter, </s>.

On 05/12/2015 03:39, Tom Hoar wrote:
> Here's another perspective. The concept of what should be translated
> as a "sentence" during production depends on the training data and
> tuning set that created the model. I like Ulrich's input. The period
> (question mark, exclamation mark, etc.) is just a token. The newline
> marker tells moses, "start translating a new job with all the tokens
> before me."
>
> Let's say you train your translation model with a parallel corpus broken
> down into paired part-of-speech phrases (noun phrases, verb phrases,
> object phrases, etc.). Then build your language model using the target
> half of the part-of-speech corpus. Finally, tune your SMT model using
> this TM/LM pair and a tuning set with part-of-speech pairs. Your
> translation production input should also be broken into those same
> part-of-speech phrases to achieve optimal results. With such a model,
> you will get degraded results if you translate a complete sentence or a
> paragraph (multiple sentences).
>
> Here's a modified approach. Train a translation model with the same
> part-of-speech parallel corpus. Then, use a different version of the
> target language corpus with complete sentences (i.e. broken by sentence
> breaks like full stops, question marks, etc.). Next, tune your SMT model
> with a tuning set of paired complete sentences that match the LM's
> breaks. The tuning process optimizes performance for that type of input.
> Therefore, your optimized translation results will mirror the LM corpus
> and matched tuning set. You will get degraded results if you translate
> part-of-speech phrases, multiple sentences or complete paragraphs.
>
> We call these "things" models because they're supposed to be a miniature
> representation of a larger universe. So, you'll always get the best
> results when your production input matches the input side of your tuning
> set.
>
> Re newline markers, I think Ulrich's "Mac: CR" is for the legacy Mac OS.
> The current OS X uses Posix/Linux LF. We have not tested our
> cross-platform updates with the older Mac CR and I suspect it will not
> work. So I suggest using either CRLF or LF, which we have used
> extensively across Windows and Posix systems.
>
> Tom
>
>
> On 12/5/2015 6:13 AM, [email protected] wrote:
>> Date: Fri, 4 Dec 2015 23:13:10 +0000
>> From: Ulrich Germann <[email protected]>
>> Subject: Re: [Moses-support] decoder question
>> To: Vincent Nguyen <[email protected]>
>> Cc: moses-support <[email protected]>
>>
>> Hi Vincent,
>>
>> as far as Moses is concerned, the end of a sentence is marked by whatever
>> the end-of-line marker is on the respective OS (Win: CRLF, Linux: LF, Mac:
>> CR, apparently). A period is treated as a plain old token. The purpose of
>> the sentence splitter that Kenneth mentioned is to tell Moses what the
>> "sentence" boundaries are.
>>
>> The language model has a concept of sentences beginning and ending and
>> usually doesn't like periods anywhere except at the end of a sentence, so
>> it'll down-vote translation hypotheses containing isolated periods.
>>
>> - Uli
> _______________________________________________
> Moses-support mailing list
> [email protected]
> http://mailman.mit.edu/mailman/listinfo/moses-support
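
To make the two points in the quoted messages concrete (the newline delimits a segment, while the period is only a token), here is a minimal Python sketch. It is not Moses code: the function names are hypothetical, the input is assumed to be already tokenized with spaces, and the <s> and </s> symbols are added only to illustrate the usual LM convention of marking where each line begins and ends.

```python
from typing import List


def normalize_newlines(text: str) -> str:
    """Convert CRLF and bare CR (legacy Mac) line endings to LF."""
    return text.replace("\r\n", "\n").replace("\r", "\n")


def to_segments(text: str) -> List[List[str]]:
    """Split on LF; each non-empty line becomes one segment of tokens.

    Punctuation is not treated specially: "." is a token like any other.
    """
    segments = []
    for line in normalize_newlines(text).split("\n"):
        tokens = line.split()
        if tokens:
            segments.append(tokens)
    return segments


def with_boundaries(tokens: List[str]) -> List[str]:
    """Wrap one segment in sentence-boundary symbols (illustration only)."""
    return ["<s>"] + tokens + ["</s>"]


# Mixed line endings: CRLF after the first line, bare CR after the second.
raw = "Hello world .\r\nHow are you ?\rFine , thanks ."
for segment in to_segments(raw):
    print(" ".join(with_boundaries(segment)))
```

The segment boundaries come only from the line breaks. The final "." or "?" stays inside the segment as an ordinary token, sitting just before the end marker.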
