Hi Vincent,

most LM training pipelines add the <s> and </s> (or whatever other symbols
are used) by themselves; do not add them to the input unless the
instructions for the particular LM specifically tell you to.
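
To make that concrete, here is a toy sketch of what such a pipeline does internally when it collects n-gram counts. It is not the code of any particular LM toolkit; the function name count_bigrams and the symbol names are assumptions chosen for illustration.

```python
from collections import Counter

# Boundary symbols; the actual names depend on the toolkit (assumption).
BOS, EOS = "<s>", "</s>"

def count_bigrams(lines):
    """Count bigrams, padding every non-empty line with boundary symbols."""
    counts = Counter()
    for line in lines:
        tokens = line.split()
        if not tokens:
            continue  # blank lines contribute nothing
        padded = [BOS] + tokens + [EOS]
        # Pair each token with its successor, including the boundary symbols.
        counts.update(zip(padded, padded[1:]))
    return counts

if __name__ == "__main__":
    corpus = ["the cat sleeps .", "the dog barks ."]
    for bigram, n in sorted(count_bigrams(corpus).items()):
        print(bigram, n)
```

The training text itself contains no boundary symbols; the padding happens per line. If the input already contained <s> and </s>, a pipeline of this kind would treat them as ordinary words and pad around them a second time.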

Whether sentences in the LM training data should end in punctuation or not
is a matter of corpus cleanup: The LM training data should be as similar as
possible to what you would like your translation output to be.
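
One rough way to check this during cleanup is to measure how many lines of the LM data end in sentence-final punctuation and compare that with a sample of the output you want. The sketch below is only an illustration: the helper final_punct_ratio and the punctuation set are assumptions, and the set would need adjusting for other languages.

```python
# Assumed set of sentence-final punctuation; adjust per language.
FINAL_PUNCT = (".", "!", "?")

def final_punct_ratio(lines):
    """Return the fraction of non-empty lines ending in final punctuation."""
    nonempty = [ln.strip() for ln in lines if ln.strip()]
    if not nonempty:
        return 0.0
    # str.endswith accepts a tuple of candidate suffixes.
    hits = sum(1 for ln in nonempty if ln.endswith(FINAL_PUNCT))
    return hits / len(nonempty)

if __name__ == "__main__":
    lm_data = ["this is a sentence .", "a headline without a period", "really ?"]
    print(final_punct_ratio(lm_data))
```

If the ratio for the LM data is far from the ratio for the kind of text you want to produce, that suggests the LM data needs more cleanup or a different source.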

- Uli

On Sat, Dec 5, 2015 at 12:40 PM, Vincent Nguyen <[email protected]> wrote:

>
> Tom, Uli, thank you guys.
> Very clear.
>
> Now how important is it to have an "end of sentence" delimiter in the
> language model (not talking about the LF stuff)?
> Should each line in the LM end with a "." or equivalent (exclamation mark,
> question mark, ...)?
> I saw some LM (especially for ASR) where the training text ends each
> line with a specific delimiter </s>
>
>
>
>
> On 05/12/2015 03:39, Tom Hoar wrote:
> > Here's another perspective. The concept of what should be translated
> > as a "sentence" during production depends on the training data and
> > tuning set that created the model. I like Ulrich's input. The period,
> > question mark, exclamation mark, etc. are just tokens. The newline
> > marker tells Moses, "start translating a new job with all the tokens
> > before me."
> >
> > Let's say you train your translation model with a parallel corpus broken
> > down into paired part-of-speech phrases (noun phrases, verb phrases,
> > object phrases, etc.). Then build your language model using the target
> > half of the part-of-speech corpus. Finally, tune your SMT model using
> > this TM/LM pair and a tuning set with part-of-speech pairs. Your
> > translation production input should also be broken into those same
> > part-of-speech phrases to achieve optimal results. With such a model,
> > you will get degraded results if you translate a complete sentence or a
> > paragraph (multi-sentences).
> >
> > Here's a modified approach. Train a translation model with the same
> > part-of-speech parallel corpus. Then, use a different version of the
> > target language corpus with complete sentences (i.e. broken by sentence
> > breaks like full-stops, question marks, etc.). Next, tune your SMT model
> > with a tuning set of paired complete sentences that match the LM's
> > breaks. The tuning process optimizes performance for that type of input.
> > Therefore, your optimized translation results will mirror the LM corpus
> > and matched tuning set. You will get degraded results if you translate
> > part-of-speech phrases or multi-sentences or complete paragraphs.
> >
> > We call these things "models" because they're supposed to be a miniature
> > representation of a larger universe. So, you'll always get the best
> > results when your production input matches the input side of your tuning
> > set.
> >
> > Re newline markers, I think Ulrich's "Mac: CR" is for the legacy Mac OS.
> > The current OS X uses POSIX/Linux LF. We have not tested our
> > cross-platform updates with the older Mac CR and I suspect it will not
> > work. So I suggest using either CRLF or LF, which we have used
> > extensively across Windows and POSIX systems.
> >
> > Tom
> >
> >
> > On 12/5/2015 6:13 AM, [email protected] wrote:
> >> Date: Fri, 4 Dec 2015 23:13:10 +0000
> >> From: Ulrich Germann<[email protected]>
> >> Subject: Re: [Moses-support] decoder question
> >> To: Vincent Nguyen<[email protected]>
> >> Cc: moses-support<[email protected]>
> >>
> >> Hi Vincent,
> >>
> >> as far as Moses is concerned, the end of a sentence is marked by whatever
> >> the end-of-line marker is on the respective OS (Win: CRLF, Linux: LF, Mac:
> >> CR, apparently). A period is treated as a plain old token. The purpose of
> >> the sentence splitter that Kenneth mentioned is to tell Moses what the
> >> "sentence" boundaries are.
> >>
> >> The language model has a concept of sentences beginning and ending and
> >> usually doesn't like periods anywhere except at the end of a sentence, so
> >> it'll down-vote translation hypotheses containing isolated periods.
> >>
> >> - Uli
> > _______________________________________________
> > Moses-support mailing list
> > [email protected]
> > http://mailman.mit.edu/mailman/listinfo/moses-support


-- 
Ulrich Germann
Senior Researcher
School of Informatics
University of Edinburgh