Tom, Uli, thank you both.
Very clear.

Now, how important is it to have an "end of sentence" delimiter in the
language model (I am not talking about the LF stuff)?
Should each line in the LM end with a "." or an equivalent (exclamation
mark, question mark, ...)?
I have seen some LMs (especially for ASR) where each line of the training
text ends with a specific delimiter, </s>.
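
To make the question concrete, here is a minimal Python sketch of the two
kinds of boundary discussed in this thread. It is a toy bigram counter, not
Moses or any real LM toolkit. The names pad, bigram_counts and corpus are
made up for illustration. As far as I know, n-gram toolkits such as KenLM
and SRILM wrap each training line in <s> ... </s> themselves, which is the
behaviour the sketch imitates.

```python
from collections import Counter

# Conventional marker names; this sketch is not tied to any toolkit.
BOS, EOS = "<s>", "</s>"


def pad(line):
    """Wrap one whitespace-tokenized line in sentence-boundary markers."""
    return [BOS] + line.split() + [EOS]


def bigram_counts(lines):
    """Count bigrams over padded lines; blank lines are skipped."""
    counts = Counter()
    for line in lines:
        if not line.strip():
            continue
        tokens = pad(line)
        counts.update(zip(tokens, tokens[1:]))
    return counts


# Hypothetical three-line corpus: two full sentences and one bare phrase.
corpus = [
    "the cat sleeps .",
    "is it raining ?",
    "noun phrase without punctuation",
]
counts = bigram_counts(corpus)

# The period is an ordinary token that happens to precede the marker.
print(counts[(".", EOS)])
# A line without final punctuation still gets an end-of-sentence event.
print(counts[("punctuation", EOS)])
```

In this sketch the end-of-sentence event comes from the line break, not from
the punctuation, so a line does not need to end in "." for the model to see
a boundary. Whether the LM text should carry final punctuation at all then
becomes a question of matching the tuning set and the production input.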




On 05/12/2015 03:39, Tom Hoar wrote:
> Here's another perspective. The concept of what should be translated
> as a "sentence" during production depends on the training data and
> tuning set that created the model. I like Ulrich's input. Periods,
> question marks, exclamation marks, etc. are just tokens. The newline
> marker tells moses, "start translating a new job with all the tokens
> before me."
>
> Let's say you train your translation model with a parallel corpus broken
> down into paired part-of-speech phrases (noun phrases, verb phrases,
> object phrases, etc.). Then build your language model using the target
> half of the part-of-speech corpus. Finally, tune your SMT model using
> this TM/LM pair and a tuning set with part-of-speech pairs. Your
> translation production input should also be broken into those same
> part-of-speech phrases to achieve optimal results. With such a model,
> you will get degraded results if you translate a complete sentence or a
> paragraph (multiple sentences).
>
> Here's a modified approach. Train a translation model with the same
> part-of-speech parallel corpus. Then, use a different version of the
> target language corpus with complete sentences (i.e. broken by sentence
> breaks like full-stops, question marks, etc.). Next, tune your SMT model
> with a tuning set of paired complete sentences that match the LM's
> breaks. The tuning process optimizes performance for that type of input.
> Therefore, your optimized translation results will mirror the LM corpus
> and matched tuning set. You will get degraded results if you translate
> part-of-speech phrases, multiple sentences, or complete paragraphs.
>
> We call these "things" models because they're supposed to be a miniature
> representation of a larger universe. So, you'll always get the best
> results when your production input matches the input side of your tuning
> set.
>
> Re newline markers, I think Ulrich's "Mac: CR" is for the legacy Mac OS.
> The current OS X uses POSIX/Linux LF. We have not tested our
> cross-platform updates with the older Mac CR and I suspect it will not
> work. So I suggest using either CRLF or LF, which we have used
> extensively across Windows and POSIX systems.
>
> Tom
>
>
> On 12/5/2015 6:13 AM, [email protected] wrote:
>> Date: Fri, 4 Dec 2015 23:13:10 +0000
>> From: Ulrich Germann<[email protected]>
>> Subject: Re: [Moses-support] decoder question
>> To: Vincent Nguyen<[email protected]>
>> Cc: moses-support<[email protected]>
>>
>> Hi Vincent,
>>
>> as far as Moses is concerned, the end of a sentence is marked by whatever
>> the end-of-line marker is on the respective OS (Win: CRLF, Linux: LF, Mac:
>> CR, apparently). A period is treated as a plain old token. The purpose of
>> the sentence splitter that Kenneth mentioned is to tell Moses what the
>> "sentence" boundaries are.
>>
>> The language model has a concept of sentences beginning and ending and
>> usually doesn't like periods anywhere except at the end of a sentence, so
>> it'll down-vote translation hypotheses containing isolated periods.
>>
>> - Uli
> _______________________________________________
> Moses-support mailing list
> [email protected]
> http://mailman.mit.edu/mailman/listinfo/moses-support