Here's another perspective. What should be translated as a "sentence" 
during production depends on the training data and tuning set that 
created the model. I like Ulrich's input. The period (like the question 
mark, exclamation mark, etc.) is just a token. The newline marker tells 
moses, "start translating a new job with all the tokens before me."
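
To illustrate the one-job-per-line idea, here is a minimal Python 
sketch. It is not the Moses sentence splitter: the regex and the 
function names are my own assumptions, and it skips tokenization. It 
only shows the shape of the input, where each line is one translation 
job and the punctuation stays inside the line.

```python
import re


def split_into_segments(text):
    """Naively split text after '.', '?' or '!' followed by whitespace.

    Hypothetical stand-in for a real sentence splitter; it will break
    on abbreviations such as "e.g." and is for illustration only.
    """
    parts = re.split(r"(?<=[.?!])\s+", text.strip())
    return [p for p in parts if p]


def to_decoder_input(text):
    """Write one segment per line, each terminated by LF."""
    return "".join(seg + "\n" for seg in split_into_segments(text))


if __name__ == "__main__":
    print(to_decoder_input("Hello there. How are you? Fine!"), end="")
```

If your model was tuned on some other unit, such as phrases, you would 
swap the splitter for one that produces that unit and keep the 
one-unit-per-line output.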

Let's say you train your translation model with a parallel corpus broken 
down into paired part-of-speech phrases (noun phrases, verb phrases, 
object phrases, etc.). Then build your language model using the target 
half of the part-of-speech corpus. Finally, tune your SMT model using 
this TM/LM pair and a tuning set with part-of-speech pairs. Your 
translation production input should also be broken into those same 
part-of-speech phrases to achieve optimal results. With such a model, 
you will get degraded results if you translate a complete sentence or a 
paragraph (multiple sentences).

Here's a modified approach. Train a translation model with the same 
part-of-speech parallel corpus. Then, use a different version of the 
target language corpus with complete sentences (i.e. broken by sentence 
breaks like full-stops, question marks, etc.). Next, tune your SMT model 
with a tuning set of paired complete sentences that match the LM's 
breaks. The tuning process optimizes performance for that type of input. 
Therefore, your optimized translation results will mirror the LM corpus 
and matched tuning set. You will get degraded results if you translate 
part-of-speech phrases, multiple sentences, or complete paragraphs.

We call these "things" models because they're supposed to be a miniature 
representation of a larger universe. So, you'll always get the best 
results when your production input matches the input side of your tuning 
set.

Re newline markers, I think Ulrich's "Mac: CR" is for the legacy Mac OS. 
The current OS X uses Posix/Linux LF. We have not tested our 
cross-platform updates with the older Mac CR and I suspect it will not 
work. So I suggest using either CRLF or LF, which we have used 
extensively across Windows and Posix systems.
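
If your input might contain the legacy CR breaks, one option is to 
normalize everything to LF before it reaches the decoder. A minimal 
Python sketch, assuming the whole input fits in memory (the function 
name is hypothetical, not part of Moses):

```python
def normalize_newlines(data: bytes) -> bytes:
    """Convert CRLF and bare CR line endings to LF.

    CRLF is replaced first so that the bare-CR pass does not turn one
    Windows line break into two LF breaks.
    """
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")


if __name__ == "__main__":
    mixed = b"first line\r\nsecond line\rthird line\n"
    print(normalize_newlines(mixed).decode("utf-8"), end="")
```

Working on bytes keeps the step independent of the text encoding, 
since CR and LF are single bytes in UTF-8 and other ASCII-compatible 
encodings.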

Tom


On 12/5/2015 6:13 AM, [email protected] wrote:
> Date: Fri, 4 Dec 2015 23:13:10 +0000
> From: Ulrich Germann<[email protected]>
> Subject: Re: [Moses-support] decoder question
> To: Vincent Nguyen<[email protected]>
> Cc: moses-support<[email protected]>
>
> Hi Vincent,
>
> as far as Moses is concerned, the end of a sentence is marked by whatever
> the end-of-line marker is on the respective OS (Win: CRLF, Linux: LF, Mac:
> CR, apparently). A period is treated as a plain old token. The purpose of
> the sentence splitter that Kenneth mentioned is to tell Moses what the
> "sentence" boundaries are.
>
> The language model has a concept of sentences beginning and ending and
> usually doesn't like periods anywhere except at the end of a sentence, so
> it'll down-vote translation hypotheses containing isolated periods.
>
> - Uli

_______________________________________________
Moses-support mailing list
[email protected]
http://mailman.mit.edu/mailman/listinfo/moses-support
