This sounds like our workaround.  Just to make sure I understand, Tom: you 
add your own extra markers to everything, both for alignment and for language 
modeling, so the parallel files look like this (using <ss> and </ss> in place 
of your music symbols):

  <ss> das ist ein kleines haus . </ss>
  <ss> this is a small house . </ss>

and the language modeling files look like this:

  <s> <ss> I like ice cream . </ss> </s>

And finally the input to the decoder looks like this:

  <ss> alles klar , herr kommissar ? </ss>

So then Moses adds its own markers internally to the input, and blithely treats 
<ss> and </ss> as ordinary tokens.  Is this your setup?  We did find that our 
markers occasionally get inserted mid-sentence, and a post-processing step is 
necessary to remove them.  Just out of curiosity, does that ever happen for you?
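
For concreteness, here is a rough Python sketch of the setup as I understand it.
The helper names (wrap, wrap_for_lm, strip_markers) are made up for this email
and are not part of Moses or any LM toolkit. It assumes whitespace-tokenized
text and uses <ss> and </ss> as stand-ins for the real markers.

```python
# Stand-ins for the custom markers; the real setup uses two obscure characters.
SOS, EOS = "<ss>", "</ss>"


def wrap(sentence: str) -> str:
    """Add the custom markers around one tokenized sentence.

    Used for both sides of the parallel corpus and for the decoder input.
    """
    return f"{SOS} {sentence.strip()} {EOS}"


def wrap_for_lm(sentence: str) -> str:
    """Build an LM training line: custom markers inside the usual <s> </s>."""
    return f"<s> {wrap(sentence)} </s>"


def strip_markers(output: str) -> str:
    """Remove every custom marker from decoder output.

    This also drops markers that ended up mid-sentence.
    """
    return " ".join(tok for tok in output.split() if tok not in (SOS, EOS))


if __name__ == "__main__":
    print(wrap("das ist ein kleines haus ."))
    print(wrap_for_lm("I like ice cream ."))
    print(strip_markers("<ss> all clear , </ss> inspector ? </ss>"))
```

The post-processing step here simply discards every marker token wherever it
appears, which is the simplest way to handle the stray mid-sentence ones.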

Thanks.

- JB

On Feb 12, 2013, at 10:53 , Tom Hoar wrote:

> Based on last year's eos marker discussions, we started using alternate 
> sos/eos markers in both parallel and lm corpora. We settled on two obscure 
> Unicode characters, U+1D179 Musical Symbol Begin Phrase and U+1D17A Musical 
> Symbol End Phrase. As in standard corpus preparation, the parallel corpora 
> do not use <s></s> and the lm corpora do. We've seen significant improvement 
> in results without needing to reorder the placement of <s></s> tags.
> 
>  
> On 2013-02-11 00:43, Kenneth Heafield wrote:
> 
>> On 02/10/13 17:21, John Joseph Morgan wrote:
>>> Hello all, my understanding is that an end-of-sentence marker is inserted 
>>> by the decoder at some point in the decoding process to give the complete 
>>> sentence higher probability than shorter segments of the sentence. Is this 
>>> correct?
>> No.  Inserting the eos marker gives the complete sentence lower 
>> probability.  p(</s> | foo bar .) < 1.  It's inserted to model the end 
>> of sentence.
>> 
>>> If so, can the decoder be configured to not insert the eos marker? srilm's 
>>> ngram-count has a -no-eos option, is there a similar option for the decoder?
>> There is no command line option to disable </s>.
>>> What are the relevant files where this is coded?
>> For phrase-based KenLM, moses/LM/Ken.cpp:255.  For phrase-based with 
>> other lms, moses/LM/Implementation.cpp near 171.  For syntax, see 
>> moses/Sentence.cpp near 187 but beware that </s> controls when the glue 
>> rule applies.
>> 
>>> Thanks, John
>>> _______________________________________________
>>> Moses-support mailing list
>>> [email protected]
>>> http://mailman.mit.edu/mailman/listinfo/moses-support


